Skip to content

SepRAG: CCH-inspired retrieval exploration + customizable re-weighting for self-learning ANN #534

Description

@shaal

Summary

Exploration of CCH-inspired hierarchical retrieval ("SepRAG") for RuVector's hybrid
vector + knowledge-graph memory, run as a disciplined, evidence-gated research thread. It
produced one negative result (recorded honestly) and one proven, shippable result.

This is a tracking/epic issue. The first deliverable (ADRs 196–200 + a reference crate +
six experiment harnesses) is in the companion PR. The remaining bets stay open below.

Measured impact (proven, with scope)

A new capability — adapt the ANN index to a changed relevance metric by re-weighting
the existing graph instead of rebuilding it. It does not change query latency or
absolute recall of existing search; it removes the rebuild cost a metric change otherwise
incurs.

  • Update cost (metric change), reference Vamana: 141.8s → 0.035s at n=100k (~4,000×
    lower); ratio ~1,000–4,000× across n=5k–100k (rebuild is super-linear; re-weight ≈ a
    medoid recompute).
  • Production ruvector-diskann Vamana (n=20k): rebuild ~7s → re-weight ~0.01s, recall
    within 2% of rebuild (96–99% absolute).
  • Recall preserved: within 2% of full rebuild across diagonal / rotational / non-linear
    / region-local / compounding drift, up to n=100k.
  • Hybrid operating point: re-weight every step + rebuild every ~4 → 98.8% vs 99.1%
    (always) / 94.4% (never), at 25% of rebuild cost.
  • Not yet proven: drift is synthetic (parametric), not a live ruvector-gnn
    learned-metric trajectory; production-loop integration is backlog item Implement Ruvector high-performance vector database #1. No query-latency
    or recall improvement to existing search.

Motivation

Customizable Contraction Hierarchies (CCH) give ~35,000× speedups on road networks by
pre-processing a separator hierarchy. The question: can the same machinery (nested
dissection, separators, contraction shortcuts, elimination trees, separator-tree k-NN, and
metric-independent + customizable preprocessing) give large search-space reduction and
cheap self-learning re-weighting for embedding + KG retrieval?

Findings

❌ CCH full-contraction does NOT fit embedding retrieval (NO-GO, measured)

Validated implementation (ruvector-seprag, separator-tree k-NN == Dijkstra oracle on
toy + real graphs), then measured on real public data:

Backbone (n=1500) blowup |G+|/|G_nav| elim-tree height
roadNet-PA (control) 7.6× ~3.5·√n
ogbn-arxiv citation 23.8× ~0.6·n
ogbn-arxiv feature kNN 42.4× ~0.56·n

The road control proves the implementation is sound; both embedding-derived backbones are
intrinsically high-treewidth, so contraction blows up. HNSW already owns embedding kNN.
(ADR-199.)

✅ Customizable re-weighting (the salvaged idea) — proven WIN

Decoupled from CCH: in a self-learning store the metric drifts; instead of rebuilding the
ANN index, reuse the fixed topology and re-score under the new metric. Validated on the
production ruvector-diskann Vamana:

  • Recall within 2% of a full rebuild across diagonal / rotational / non-linear /
    region-local / compounding
    drift, up to n=100k, at ~1,000–4,000× lower update cost.
  • Hybrid operating policy: re-weight every step + rebuild every ~4 steps → 98.8% recall
    (vs 99.1% always-rebuild, 94.4% never) at 25% of the rebuild cost
    .
  • Honest caveats: the recall gap widens with scale/churn (defer/batch rebuilds, not never);
    a drift-triggered rebuild monitor underperformed simple periodic (needs a better signal).

(ADR-200.)

Deliverables (companion PR)

  • ADR-196…200 (docs/adr/) — the full decision record (one dead end + one win).
  • docs/plans/seprag-cch-retrieval/ — milestone plans, FUTURE-DIRECTIONS.md (backlog +
    "prove-not-hype" protocol).
  • crates/ruvector-seprag/ — reference CCH separator-tree k-NN + shared ANN engine +
    6 experiment harnesses (reweight_vs_rebuild, scale_drift, region_drift,
    diskann_drift, hybrid_policy, blowup_report). Tests + clippy green.

Backlog (open — same prove-not-hype protocol: one claim+number, beat a tuned in-repo

incumbent, public data, pre-registered win/kill gate, adversarial check)

  • Productionize the win — wire "re-weight + periodic rebuild" into the
    ruvector-diskann/ruvector-gnn loop behind a flag, on a real learned-metric trajectory.
  • Smarter rebuild trigger — sampled-recall probe vs the Frobenius monitor.
  • BET 2 — filtered ANN (predicate-constrained) vs ruvector-acorn.
  • BET 3 — multi-hop Graph-RAG on a sparse curated KG (HotpotQA / MuSiQue ground truth).
  • BET 4 — region-pruning query on an IVF/clustering hierarchy (rairs-ivf, treewidth-immune).

Reproduce

Datasets are public (ogbn-arxiv, SNAP roadNet-PA). Each harness prints its pre-registered
gate. See docs/plans/seprag-cch-retrieval/FUTURE-DIRECTIONS.md for the protocol.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions