Graph backend + retrieval #49

Merged: johnsonr merged 22 commits into embabel:main from jimador:feat/graph-backend-and-retrieval on Jun 27, 2026
jimador commented Jun 19, 2026

Collaborator

Graph backend + retrieval

Third of the four-PR stack, and the largest — the E2E harness and the report / ingestion modules
are coupled to graph-query, so the whole retrieval layer ships together. Stacked on #48 — review
after #47 and #48 merge; until then the diff includes their commits.

Carves out three modules: dice-report, dice-ingestion, dice-integration-tests.

What's in it

  • Make graph projection traceable and self-healing (#40). Every projected edge carries lineage
    back to its source propositions and their authority tier, a stale cascade invalidates dependent
    projections when a source changes, and each projection reports a success / skip / fail outcome.
  • Let the graph explain itself (#41). Three output projectors: a rationale citing evidence, a
    structured report over a set of propositions, and a SemanticLinkDiscoverer that surfaces
    non-obvious multi-hop connections direct queries would miss.
  • Get source material in, and keep it in a durable graph (#42). An ingestion SPI (handlers →
    chunks, with a content-hash ledger that dedups before extraction), a reconciler that looks up
    existing graph artifacts before creating new ones, and a Neo4j adapter on embabel-agent's RAG
    NamedEntityDataRepository.
  • Ask the graph questions, and let agents ask too (#43). A store-agnostic graph query API
    (neighborhood / path / lineage) behind a capability interface, exposed as @LlmTool agent tools.
    Queries respect edge authority, so low-trust edges can be filtered at query time.
  • Route each question to the right retrieval mode (#44). A router that picks or combines
    retrieval modes (vector, entity, graph-walk, temporal, hybrid) behind one entry point, exposed
    over REST and as agent tools, with DTOs keeping internal types off the wire.

Resolves pre-existing design issues

Related but not closed here: #7 (extraction context metadata).

Closes #40
Closes #41
Closes #42
Closes #43
Closes #44
Closes #18
Closes #1
Closes #32

jimador requested review from jasperblues and johnsonr June 19, 2026 12:28
jimador force-pushed the feat/graph-backend-and-retrieval branch 8 times, most recently from 54946d7 to 78d9acc, June 22, 2026 16:45
jimador force-pushed the feat/graph-backend-and-retrieval branch from f4f845b to e923dd0, June 22, 2026 19:33
jimador force-pushed the feat/graph-backend-and-retrieval branch 3 times, most recently from 384734d to ecae4ef, June 24, 2026 06:10
jimador added 13 commits June 24, 2026 02:26
…uthority

Makes projection to a graph observable and traceable, and carries source
authority through to the projected edges.

- GraphProjectionService and RelationBasedGraphProjector report why a projection
  was skipped or failed instead of failing silently; Projection carries structured
  failure reasons
- ProjectionRecord lineage records what projected where; ProjectionLineageStaleCascade
  cascades a stale mark to everything a stale proposition projected
- ProjectionPolicySupport carries a proposition's source authority onto the
  projected edge

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
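
The lineage and stale cascade described in this commit can be pictured with a minimal in-memory sketch. All names here (`LineageRecord`, `Outcome`, `InMemoryLineage`) are hypothetical stand-ins, not the actual ProjectionRecord or ProjectionLineageStaleCascade types, and the sketch is not thread-safe.

```kotlin
// Hypothetical shapes for illustration; the real ProjectionRecord SPI may differ.
enum class Outcome { PROJECTED, SKIPPED, FAILED }

data class LineageRecord(
    val propositionId: String,
    val targetRef: String,        // the graph artifact this proposition produced
    val outcome: Outcome,
    val reason: String? = null,   // why a projection was skipped or failed
    val stale: Boolean = false,
)

class InMemoryLineage {
    private val records = mutableListOf<LineageRecord>()

    fun record(r: LineageRecord) {
        records += r
    }

    // Marks every record produced by the proposition as stale and
    // returns how many records actually changed state.
    fun markStaleByProposition(propositionId: String): Int {
        var changed = 0
        for (i in records.indices) {
            val r = records[i]
            if (r.propositionId == propositionId && !r.stale) {
                records[i] = r.copy(stale = true)
                changed++
            }
        }
        return changed
    }

    // Everything a stale proposition projected, in insertion order.
    fun staleTargets(): List<String> = records.filter { it.stale }.map { it.targetRef }
}
```

Calling `markStaleByProposition` twice for the same proposition reports zero changes the second time, which is the property that keeps a cascade from double-counting transitions.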
…ctors

Adds projectors that turn propositions into output for people rather than graphs.

- RationaleProjector / LlmRationaleProjector explain why a conclusion holds
- ReportProjector / StructuredReportProjector assemble a structured report
- SemanticLink / SemanticLinkDiscoverer surface non-obvious two-hop connections

Also fills out the KDoc on AlwaysCreateEntityResolver and EscalatingEntityResolver
so the default resolver behavior is documented at the call site.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…estion

Wires the projection-lineage SPI to a real repository, adds a reference Neo4j
adapter, and gives DICE one front door for getting source material in.

- RepositoryBackedReconciler decides whether to create or reuse a graph artifact
  by looking it up first, instead of always creating, which keeps re-projection
  from duplicating nodes; comes with a seeded-graph integration proof
- Neo4jRagPropositionRepository is a reference store backed by the RAG entity
  store, declaring only the capability fragments it honestly supports
- ingestion SPI: IngestionHandler / TextIngestionHandler turn artifacts into
  chunks; IngestionLedger dedups by content hash; IngestionResult reports per batch
- Testcontainers test-scope deps and a Docker Engine api.version pin for the proof

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
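
As a rough illustration of deduplicating by content hash before extraction, the following sketch keeps a set of SHA-256 digests. `ContentHashLedger` and `dedup` are hypothetical names, and the choice of SHA-256 is an assumption; the real IngestionLedger SPI is not shown in this PR text.

```kotlin
import java.security.MessageDigest

// Hypothetical ledger for illustration; the real IngestionLedger may differ.
class ContentHashLedger {
    private val seen = mutableSetOf<String>()

    // Returns true the first time this content is seen, false on a repeat.
    fun markIfNew(content: String): Boolean = seen.add(sha256(content))

    // Hex-encoded SHA-256 of the UTF-8 bytes (hash choice is an assumption).
    private fun sha256(text: String): String =
        MessageDigest.getInstance("SHA-256")
            .digest(text.toByteArray(Charsets.UTF_8))
            .joinToString("") { "%02x".format(it) }
}

// Keeps only chunks whose content has not been ingested before.
fun dedup(chunks: List<String>, ledger: ContentHashLedger): List<String> =
    chunks.filter { ledger.markIfNew(it) }
```

Because the ledger is consulted before extraction, a repeated chunk never reaches the (comparatively expensive) extraction step.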
Adds the read side over the graph: ask for an entity's neighborhood, the path
between two entities, or why a proposition is believed.

- GraphQuery with GraphNeighborhood, GraphPath, and PropositionLineage
- GraphQueryCapable is the store fragment that answers these, and can filter by
  the source authority carried on each edge
- GraphQueryTools exposes the queries as agent tools

Includes the canonical-flow harness that runs the whole extract -> resolve ->
project -> query path end to end without an LLM or a database.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
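
A neighborhood query that filters by edge authority might look like the sketch below. `Tier`, `Edge`, and `InMemoryGraph` are hypothetical, the tier names are assumed, and only outgoing edges are followed; the actual GraphQuery and GraphQueryCapable signatures may differ.

```kotlin
// Hypothetical tiers and edge model; the real AuthorityTier and graph types may differ.
enum class Tier { LOW, MEDIUM, HIGH }

data class Edge(val source: String, val type: String, val target: String, val authority: Tier)

class InMemoryGraph(private val edges: List<Edge>) {
    // Entities reachable from `start` within `depth` hops, following only
    // outgoing edges whose authority is at least `minAuthority`.
    fun neighborhood(start: String, depth: Int, minAuthority: Tier = Tier.LOW): Set<String> {
        val visited = mutableSetOf(start)
        var frontier = setOf(start)
        repeat(depth) {
            frontier = edges
                .filter { it.authority >= minAuthority && it.source in frontier }
                .map { it.target }
                .filter { visited.add(it) }   // keep only newly discovered entities
                .toSet()
        }
        return visited - start
    }
}
```

Raising `minAuthority` prunes low-trust edges during traversal, so anything reachable only through such an edge drops out of the result.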
…tools

Adds a router that picks how to retrieve for a given query, with REST and agent
surfaces over it.

- RetrievalMode and RetrievalRouter route a query to the right strategy, including
  a hybrid mode that combines vector and graph
- DiscoveryQuery / DiscoveryDtos, DiscoveryController, and DiscoveryTools expose
  the router over REST and as agent tools, registered through DiceRestConfiguration

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
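
One way to picture a router over several retrieval modes is sketched here. The mode names follow the PR description; `Retriever`, `Router`, and the merge strategy for hybrid (concatenate, dedupe, truncate) are assumptions made for illustration, and the logging of degraded modes is omitted.

```kotlin
// Hypothetical router; only the mode names come from the PR description.
enum class Mode { VECTOR, ENTITY, GRAPH_WALK, TEMPORAL, HYBRID }

fun interface Retriever {
    fun retrieve(query: String, topK: Int): List<String>
}

class Router(private val retrievers: Map<Mode, Retriever>) {
    // Runs the requested mode. HYBRID merges vector and graph-walk results;
    // a mode with no registered retriever degrades to an empty result.
    fun route(mode: Mode, query: String, topK: Int): List<String> {
        val modes = if (mode == Mode.HYBRID) listOf(Mode.VECTOR, Mode.GRAPH_WALK) else listOf(mode)
        return modes
            .flatMap { retrievers[it]?.retrieve(query, topK) ?: emptyList() }
            .distinct()
            .take(topK)
    }
}
```

Keeping every mode behind one `route` call means REST and agent-tool surfaces can share the same entry point without knowing which stores back which mode.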
…ests modules

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
… cross-module refs

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
DICE shipped only in-memory implementations of ProjectionRecordStore and CollectorRecordStore.
Add Drivine-backed implementations that persist projection lineage and the
collector audit trail as graph nodes, so they survive a restart and stay
queryable. The graph-backed projection store also implements a real
markStaleByProposition (the SPI default is a no-op), keeping the lifecycle
cascade working against a durable store.

Both are wired through DiceStorageAutoConfiguration on the existing
embabel.dice.store.type=graph flip, default to in-memory otherwise, and are
ConditionalOnMissingBean so an application's own bean wins. Reads and writes
use parameterized Cypher (MERGE on the natural key for idempotent upserts);
row mapping is extracted so it can be unit-tested without a database.

Covered by a Neo4j integration test and row-mapper unit tests.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
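
Extracting row mapping into a pure function is what makes it testable without a database. The sketch below assumes a result row arrives as a plain `Map<String, Any?>`; `ProjectionRow`, `mapRow`, and the column names are hypothetical and do not reflect the module's actual LineageRowMappers or any Drivine API. Returning `null` for a malformed row mirrors the skip-on-corrupt-row behaviour added in a later review commit in this PR (logging omitted).

```kotlin
// Hypothetical row shape and record type for illustration only.
data class ProjectionRow(val propositionId: String, val targetRef: String, val stale: Boolean)

// Maps one raw result row; returns null for a corrupt row instead of throwing.
fun mapRow(row: Map<String, Any?>): ProjectionRow? {
    val propositionId = row["propositionId"] as? String ?: return null
    val targetRef = row["targetRef"] as? String ?: return null
    val stale = row["stale"] as? Boolean ?: false   // missing flag treated as not stale
    return ProjectionRow(propositionId, targetRef, stale)
}

// Maps a whole result set, skipping rows that fail to map.
fun mapRows(rows: List<Map<String, Any?>>): List<ProjectionRow> = rows.mapNotNull(::mapRow)
```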
Add docs/design/graph-projection.md (lineage, named outcomes, the stale cascade,
idempotent ingestion/reconciliation, and reaching the graph through a port) and
docs/design/retrieval-and-discovery.md (store-agnostic graph queries, query-time
authority filtering, one router over many retrieval modes, DTO/context isolation,
anchorless serendipitous links, and explainability). Add module AGENTS.md for
dice-report, dice-ingestion, and dice-integration-tests, and list them in the
root guide.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
… and report paths

Wire SLF4J loggers through the retrieval router, graph/lineage stores, report
projectors, and ingestion so a consuming application can see the decision and
persistence paths:

- retrieval: log routing mode/topK/depth, result counts, and per-mode
  degradation when a capability is unsupported
- report: log rationale projection and structured-report counts, and
  semantic-link discovery sizes
- lineage/storage: log stale-cascade and reconciliation outcomes, Drivine
  collector/projection record writes, and auto-configuration wiring
- ingestion: log batch start/finish summaries, dedup hits, and extraction
  failures

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…bel.dice.spi

Update this branch's new code (graph projectors, graph/discovery query types,
the ingestion artifact model, and their tests) to import the policy SPIs
(TrustScorer, AuthorityResolver/AuthorityTier and friends, ConflictType) from
the com.embabel.dice.spi package they now live in.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…uides

The package map in dice/AGENTS.md skipped query.graph (GraphQuery,
GraphNeighborhood, GraphPath, PropositionLineage) and query.discovery
(RetrievalRouter, DiscoveryQuery, RetrievalMode) — public API this branch adds —
so an agent navigating by the map could not find them. Add both rows and mention
graph/discovery retrieval in the root module description.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
The sweep-policy types (MarkReason, PropositionMark, StatusTransitionSweepPolicy)
live in com.embabel.dice.spi alongside the other lifecycle policies. The storage
row mappers, the discovery DTO-leak gate, and the canonical-flow integration tests
still pointed at the old projection.memory package; repoint them so every module
compiles against the policy SPI.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Add a design note covering the persistence mechanics no existing note explained —
backend selection, defense-in-depth dedup, the two-phase save, materialised
effective confidence, schema-as-beans, and the scheduled decay tick — with diagrams.
Give dice-storage-autoconfigure its own AGENTS.md (the only module that lacked one),
and link graph-projection, retrieval-and-discovery, and durable-storage from the
README design-notes index and the root navigation guide.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
jimador force-pushed the feat/graph-backend-and-retrieval branch from ecae4ef to d4ebcbf, June 24, 2026 06:29
jimador added 4 commits June 24, 2026 02:49
Verified each reviewer finding against the code before acting (two were
false positives and left as-is: a Kotlin self-initializer scoping claim,
and the intentional, test-covered RELATED_TO fallback).

- LlmGraphProjector: pick the source/target mention by the LLM's span
  first, falling back to role only when no span matches. The combined
  `span || role` find let an earlier role-matching mention win over the
  mention the span actually named, producing wrong-direction edges.
- GraphProjectionService: isolate each lineage-record write so a flaky
  record store can't drop the trail for every remaining result after a
  mid-batch failure.
- GraphQuery.whyExplain: honor the context scope on the global findById
  path so a context-bound query can't return foreign-context lineage.
- GraphQueryCapable: the authority-aware overloads now throw when a
  backend sets honorsAuthorityFilter but doesn't override them, instead
  of silently returning unfiltered results.
- InMemoryProjectionRecordStore: make the stale check-and-set atomic so
  concurrent calls don't double-count transitions.
- Drivine projection/collector stores: skip and log a corrupt row rather
  than failing the whole all() query on one bad node.
- RetrievalRouter.graphPath: log cross-context paths that are dropped so
  an empty result is distinguishable from a disconnected graph.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
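
The LlmGraphProjector fix above is easiest to see side by side. In this sketch, `Mention` and both functions are hypothetical; the point is only the difference between a combined `span || role` predicate and a span-first lookup with a role fallback.

```kotlin
// Hypothetical mention model; field names are illustrative only.
data class Mention(val span: String, val role: String)

// Buggy shape: the first mention matching span OR role wins, so an earlier
// role match can shadow the mention the span actually names.
fun pickCombined(mentions: List<Mention>, span: String, role: String): Mention? =
    mentions.find { it.span == span || it.role == role }

// Fixed shape: prefer an exact span match, and fall back to role
// only when no mention matches the span.
fun pickSpanFirst(mentions: List<Mention>, span: String, role: String): Mention? =
    mentions.find { it.span == span } ?: mentions.find { it.role == role }
```

With mentions `[("Acme", SUBJECT), ("Bob", OBJECT)]` and a request for span "Bob" with role SUBJECT, the combined predicate returns the Acme mention while the span-first version returns Bob, which is the wrong-direction edge the commit describes.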
…e SPIs

Second adversarial-review pass on the graph + retrieval branch.

- Drivine projection/collector stores: every findBy* now pushes its
  predicate into Cypher instead of loading the whole table and filtering
  in memory, so a single-key lookup no longer scans the entire lineage.
  Added Neo4j integration tests asserting each finder returns only its
  matching subset.
- GraphProjectionService: reconcile against the pre-persist graph state
  (a repository-backed reconciler consulted after the write would always
  see the node and never record PROJECTED), and reference the produced
  edge (source-[type]->target) as the lineage targetRef rather than just
  the source node so findByTargetRef resolves to the specific edge.
- MarkReason.Custom: reject the reserved stale/duplicate keys (and blanks)
  at construction so a Custom can't round-trip back as a built-in reason.
- Projection rejection messages quote the policy's actual confidence
  threshold instead of a hardcoded constant.
- DiscoveryQuery exposes a caller-set similarityThreshold (default 0.0,
  clamped) threaded into the vector and hybrid search requests.
- Discovery DTO leak test now rejects any raw com.embabel.dice.proposition
  type, catching an accidentally-exposed enum, not just the exact FQNs.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
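
Rejecting reserved keys at construction, as described for MarkReason.Custom, could be sketched as follows. `Reason`, `reasonFromKey`, and the literal key strings "stale" and "duplicate" are assumptions for illustration; the real MarkReason type in com.embabel.dice.spi may be shaped differently.

```kotlin
// Hypothetical sealed hierarchy; key strings are assumed, not taken from the code.
sealed interface Reason {
    val key: String

    object Stale : Reason { override val key = "stale" }
    object Duplicate : Reason { override val key = "duplicate" }

    data class Custom(override val key: String) : Reason {
        init {
            require(key.isNotBlank()) { "custom reason key must not be blank" }
            require(key.lowercase() !in RESERVED) { "'$key' is reserved for a built-in reason" }
        }

        private companion object {
            val RESERVED = setOf("stale", "duplicate")
        }
    }
}

// Restores a reason from its stored key; built-in keys map to built-in reasons.
fun reasonFromKey(key: String): Reason = when (key) {
    "stale" -> Reason.Stale
    "duplicate" -> Reason.Duplicate
    else -> Reason.Custom(key)
}
```

Since a `Custom` can never hold a reserved key, writing a reason's key to storage and reading it back through `reasonFromKey` cannot turn a custom reason into a built-in one.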
Make the architecture legible without reading the code, and close the
navigation gaps the review found.

- New docs/design/architecture.md: a top-level system overview tying the
  subsystems together (store + trust, extraction, maintenance, projection,
  query/retrieval/discovery, report) with system, store-SPI, maintenance,
  retrieval, expose-layer, and graph-schema diagrams.
- Enriched the per-subsystem design notes with sequence, class, state, and
  flow diagrams so each communicates intent visually (55 diagrams total,
  all parse-validated).
- AGENTS.md navigation: add GraphQueryCapable to the capability fragments,
  DiscoveryController to web.rest, DiscoveryTools/GraphQueryTools to agent,
  and the Drivine projection/collector record stores + LineageRowMappers
  to the storage module guide.
- proposition-lifecycle: add the pinning primitive; graph-projection:
  document the three-way reconciliation decision; fix a mermaid label.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…scoped read

Projection-health aggregated lineage across every context, leaking other
contexts' projection activity into a context-scoped endpoint/tool.

- ProjectionRecord carries the context the proposition belongs to.
- ProjectionRecordStore gains findByContext; the REST endpoint and agent
  tool summarize health from findByContext(contextId), not all().
- The durable Drivine store implements findByContext with scoped Cypher,
  and the in-memory store filters its backing list directly, so no
  implementation loads the whole table to answer a scoped read. The
  all()-based SPI defaults are documented as a trivial-store fallback that
  durable stores MUST override.
- Added a Neo4j integration test asserting findByContext returns only the
  requested context's records.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
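
The scoped-read change can be sketched with a small store interface. `HealthRecord`, `RecordStore`, and `health` are hypothetical simplifications of ProjectionRecord, ProjectionRecordStore, and the projection-health summary; the real SPI has more members.

```kotlin
// Hypothetical, simplified SPI for illustration.
data class HealthRecord(val contextId: String, val propositionId: String, val failed: Boolean)

interface RecordStore {
    fun all(): List<HealthRecord>

    // Trivial-store fallback: scans everything. A durable store should
    // override this and push the predicate into its query instead.
    fun findByContext(contextId: String): List<HealthRecord> =
        all().filter { it.contextId == contextId }
}

class InMemoryRecordStore(private val records: List<HealthRecord>) : RecordStore {
    override fun all(): List<HealthRecord> = records

    // Filters the backing list directly rather than going through all().
    override fun findByContext(contextId: String): List<HealthRecord> =
        records.filter { it.contextId == contextId }
}

// Health summary for one context only, as (total, failed).
fun health(store: RecordStore, contextId: String): Pair<Int, Int> {
    val scoped = store.findByContext(contextId)
    return scoped.size to scoped.count { it.failed }
}
```

Summarizing from `findByContext` rather than `all()` is what keeps another context's failures out of a context-scoped answer.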
jimador marked this pull request as ready for review June 24, 2026 07:15
jimador added 4 commits June 25, 2026 11:56
…bel#32)

The extraction pipeline never filled DICE's provenance model, so every
proposition left with provenanceEntries = [] — nothing downstream could trace a
fact to its source by URI, file, content hash, or chunk. Populate it where each
layer actually knows the source:

- PropositionPipeline.processChunk stamps every proposition with a
  ProvenanceEntry(chunkId, contentHash). The locator is the caller's
  SourceAnalysisContext.sourceLocator when set, else a ContentAddressedLocator
  over the chunk text — always available and honest about what grounds the fact.
  Stamped before revision, so merges union it and a new proposition keeps it.
- SourceAnalysisContext gains an optional sourceLocator + withSourceLocator.
- LlmPropositionReviser unions provenanceEntries on merge/reinforce (deduped),
  mirroring the existing grounding union.
- PropositionDto exposes provenance via a slim ProvenanceEntryDto.
- TextIngestionHandler sets the locator on the context and lets the pipeline
  stamp once, instead of stamping by hand after extraction — provenance now has
  a single owner across every ingestion path.

grounding: List<String> is unchanged (backward compatible).

Tests:
- PropositionPipelineTest.ProvenanceStampingTests + ProvenanceRevisionIntegrationTests
- PropositionReviserTest.ProvenanceUnionTests (real reviser, canonical-match merge)
- ProvenancePopulationE2ETest (dice-integration-tests): artifact -> handler ->
  pipeline -> store -> read-back -> REST DTO, batch multi-source, no-locator fallback

Closes embabel#32

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
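
A minimal sketch of stamp-once provenance with a deduplicated union on merge follows. The data shapes, the `stamp` and `merge` helpers, and the `content:` locator format are all assumptions for illustration; the real ProvenanceEntry, ContentAddressedLocator, and LlmPropositionReviser are not reproduced here.

```kotlin
// Hypothetical shapes modelled on the commit text; field names are assumptions.
data class ProvenanceEntry(val locator: String, val chunkId: String, val contentHash: String)

data class Proposition(val text: String, val provenance: List<ProvenanceEntry> = emptyList())

// Stamps a proposition with its source. Uses the caller's locator when set,
// otherwise a content-addressed one derived from the chunk hash
// (the "content:" prefix is made up for this sketch).
fun stamp(
    p: Proposition,
    chunkId: String,
    contentHash: String,
    sourceLocator: String? = null,
): Proposition {
    val locator = sourceLocator ?: "content:$contentHash"
    return p.copy(provenance = p.provenance + ProvenanceEntry(locator, chunkId, contentHash))
}

// Merges two propositions, keeping the existing text and the
// deduplicated union of both provenance lists.
fun merge(existing: Proposition, incoming: Proposition): Proposition =
    existing.copy(provenance = (existing.provenance + incoming.provenance).distinct())
```

Stamping before the merge step means a merged proposition accumulates every distinct source, while merging the same source twice leaves the list unchanged.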
Record and fix the four Codex review findings around discovery wiring, projection lineage
persistence, relationship reconciliation, and context-scoped graph queries.

Tests: ./mvnw -pl dice test; ./mvnw verify
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…est gaps

A partial-failure batch the persister can't attribute per edge no longer paints a
succeeded edge as FAILED: persistenceFailed now decides precisely when per-edge refs
are present, and otherwise only fails an item when the whole batch failed. The shipped
persister always reports refs, so production behaviour is unchanged; this fixes a custom
or non-reporting persister mislabeling successes (and the null-relationship edge case).

Tests:
- GraphProjectionServiceLineageTest: a partial-failure batch marks only the attributed
  edge FAILED, and an unattributable partial failure marks no succeeded edge FAILED.
- SeededGraphNoDuplicateNodesIT: findRelated now answers from the live container and a
  second projection of the same proposition is asserted ADOPTED, exercising the
  reconciler's adopt path against a real graph instead of an always-empty mock.
- PropositionReviserTest: the reviser's reinforce path unions provenance (was only
  covered for merge).
- ProvenancePopulationE2ETest: assert exactly one provenance entry, catching a
  double-stamp regression the prior any{} check would miss.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
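
The attribution rule in this commit can be expressed as a small decision function. `PersistResult` and its fields are hypothetical, since the persister's actual result type is not shown in this PR text; only the rule itself (precise when per-edge refs exist, whole-batch otherwise) comes from the commit message.

```kotlin
// Hypothetical result type for illustration.
data class PersistResult(
    val attempted: Int,
    val persisted: Int,
    val failedRefs: Set<String>? = null,   // null when failures cannot be attributed per edge
)

// Decides whether one edge should be recorded as FAILED.
fun persistenceFailed(edgeRef: String, result: PersistResult): Boolean =
    if (result.failedRefs != null) {
        // Per-edge attribution is available: fail exactly the named edges.
        edgeRef in result.failedRefs
    } else {
        // No attribution: fail an edge only if the whole batch failed.
        result.attempted > 0 && result.persisted == 0
    }
```

Under this rule an unattributable partial failure marks no edge as failed, which avoids labelling a successful write as FAILED at the cost of under-reporting the one that did fail.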
johnsonr merged commit f348ca1 into embabel:main Jun 27, 2026
4 checks passed