Graph backend + retrieval by jimador · Pull Request #49 · embabel/dice

jimador · 2026-06-19T12:18:53Z

Graph backend + retrieval

Third of the four-PR stack, and the largest — the E2E harness and the report / ingestion modules
are coupled to graph-query, so the whole retrieval layer ships together. Stacked on #48 — review
after #47 and #48 merge; until then the diff includes their commits.

Carves out three modules: dice-report, dice-ingestion, dice-integration-tests.

What's in it

- Make graph projection traceable and self-healing (#40). Every projected edge carries lineage
  back to its source propositions and their authority tier, a stale-cascade invalidates dependent
  projections when a source changes, and each projection reports a success / skip / fail outcome.
- Let the graph explain itself (#41). Three output projectors: a rationale citing evidence, a
  structured report over a set of propositions, and a SemanticLinkDiscoverer that surfaces
  non-obvious multi-hop connections direct queries would miss.
- Get source material in, and keep it in a durable graph (#42). An ingestion SPI (handlers →
  chunks, a content-hash ledger dedups before extraction), a reconciler that looks up existing graph
  artifacts before creating new ones, and a Neo4j adapter on embabel-agent's RAG
  NamedEntityDataRepository.
- Ask the graph questions, and let agents ask too (#43). A store-agnostic graph query API
  (neighborhood / path / lineage) behind a capability interface, exposed as @LlmTool agent tools.
  Queries respect edge authority, so low-trust edges can be filtered at query time.
- Route each question to the right retrieval mode (#44). A router that picks or combines
  retrieval modes (vector, entity, graph-walk, temporal, hybrid) behind one entry point, exposed
  over REST and as agent tools, with DTOs keeping internal types off the wire.
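The "content-hash ledger dedups before extraction" step can be pictured with a small language-neutral sketch. It is written in Python for brevity and is not the project's Kotlin API: `InMemoryLedger`, `first_sighting`, `ingest`, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class InMemoryLedger:
    """Hypothetical stand-in for an ingestion ledger keyed by content hash."""

    def __init__(self):
        self._seen = set()

    def first_sighting(self, text):
        # SHA-256 is an assumption; the PR only says "content hash".
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


def ingest(chunks, ledger, extract):
    """Run `extract` only on chunks the ledger has not seen before.

    Returns (extraction results, number of chunks skipped as duplicates).
    """
    results, skipped = [], 0
    for chunk in chunks:
        if ledger.first_sighting(chunk):
            results.append(extract(chunk))
        else:
            skipped += 1
    return results, skipped
```

Because the ledger is consulted before `extract` runs, re-ingesting the same material costs a hash and a set lookup rather than another extraction call.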

Resolves pre-existing design issues

- Serendipitous knowledge retrieval via graph traversal (#18). The graph-walk retrieval mode
  (RetrievalRouter / RetrievalMode) plus SemanticLinkDiscoverer surface indirect, multi-hop
  connections that direct lookups miss.
- Consider concept of surprise in memory or proposition extraction (#1). SemanticLinkDiscoverer
  realizes "surprise" as non-obvious links the graph implies but nobody stated directly.

Related but not closed here: #7 (extraction context metadata).

Closes #40
Closes #41
Closes #42
Closes #43
Closes #44
Closes #18
Closes #1
Closes #32

…uthority

Makes projection to a graph observable and traceable, and carries source authority through to the projected edges.

- GraphProjectionService and RelationBasedGraphProjector report why a projection was skipped or failed instead of failing silently; Projection carries structured failure reasons
- ProjectionRecord lineage records what projected where; ProjectionLineageStaleCascade cascades a stale mark to everything a stale proposition projected
- ProjectionPolicySupport carries a proposition's source authority onto the projected edge

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
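A rough illustration of the stale cascade described in this commit, as a language-neutral Python sketch rather than the project's Kotlin code. The class `InMemoryLineage` and its method names are hypothetical; only the idea (lineage records map a proposition to what it projected, and marking the proposition stale marks all of those) comes from the commit message.

```python
from dataclasses import dataclass


@dataclass
class ProjectionRecord:
    proposition_id: str
    target_ref: str       # what the proposition projected (hypothetical string form)
    stale: bool = False


class InMemoryLineage:
    """Hypothetical lineage store: remembers what each proposition projected."""

    def __init__(self):
        self._records = []

    def record(self, proposition_id, target_ref):
        self._records.append(ProjectionRecord(proposition_id, target_ref))

    def mark_stale_by_proposition(self, proposition_id):
        """Cascade a stale mark to everything the proposition projected.

        Returns only the targets that changed state, so a repeated call is a no-op.
        """
        newly_stale = []
        for rec in self._records:
            if rec.proposition_id == proposition_id and not rec.stale:
                rec.stale = True
                newly_stale.append(rec.target_ref)
        return newly_stale

    def live_targets(self):
        return [r.target_ref for r in self._records if not r.stale]
```

Returning only the newly transitioned targets keeps the cascade idempotent: calling it twice for the same proposition reports nothing the second time.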

…ctors

Adds projectors that turn propositions into output for people rather than graphs.

- RationaleProjector / LlmRationaleProjector explain why a conclusion holds
- ReportProjector / StructuredReportProjector assemble a structured report
- SemanticLink / SemanticLinkDiscoverer surface non-obvious two-hop connections

Also fills out the KDoc on AlwaysCreateEntityResolver and EscalatingEntityResolver so the default resolver behavior is documented at the call site.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
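To make "non-obvious two-hop connections" concrete, here is a purely structural Python sketch. It assumes an undirected view of the graph and treats a link as non-obvious when two entities share a neighbor but have no direct edge. The function name `two_hop_links` is hypothetical, and the real SemanticLinkDiscoverer may well apply semantic scoring or other criteria that this sketch does not model.

```python
from collections import defaultdict


def two_hop_links(edges):
    """Return sorted (a, via, c) triples where a and c share neighbor `via`
    but are not directly connected. `edges` is an iterable of (x, y) pairs."""
    neighbors = defaultdict(set)
    for x, y in edges:
        neighbors[x].add(y)
        neighbors[y].add(x)

    links = set()
    for via, adjacent in neighbors.items():
        for a in adjacent:
            for c in adjacent:
                # a < c avoids self-pairs and mirrored duplicates.
                if a < c and c not in neighbors[a]:
                    links.add((a, via, c))
    return sorted(links)
```

A pair that is already directly connected is excluded, which is what separates an implied link from one that a direct lookup would have returned anyway.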

…estion

Wires the projection-lineage SPI to a real repository, adds a reference Neo4j adapter, and gives DICE one front door for getting source material in.

- RepositoryBackedReconciler resolves whether to create or reuse a graph artifact by looking it up, instead of always creating, which is what keeps re-projection from duplicating nodes; comes with a seeded-graph integration proof
- Neo4jRagPropositionRepository is a reference store backed by the RAG entity store, declaring only the capability fragments it honestly supports
- ingestion SPI: IngestionHandler / TextIngestionHandler turn artifacts into chunks; IngestionLedger dedups by content hash; IngestionResult reports per batch
- Testcontainers test-scope deps and a Docker Engine api.version pin for the proof

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

Adds the read side over the graph: ask for an entity's neighborhood, the path between two entities, or why a proposition is believed.

- GraphQuery with GraphNeighborhood, GraphPath, and PropositionLineage
- GraphQueryCapable is the store fragment that answers these, and can filter by the source authority carried on each edge
- GraphQueryTools exposes the queries as agent tools

Includes the canonical-flow harness that runs the whole extract -> resolve -> project -> query path end to end without an LLM or a database.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
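A minimal sketch of a neighborhood query that filters by edge authority at query time, written in Python as a language-neutral illustration. The `Edge` type, the integer `authority` rank (higher meaning more trusted), and the `neighborhood` signature are assumptions; the project's GraphQuery / GraphQueryCapable types are not reproduced here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    authority: int  # hypothetical rank: higher means more trusted


def neighborhood(edges, start, depth, min_authority=0):
    """Breadth-first walk up to `depth` hops, ignoring edges below `min_authority`."""
    trusted = [e for e in edges if e.authority >= min_authority]
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand past the requested depth
        for e in trusted:
            if e.source == node:
                nxt = e.target
            elif e.target == node:
                nxt = e.source
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen - {start}
```

Filtering before the walk, rather than after, means a low-trust edge cannot act as a bridge to nodes that are only reachable through it.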

…tools

Adds a router that picks how to retrieve for a given query, with REST and agent surfaces over it.

- RetrievalMode and RetrievalRouter route a query to the right strategy, including a hybrid mode that combines vector and graph
- DiscoveryQuery / DiscoveryDtos, DiscoveryController, and DiscoveryTools expose the router over REST and as agent tools, registered through DiceRestConfiguration

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
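One way to picture a router with a hybrid mode is the Python sketch below. It is illustrative only: the mode names as strings, the `retrievers` dictionary, and the `route` function are hypothetical, and treating a missing retriever as an unsupported capability that degrades the result is an assumption about the behavior, not a description of RetrievalRouter's actual code.

```python
def route(query, mode, retrievers):
    """Dispatch `query` to one mode, or to vector + graph_walk when mode is "hybrid".

    `retrievers` maps a mode name to a callable(query) -> list of result ids.
    A mode with no retriever is reported as degraded instead of raising.
    Returns (deduplicated results in first-seen order, degraded mode names).
    """
    modes = ["vector", "graph_walk"] if mode == "hybrid" else [mode]
    results, degraded = [], []
    for m in modes:
        retriever = retrievers.get(m)
        if retriever is None:
            degraded.append(m)
            continue
        for hit in retriever(query):
            if hit not in results:
                results.append(hit)
    return results, degraded
```

Reporting degraded modes alongside the results lets a caller tell "nothing matched" apart from "this store cannot answer that kind of query".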

…ests modules Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

… cross-module refs Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

DICE shipped only in-memory ProjectionRecordStore and CollectorRecordStore. Add Drivine-backed implementations that persist projection lineage and the collector audit trail as graph nodes, so they survive a restart and stay queryable. The graph-backed projection store also implements a real markStaleByProposition (the SPI default is a no-op), keeping the lifecycle cascade working against a durable store. Both are wired through DiceStorageAutoConfiguration on the existing embabel.dice.store.type=graph flip, default to in-memory otherwise, and are ConditionalOnMissingBean so an application's own bean wins. Reads and writes use parameterized Cypher (MERGE on the natural key for idempotent upserts); row mapping is extracted so it can be unit-tested without a database. Covered by a Neo4j integration test and row-mapper unit tests. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

Add docs/design/graph-projection.md (lineage, named outcomes, the stale cascade, idempotent ingestion/reconciliation, and reaching the graph through a port) and docs/design/retrieval-and-discovery.md (store-agnostic graph queries, query-time authority filtering, one router over many retrieval modes, DTO/context isolation, anchorless serendipitous links, and explainability). Add module AGENTS.md for dice-report, dice-ingestion, and dice-integration-tests, and list them in the root guide. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

… and report paths

Wire SLF4J loggers through the retrieval router, graph/lineage stores, report projectors, and ingestion so a consuming application can see the decision and persistence paths:

- retrieval: log routing mode/topK/depth, result counts, and per-mode degradation when a capability is unsupported
- report: log rationale projection and structured-report counts, and semantic-link discovery sizes
- lineage/storage: log stale-cascade and reconciliation outcomes, Drivine collector/projection record writes, and auto-configuration wiring
- ingestion: log batch start/finish summaries, dedup hits, and extraction failures

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

…bel.dice.spi Update this branch's new code (graph projectors, graph/discovery query types, the ingestion artifact model, and their tests) to import the policy SPIs (TrustScorer, AuthorityResolver/AuthorityTier and friends, ConflictType) from the com.embabel.dice.spi package they now live in. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

…uides The package map in dice/AGENTS.md skipped query.graph (GraphQuery, GraphNeighborhood, GraphPath, PropositionLineage) and query.discovery (RetrievalRouter, DiscoveryQuery, RetrievalMode) — public API this branch adds — so an agent navigating by the map could not find them. Add both rows and mention graph/discovery retrieval in the root module description. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

The sweep-policy types (MarkReason, PropositionMark, StatusTransitionSweepPolicy) live in com.embabel.dice.spi alongside the other lifecycle policies. The storage row mappers, the discovery DTO-leak gate, and the canonical-flow integration tests still pointed at the old projection.memory package; repoint them so every module compiles against the policy SPI. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

Add a design note covering the persistence mechanics no existing note explained — backend selection, defense-in-depth dedup, the two-phase save, materialised effective confidence, schema-as-beans, and the scheduled decay tick — with diagrams. Give dice-storage-autoconfigure its own AGENTS.md (the only module that lacked one), and link graph-projection, retrieval-and-discovery, and durable-storage from the README design-notes index and the root navigation guide. Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

Verified each reviewer finding against the code before acting (two were false positives and left as-is: a Kotlin self-initializer scoping claim, and the intentional, test-covered RELATED_TO fallback).

- LlmGraphProjector: pick the source/target mention by the LLM's span first, falling back to role only when no span matches. The combined `span || role` find let an earlier role-matching mention win over the mention the span actually named, producing wrong-direction edges.
- GraphProjectionService: isolate each lineage-record write so a flaky record store can't drop the trail for every remaining result after a mid-batch failure.
- GraphQuery.whyExplain: honor the context scope on the global findById path so a context-bound query can't return foreign-context lineage.
- GraphQueryCapable: the authority-aware overloads now throw when a backend sets honorsAuthorityFilter but doesn't override them, instead of silently returning unfiltered results.
- InMemoryProjectionRecordStore: make the stale check-and-set atomic so concurrent calls don't double-count transitions.
- Drivine projection/collector stores: skip and log a corrupt row rather than failing the whole all() query on one bad node.
- RetrievalRouter.graphPath: log cross-context paths that are dropped so an empty result is distinguishable from a disconnected graph.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
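The LlmGraphProjector fix is easy to see side by side. The Python sketch below is a language-neutral reconstruction of the two lookup shapes the commit describes, not the project's Kotlin code: `Mention`, `pick_mention_combined`, and `pick_mention_span_first` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    span: str
    role: str


def pick_mention_combined(mentions, span, role):
    """Pre-fix shape: the first mention matching span OR role wins."""
    return next((m for m in mentions if m.span == span or m.role == role), None)


def pick_mention_span_first(mentions, span, role):
    """Post-fix shape: try the span alone, fall back to role only if no span matches."""
    by_span = next((m for m in mentions if m.span == span), None)
    if by_span is not None:
        return by_span
    return next((m for m in mentions if m.role == role), None)
```

With mentions `[Mention("Bob", "subject"), Mention("Alice", "object")]` and a request for span `"Alice"` with role `"subject"`, the combined lookup returns Bob because his role matches first, while the span-first lookup returns Alice, the mention the span actually named.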

…e SPIs

Second adversarial-review pass on the graph + retrieval branch.

- Drivine projection/collector stores: every findBy* now pushes its predicate into Cypher instead of loading the whole table and filtering in memory, so a single-key lookup no longer scans the entire lineage. Added Neo4j integration tests asserting each finder returns only its matching subset.
- GraphProjectionService: reconcile against the pre-persist graph state (a repository-backed reconciler consulted after the write would always see the node and never record PROJECTED), and reference the produced edge (source-[type]->target) as the lineage targetRef rather than just the source node so findByTargetRef resolves to the specific edge.
- MarkReason.Custom: reject the reserved stale/duplicate keys (and blanks) at construction so a Custom can't round-trip back as a built-in reason.
- Projection rejection messages quote the policy's actual confidence threshold instead of a hardcoded constant.
- DiscoveryQuery exposes a caller-set similarityThreshold (default 0.0, clamped) threaded into the vector and hybrid search requests.
- Discovery DTO leak test now rejects any raw com.embabel.dice.proposition type, catching an accidentally-exposed enum, not just the exact FQNs.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

Make the architecture legible without reading the code, and close the navigation gaps the review found.

- New docs/design/architecture.md: a top-level system overview tying the subsystems together (store + trust, extraction, maintenance, projection, query/retrieval/discovery, report) with system, store-SPI, maintenance, retrieval, expose-layer, and graph-schema diagrams.
- Enriched the per-subsystem design notes with sequence, class, state, and flow diagrams so each communicates intent visually (55 diagrams total, all parse-validated).
- AGENTS.md navigation: add GraphQueryCapable to the capability fragments, DiscoveryController to web.rest, DiscoveryTools/GraphQueryTools to agent, and the Drivine projection/collector record stores + LineageRowMappers to the storage module guide.
- proposition-lifecycle: add the pinning primitive; graph-projection: document the three-way reconciliation decision; fix a mermaid label.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

…scoped read

Projection-health aggregated lineage across every context, leaking other contexts' projection activity into a context-scoped endpoint/tool.

- ProjectionRecord carries the context the proposition belongs to.
- ProjectionRecordStore gains findByContext; the REST endpoint and agent tool summarize health from findByContext(contextId), not all().
- The durable Drivine store implements findByContext with scoped Cypher, and the in-memory store filters its backing list directly, so no implementation loads the whole table to answer a scoped read. The all()-based SPI defaults are documented as a trivial-store fallback that durable stores MUST override.
- Added a Neo4j integration test asserting findByContext returns only the requested context's records.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
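The leak and its fix can be sketched in a few lines of Python. This is a language-neutral illustration under stated assumptions: the record fields, `InMemoryProjectionRecordStore` as written here, and `projection_health` are hypothetical simplifications, not the project's Kotlin types.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectionRecord:
    context_id: str
    proposition_id: str
    outcome: str


class InMemoryProjectionRecordStore:
    """Hypothetical in-memory store that answers a scoped read directly."""

    def __init__(self, records):
        self._records = list(records)

    def all(self):
        return list(self._records)

    def find_by_context(self, context_id):
        return [r for r in self._records if r.context_id == context_id]


def projection_health(store, context_id):
    """Summarize outcomes for one context only, never from all()."""
    return Counter(r.outcome for r in store.find_by_context(context_id))
```

Summarizing from `find_by_context` rather than `all()` keeps another context's outcomes out of the count, which is the leak this commit closes.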

…bel#32)

The extraction pipeline never filled DICE's provenance model, so every proposition left with provenanceEntries = [], and nothing downstream could trace a fact to its source by URI, file, content hash, or chunk. Populate it where each layer actually knows the source:

- PropositionPipeline.processChunk stamps every proposition with a ProvenanceEntry(chunkId, contentHash). The locator is the caller's SourceAnalysisContext.sourceLocator when set, else a ContentAddressedLocator over the chunk text, which is always available and honest about what grounds the fact. Stamped before revision, so merges union it and a new proposition keeps it.
- SourceAnalysisContext gains an optional sourceLocator + withSourceLocator.
- LlmPropositionReviser unions provenanceEntries on merge/reinforce (deduped), mirroring the existing grounding union.
- PropositionDto exposes provenance via a slim ProvenanceEntryDto.
- TextIngestionHandler sets the locator on the context and lets the pipeline stamp once, instead of stamping by hand after extraction, so provenance now has a single owner across every ingestion path.

grounding: List<String> is unchanged (backward compatible).

Tests:

- PropositionPipelineTest.ProvenanceStampingTests + ProvenanceRevisionIntegrationTests
- PropositionReviserTest.ProvenanceUnionTests (real reviser, canonical-match merge)
- ProvenancePopulationE2ETest (dice-integration-tests): artifact -> handler -> pipeline -> store -> read-back -> REST DTO, batch multi-source, no-locator fallback

Closes embabel#32

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
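The deduplicating union on merge/reinforce might look roughly like the Python sketch below. It is a language-neutral illustration: the three-field shape of `ProvenanceEntry` and the `union_provenance` helper are assumptions, since the commit message names the type but not its full definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceEntry:
    locator: str        # hypothetical string form of the source locator
    chunk_id: str
    content_hash: str


def union_provenance(existing, incoming):
    """Order-preserving union: keep existing entries, append unseen incoming ones."""
    merged = list(existing)
    seen = set(existing)
    for entry in incoming:
        if entry not in seen:
            seen.add(entry)
            merged.append(entry)
    return merged
```

Because the entries are value objects, stamping the same chunk twice and merging yields a single entry, which is the property a double-stamp regression test would check.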

Record and fix the four Codex review findings around discovery wiring, projection lineage persistence, relationship reconciliation, and context-scoped graph queries. Tests: ./mvnw -pl dice test; ./mvnw verify Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

…est gaps

A partial-failure batch the persister can't attribute per edge no longer paints a succeeded edge as FAILED: persistenceFailed now decides precisely when per-edge refs are present, and otherwise only fails an item when the whole batch failed. The shipped persister always reports refs, so production behaviour is unchanged; this fixes a custom or non-reporting persister mislabeling successes (and the null-relationship edge case).

Tests:

- GraphProjectionServiceLineageTest: a partial-failure batch marks only the attributed edge FAILED, and an unattributable partial failure marks no succeeded edge FAILED.
- SeededGraphNoDuplicateNodesIT: findRelated now answers from the live container and a second projection of the same proposition is asserted ADOPTED, exercising the reconciler's adopt path against a real graph instead of an always-empty mock.
- PropositionReviserTest: the reviser's reinforce path unions provenance (was only covered for merge).
- ProvenancePopulationE2ETest: assert exactly one provenance entry, catching a double-stamp regression the prior any{} check would miss.

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
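The decision rule described for persistenceFailed can be written down as a small Python sketch. The signature is hypothetical (the real Kotlin function's parameters are not shown in this PR), and the use of `None` to mean "the persister did not report per-edge refs" is an assumption made for illustration.

```python
def persistence_failed(edge_ref, failed_refs, persisted_count, attempted_count):
    """Decide whether one edge should be recorded as FAILED.

    failed_refs: set of edge refs the persister reported as failed, or None
    when the persister does not attribute failures per edge.
    """
    if failed_refs is not None:
        # Precise path: trust the per-edge attribution.
        return edge_ref in failed_refs
    # Unattributable path: fail the item only if the whole batch failed.
    return attempted_count > 0 and persisted_count == 0
```

Under this rule, an unattributable partial failure marks no edge FAILED, which trades some missed failure labels for never mislabeling a success.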

jimador mentioned this pull request Jun 19, 2026

Metamodel + knowledge bundles #50

Draft

jimador requested review from jasperblues and johnsonr June 19, 2026 12:28

jimador force-pushed the feat/graph-backend-and-retrieval branch 8 times, most recently from 54946d7 to 78d9acc, June 22, 2026 16:45

LordKay-sudo mentioned this pull request Jun 22, 2026

Populate provenanceEntries during proposition extraction and revision #32

Closed

6 tasks

jimador force-pushed the feat/graph-backend-and-retrieval branch from f4f845b to e923dd0, June 22, 2026 19:33

jimador mentioned this pull request Jun 23, 2026

Extraction concurrency + maintenance #48

Merged

jimador force-pushed the feat/graph-backend-and-retrieval branch 3 times, most recently from 384734d to ecae4ef, June 24, 2026 06:10

jimador added 13 commits June 24, 2026 02:26

refactor: extract dice-report, dice-ingestion, and dice-integration-t…

7bd0daf

…ests modules Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

build: add dice-ingestion and dice-report to dependencyManagement for…

7622dc7

… cross-module refs Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

jimador force-pushed the feat/graph-backend-and-retrieval branch from ecae4ef to d4ebcbf, June 24, 2026 06:29

jimador added 4 commits June 24, 2026 02:49

jimador marked this pull request as ready for review June 24, 2026 07:15

jimador added 4 commits June 25, 2026 11:56

chore: remove ignored codex issue log

5d19071

Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>

This was referenced Jun 26, 2026

Wire provenanceEntries through extraction and revision #51

Closed

feat(mcp): expose DICE tools for MCP server integration #52

Open

johnsonr approved these changes Jun 27, 2026

johnsonr merged commit f348ca1 into embabel:main Jun 27, 2026
4 checks passed

johnsonr merged 22 commits into embabel:main from jimador:feat/graph-backend-and-retrieval

2 participants
