Graph backend + retrieval #49
Merged
johnsonr merged 22 commits on Jun 27, 2026
…uthority
Makes projection to a graph observable and traceable, and carries source authority through to the projected edges.
- GraphProjectionService and RelationBasedGraphProjector report why a projection was skipped or failed instead of failing silently; Projection carries structured failure reasons
- ProjectionRecord lineage records what projected where; ProjectionLineageStaleCascade cascades a stale mark to everything a stale proposition projected
- ProjectionPolicySupport carries a proposition's source authority onto the projected edge
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
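The stale cascade in this commit can be pictured with a minimal Kotlin sketch. The `LineageRecord` shape, the `InMemoryLineage` class, and its method signatures are illustrative assumptions, not the actual `ProjectionRecord` or `ProjectionLineageStaleCascade` API.

```kotlin
// Hypothetical, simplified lineage record: which proposition projected to which graph target.
data class LineageRecord(
    val propositionId: String,
    val targetRef: String,
    val stale: Boolean = false,
)

// Hypothetical in-memory store; the real record-store SPI may look different.
class InMemoryLineage {
    private val records = mutableListOf<LineageRecord>()

    fun record(propositionId: String, targetRef: String) {
        records += LineageRecord(propositionId, targetRef)
    }

    // Marks every not-yet-stale record of the proposition as stale and returns how many changed.
    fun markStaleByProposition(propositionId: String): Int {
        var changed = 0
        for (i in records.indices) {
            val r = records[i]
            if (r.propositionId == propositionId && !r.stale) {
                records[i] = r.copy(stale = true)
                changed++
            }
        }
        return changed
    }

    fun staleTargets(): List<String> = records.filter { it.stale }.map { it.targetRef }
}

fun main() {
    val lineage = InMemoryLineage()
    lineage.record("p1", "alice-[KNOWS]->bob")
    lineage.record("p1", "alice-[WORKS_AT]->acme")
    lineage.record("p2", "bob-[KNOWS]->carol")
    println(lineage.markStaleByProposition("p1")) // 2
    println(lineage.staleTargets())
}
```

Returning the number of changed records makes a repeated cascade over the same proposition a visible no-op.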
…ctors
Adds projectors that turn propositions into output for people rather than graphs.
- RationaleProjector / LlmRationaleProjector explain why a conclusion holds
- ReportProjector / StructuredReportProjector assemble a structured report
- SemanticLink / SemanticLinkDiscoverer surface non-obvious two-hop connections
Also fills out the KDoc on AlwaysCreateEntityResolver and EscalatingEntityResolver so the default resolver behavior is documented at the call site.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…estion
Wires the projection-lineage SPI to a real repository, adds a reference Neo4j adapter, and gives DICE one front door for getting source material in.
- RepositoryBackedReconciler resolves whether to create or reuse a graph artifact by looking it up, instead of always creating, which is what keeps re-projection from duplicating nodes; comes with a seeded-graph integration proof
- Neo4jRagPropositionRepository is a reference store backed by the RAG entity store, declaring only the capability fragments it honestly supports
- ingestion SPI: IngestionHandler / TextIngestionHandler turn artifacts into chunks; IngestionLedger dedups by content hash; IngestionResult reports per batch
- Testcontainers test-scope deps and a Docker Engine api.version pin for the proof
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
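Dedup by content hash, as the ingestion ledger is described above, could look roughly like the following. This is a sketch under assumptions: `ContentHashLedger` and `firstSeen` are hypothetical names, and SHA-256 is an assumed hash choice, since the commit does not say which hash `IngestionLedger` uses.

```kotlin
import java.security.MessageDigest

// Hypothetical ledger: remembers the hashes of content that was already ingested.
class ContentHashLedger {
    private val seen = mutableSetOf<String>()

    // Returns true the first time a given content is offered, false for a duplicate.
    fun firstSeen(content: String): Boolean = seen.add(sha256(content))

    // Hex-encoded SHA-256 of the UTF-8 bytes (assumed hash; the real ledger may differ).
    private fun sha256(text: String): String =
        MessageDigest.getInstance("SHA-256")
            .digest(text.toByteArray(Charsets.UTF_8))
            .joinToString("") { "%02x".format(it) }
}

fun main() {
    val ledger = ContentHashLedger()
    val chunks = listOf("Alice works at Acme.", "Bob knows Alice.", "Alice works at Acme.")
    // Only chunks the ledger has not seen go on to extraction.
    val fresh = chunks.filter { ledger.firstSeen(it) }
    println(fresh.size) // 2
}
```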
Adds the read side over the graph: ask for an entity's neighborhood, the path between two entities, or why a proposition is believed.
- GraphQuery with GraphNeighborhood, GraphPath, and PropositionLineage
- GraphQueryCapable is the store fragment that answers these, and can filter by the source authority carried on each edge
- GraphQueryTools exposes the queries as agent tools
Includes the canonical-flow harness that runs the whole extract -> resolve -> project -> query path end to end without an LLM or a database.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…tools
Adds a router that picks how to retrieve for a given query, with REST and agent surfaces over it.
- RetrievalMode and RetrievalRouter route a query to the right strategy, including a hybrid mode that combines vector and graph
- DiscoveryQuery / DiscoveryDtos, DiscoveryController, and DiscoveryTools expose the router over REST and as agent tools, registered through DiceRestConfiguration
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
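One way to read "one router over many retrieval modes" is the sketch below. `Mode`, `Strategy`, and `SketchRouter` are hypothetical stand-ins rather than the real `RetrievalMode` / `RetrievalRouter` types, and the merge order in hybrid mode (vector hits first, then graph hits) is an assumption.

```kotlin
// Hypothetical mode set, reduced to the three modes this commit message names.
enum class Mode { VECTOR, GRAPH, HYBRID }

// A strategy takes a query string and returns ranked result ids.
typealias Strategy = (String) -> List<String>

class SketchRouter(private val strategies: Map<Mode, Strategy>) {
    fun route(mode: Mode, query: String, topK: Int): List<String> {
        val results = when (mode) {
            // Hybrid: vector hits first, then graph hits, duplicates dropped.
            Mode.HYBRID -> (runMode(Mode.VECTOR, query) + runMode(Mode.GRAPH, query)).distinct()
            else -> runMode(mode, query)
        }
        return results.take(topK)
    }

    // A mode with no registered strategy degrades to an empty result instead of throwing.
    private fun runMode(mode: Mode, query: String): List<String> =
        strategies[mode]?.invoke(query) ?: emptyList()
}

fun main() {
    val vector: Strategy = { listOf("p1", "p2") }
    val graph: Strategy = { listOf("p2", "p3") }
    val router = SketchRouter(mapOf(Mode.VECTOR to vector, Mode.GRAPH to graph))
    println(router.route(Mode.HYBRID, "who knows Alice?", topK = 3)) // [p1, p2, p3]
}
```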
…ests modules
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
… cross-module refs
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
DICE shipped only in-memory ProjectionRecordStore and CollectorRecordStore. Add Drivine-backed implementations that persist projection lineage and the collector audit trail as graph nodes, so they survive a restart and stay queryable. The graph-backed projection store also implements a real markStaleByProposition (the SPI default is a no-op), keeping the lifecycle cascade working against a durable store.
Both are wired through DiceStorageAutoConfiguration on the existing embabel.dice.store.type=graph flip, default to in-memory otherwise, and are ConditionalOnMissingBean so an application's own bean wins. Reads and writes use parameterized Cypher (MERGE on the natural key for idempotent upserts); row mapping is extracted so it can be unit-tested without a database. Covered by a Neo4j integration test and row-mapper unit tests.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
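A parameterized MERGE on a natural key, as mentioned above, might be shaped like this. The node label, the property names, and the choice of (`propositionId`, `targetRef`) as the natural key are assumptions for illustration; the Drivine calls that would execute the statement are deliberately left out.

```kotlin
// Hypothetical label and property names; the real schema is not shown in this PR text.
// MERGE matches on the natural key, so running the statement twice updates one node
// instead of creating a second one.
val UPSERT_PROJECTION_RECORD: String =
    "MERGE (r:ProjectionRecord {propositionId: \$propositionId, targetRef: \$targetRef}) " +
    "SET r.outcome = \$outcome " +
    "RETURN r"

// Values travel as query parameters and are never spliced into the query text.
fun upsertParams(propositionId: String, targetRef: String, outcome: String): Map<String, Any> =
    mapOf("propositionId" to propositionId, "targetRef" to targetRef, "outcome" to outcome)

fun main() {
    println(UPSERT_PROJECTION_RECORD)
    println(upsertParams("p1", "alice-[KNOWS]->bob", "PROJECTED"))
}
```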
Add docs/design/graph-projection.md (lineage, named outcomes, the stale cascade, idempotent ingestion/reconciliation, and reaching the graph through a port) and docs/design/retrieval-and-discovery.md (store-agnostic graph queries, query-time authority filtering, one router over many retrieval modes, DTO/context isolation, anchorless serendipitous links, and explainability). Add module AGENTS.md for dice-report, dice-ingestion, and dice-integration-tests, and list them in the root guide.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
… and report paths
Wire SLF4J loggers through the retrieval router, graph/lineage stores, report projectors, and ingestion so a consuming application can see the decision and persistence paths:
- retrieval: log routing mode/topK/depth, result counts, and per-mode degradation when a capability is unsupported
- report: log rationale projection and structured-report counts, and semantic-link discovery sizes
- lineage/storage: log stale-cascade and reconciliation outcomes, Drivine collector/projection record writes, and auto-configuration wiring
- ingestion: log batch start/finish summaries, dedup hits, and extraction failures
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…bel.dice.spi
Update this branch's new code (graph projectors, graph/discovery query types, the ingestion artifact model, and their tests) to import the policy SPIs (TrustScorer, AuthorityResolver/AuthorityTier and friends, ConflictType) from the com.embabel.dice.spi package they now live in.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…uides
The package map in dice/AGENTS.md skipped query.graph (GraphQuery, GraphNeighborhood, GraphPath, PropositionLineage) and query.discovery (RetrievalRouter, DiscoveryQuery, RetrievalMode), both public API this branch adds, so an agent navigating by the map could not find them. Add both rows and mention graph/discovery retrieval in the root module description.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
The sweep-policy types (MarkReason, PropositionMark, StatusTransitionSweepPolicy) live in com.embabel.dice.spi alongside the other lifecycle policies. The storage row mappers, the discovery DTO-leak gate, and the canonical-flow integration tests still pointed at the old projection.memory package; repoint them so every module compiles against the policy SPI.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Add a design note covering the persistence mechanics no existing note explained (backend selection, defense-in-depth dedup, the two-phase save, materialised effective confidence, schema-as-beans, and the scheduled decay tick), with diagrams. Give dice-storage-autoconfigure its own AGENTS.md (the only module that lacked one), and link graph-projection, retrieval-and-discovery, and durable-storage from the README design-notes index and the root navigation guide.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Verified each reviewer finding against the code before acting (two were false positives and left as-is: a Kotlin self-initializer scoping claim, and the intentional, test-covered RELATED_TO fallback).
- LlmGraphProjector: pick the source/target mention by the LLM's span first, falling back to role only when no span matches. The combined `span || role` find let an earlier role-matching mention win over the mention the span actually named, producing wrong-direction edges.
- GraphProjectionService: isolate each lineage-record write so a flaky record store can't drop the trail for every remaining result after a mid-batch failure.
- GraphQuery.whyExplain: honor the context scope on the global findById path so a context-bound query can't return foreign-context lineage.
- GraphQueryCapable: the authority-aware overloads now throw when a backend sets honorsAuthorityFilter but doesn't override them, instead of silently returning unfiltered results.
- InMemoryProjectionRecordStore: make the stale check-and-set atomic so concurrent calls don't double-count transitions.
- Drivine projection/collector stores: skip and log a corrupt row rather than failing the whole all() query on one bad node.
- RetrievalRouter.graphPath: log cross-context paths that are dropped so an empty result is distinguishable from a disconnected graph.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
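The LlmGraphProjector fix is easiest to see side by side. The `Mention` shape and both function names below are hypothetical; only the `span || role` condition and the span-first fallback order come from the commit message.

```kotlin
// Hypothetical mention shape; the real mention type may carry more fields.
data class Mention(val span: String, val role: String)

// Combined variant: the first mention matching span OR role wins,
// so an earlier role match can beat the mention the span actually named.
fun pickCombined(mentions: List<Mention>, span: String, role: String): Mention? =
    mentions.firstOrNull { it.span == span || it.role == role }

// Span-first variant: prefer the named span, fall back to role only if no span matches.
fun pickSpanFirst(mentions: List<Mention>, span: String, role: String): Mention? =
    mentions.firstOrNull { it.span == span } ?: mentions.firstOrNull { it.role == role }

fun main() {
    val mentions = listOf(Mention("Acme", "SUBJECT"), Mention("Alice", "OBJECT"))
    // The LLM named the span "Alice" as the source and labelled the role SUBJECT.
    println(pickCombined(mentions, "Alice", "SUBJECT"))  // Mention(span=Acme, role=SUBJECT)
    println(pickSpanFirst(mentions, "Alice", "SUBJECT")) // Mention(span=Alice, role=OBJECT)
}
```

With the combined predicate the edge would start at Acme rather than Alice, which is the wrong-direction edge the commit describes.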
…e SPIs
Second adversarial-review pass on the graph + retrieval branch.
- Drivine projection/collector stores: every findBy* now pushes its predicate into Cypher instead of loading the whole table and filtering in memory, so a single-key lookup no longer scans the entire lineage. Added Neo4j integration tests asserting each finder returns only its matching subset.
- GraphProjectionService: reconcile against the pre-persist graph state (a repository-backed reconciler consulted after the write would always see the node and never record PROJECTED), and reference the produced edge (source-[type]->target) as the lineage targetRef rather than just the source node so findByTargetRef resolves to the specific edge.
- MarkReason.Custom: reject the reserved stale/duplicate keys (and blanks) at construction so a Custom can't round-trip back as a built-in reason.
- Projection rejection messages quote the policy's actual confidence threshold instead of a hardcoded constant.
- DiscoveryQuery exposes a caller-set similarityThreshold (default 0.0, clamped) threaded into the vector and hybrid search requests.
- Discovery DTO leak test now rejects any raw com.embabel.dice.proposition type, catching an accidentally-exposed enum, not just the exact FQNs.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
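Two of the guards above, rejecting reserved keys at construction and clamping a caller-set threshold, can be sketched as follows. `CustomReason` and `clampThreshold` are hypothetical names, the exact reserved key strings are assumed from the commit wording, and the clamp range of 0.0 to 1.0 is an assumption since the commit only says "clamped".

```kotlin
// Assumed reserved keys, taken from the "stale/duplicate" wording in the commit message.
private val RESERVED_KEYS = setOf("stale", "duplicate")

// Hypothetical stand-in for a custom mark reason that validates its key up front.
class CustomReason(val key: String) {
    init {
        require(key.isNotBlank()) { "reason key must not be blank" }
        require(key !in RESERVED_KEYS) { "'$key' is reserved for a built-in reason" }
    }
}

// Assumed clamp range [0.0, 1.0]; out-of-range input is pulled to the nearest bound.
fun clampThreshold(value: Double): Double = value.coerceIn(0.0, 1.0)

fun main() {
    println(CustomReason("superseded-by-import").key)
    println(runCatching { CustomReason("stale") }.isFailure) // true
    println(clampThreshold(1.7)) // 1.0
}
```

Validating in `init` means an invalid reason can never exist, so a stored key cannot later be misread as a built-in reason.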
Make the architecture legible without reading the code, and close the navigation gaps the review found.
- New docs/design/architecture.md: a top-level system overview tying the subsystems together (store + trust, extraction, maintenance, projection, query/retrieval/discovery, report) with system, store-SPI, maintenance, retrieval, expose-layer, and graph-schema diagrams.
- Enriched the per-subsystem design notes with sequence, class, state, and flow diagrams so each communicates intent visually (55 diagrams total, all parse-validated).
- AGENTS.md navigation: add GraphQueryCapable to the capability fragments, DiscoveryController to web.rest, DiscoveryTools/GraphQueryTools to agent, and the Drivine projection/collector record stores + LineageRowMappers to the storage module guide.
- proposition-lifecycle: add the pinning primitive; graph-projection: document the three-way reconciliation decision; fix a mermaid label.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…scoped read
Projection-health aggregated lineage across every context, leaking other contexts' projection activity into a context-scoped endpoint/tool.
- ProjectionRecord carries the context the proposition belongs to.
- ProjectionRecordStore gains findByContext; the REST endpoint and agent tool summarize health from findByContext(contextId), not all().
- The durable Drivine store implements findByContext with scoped Cypher, and the in-memory store filters its backing list directly, so no implementation loads the whole table to answer a scoped read. The all()-based SPI defaults are documented as a trivial-store fallback that durable stores MUST override.
- Added a Neo4j integration test asserting findByContext returns only the requested context's records.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…bel#32)
The extraction pipeline never filled DICE's provenance model, so every proposition left with provenanceEntries = [], and nothing downstream could trace a fact to its source by URI, file, content hash, or chunk. Populate it where each layer actually knows the source:
- PropositionPipeline.processChunk stamps every proposition with a ProvenanceEntry(chunkId, contentHash). The locator is the caller's SourceAnalysisContext.sourceLocator when set, else a ContentAddressedLocator over the chunk text, which is always available and honest about what grounds the fact. Stamped before revision, so merges union it and a new proposition keeps it.
- SourceAnalysisContext gains an optional sourceLocator + withSourceLocator.
- LlmPropositionReviser unions provenanceEntries on merge/reinforce (deduped), mirroring the existing grounding union.
- PropositionDto exposes provenance via a slim ProvenanceEntryDto.
- TextIngestionHandler sets the locator on the context and lets the pipeline stamp once, instead of stamping by hand after extraction; provenance now has a single owner across every ingestion path.
grounding: List<String> is unchanged (backward compatible).
Tests:
- PropositionPipelineTest.ProvenanceStampingTests + ProvenanceRevisionIntegrationTests
- PropositionReviserTest.ProvenanceUnionTests (real reviser, canonical-match merge)
- ProvenancePopulationE2ETest (dice-integration-tests): artifact -> handler -> pipeline -> store -> read-back -> REST DTO, batch multi-source, no-locator fallback
Closes embabel#32
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
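The deduped provenance union on merge/reinforce can be sketched as below. `ProvenanceSketch` is a hypothetical reduction of `ProvenanceEntry` to the two fields the commit names, and `unionProvenance` is an assumed helper name; the real reviser may dedup on different criteria.

```kotlin
// Hypothetical, reduced provenance entry: just the chunk id and the content hash.
data class ProvenanceSketch(val chunkId: String, val contentHash: String)

// Union of both lists with exact duplicates dropped; existing entries keep their order
// and new entries are appended after them.
fun unionProvenance(
    existing: List<ProvenanceSketch>,
    incoming: List<ProvenanceSketch>,
): List<ProvenanceSketch> = (existing + incoming).distinct()

fun main() {
    val existing = listOf(ProvenanceSketch("chunk-1", "abc"))
    val incoming = listOf(ProvenanceSketch("chunk-1", "abc"), ProvenanceSketch("chunk-2", "def"))
    println(unionProvenance(existing, incoming).size) // 2
}
```

Because data classes compare by value, re-ingesting the same chunk reinforces a proposition without growing its provenance list.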
Record and fix the four Codex review findings around discovery wiring, projection lineage persistence, relationship reconciliation, and context-scoped graph queries.
Tests: ./mvnw -pl dice test; ./mvnw verify
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
…est gaps
A partial-failure batch the persister can't attribute per edge no longer paints a
succeeded edge as FAILED: persistenceFailed now decides precisely when per-edge refs
are present, and otherwise only fails an item when the whole batch failed. The shipped
persister always reports refs, so production behaviour is unchanged; this fixes a custom
or non-reporting persister mislabeling successes (and the null-relationship edge case).
Tests:
- GraphProjectionServiceLineageTest: a partial-failure batch marks only the attributed
edge FAILED, and an unattributable partial failure marks no succeeded edge FAILED.
- SeededGraphNoDuplicateNodesIT: findRelated now answers from the live container and a
second projection of the same proposition is asserted ADOPTED, exercising the
reconciler's adopt path against a real graph instead of an always-empty mock.
- PropositionReviserTest: the reviser's reinforce path unions provenance (was only
covered for merge).
- ProvenancePopulationE2ETest: assert exactly one provenance entry, catching a
double-stamp regression the prior any{} check would miss.
Signed-off-by: James Dunnam <7660553+jimador@users.noreply.github.com>
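The decision rule described in this commit can be sketched as follows. `BatchResult` and its fields are hypothetical; only the rule itself (decide per edge when refs are present, otherwise fail an item only when the whole batch failed) comes from the commit message.

```kotlin
// Hypothetical batch result. failedRefs is null when the persister cannot say
// which edges failed; otherwise it lists the refs of the failed edges.
data class BatchResult(
    val attempted: Int,
    val failedCount: Int,
    val failedRefs: Set<String>?,
)

// With per-edge refs, decide precisely. Without them, only a total batch failure
// marks an individual edge as failed.
fun persistenceFailed(edgeRef: String, result: BatchResult): Boolean =
    result.failedRefs?.let { edgeRef in it }
        ?: (result.attempted > 0 && result.failedCount == result.attempted)

fun main() {
    val attributed = BatchResult(attempted = 2, failedCount = 1, failedRefs = setOf("a-[KNOWS]->b"))
    println(persistenceFailed("a-[KNOWS]->b", attributed)) // true
    println(persistenceFailed("a-[KNOWS]->c", attributed)) // false

    val unattributed = BatchResult(attempted = 2, failedCount = 1, failedRefs = null)
    println(persistenceFailed("a-[KNOWS]->c", unattributed)) // false
}
```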
johnsonr
approved these changes
Jun 27, 2026
Graph backend + retrieval
Third of the four-PR stack, and the largest: the E2E harness and the
report/ingestion modules are coupled to graph-query, so the whole retrieval layer ships together. Stacked on #48: review
after #47 and #48 merge; until then the diff includes their commits.
Carves out three modules: dice-report, dice-ingestion, dice-integration-tests.
What's in it
back to its source propositions and their authority tier, a stale-cascade invalidates dependent
projections when a source changes, and each projection reports a success / skip / fail outcome.
structured report over a set of propositions, and a
SemanticLinkDiscoverer that surfaces non-obvious multi-hop connections direct queries would miss.
chunks, a content-hash ledger dedups before extraction), a reconciler that looks up existing graph
artifacts before creating new ones, and a Neo4j adapter on embabel-agent's RAG
NamedEntityDataRepository.
(neighborhood / path / lineage) behind a capability interface, exposed as
@LlmTool agent tools. Queries respect edge authority, so low-trust edges can be filtered at query time.
retrieval modes (vector, entity, graph-walk, temporal, hybrid) behind one entry point, exposed
over REST and as agent tools, with DTOs keeping internal types off the wire.
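Query-time authority filtering, as described in the list above, could be sketched like this. `Tier`, `Edge`, and `neighborhood` are hypothetical names; the real `AuthorityTier` values and the `GraphQueryCapable` signatures are not shown in this PR text.

```kotlin
// Hypothetical authority tiers, declared from least to most trusted so that
// the enum's natural ordering matches the trust ordering.
enum class Tier { LOW, MEDIUM, HIGH }

// Hypothetical edge carrying the authority of the proposition it was projected from.
data class Edge(val source: String, val type: String, val target: String, val authority: Tier)

// Edges touching the entity, keeping only those at or above the requested authority.
fun neighborhood(edges: List<Edge>, entity: String, minAuthority: Tier): List<Edge> =
    edges.filter { (it.source == entity || it.target == entity) && it.authority >= minAuthority }

fun main() {
    val edges = listOf(
        Edge("alice", "KNOWS", "bob", Tier.HIGH),
        Edge("alice", "WORKS_AT", "acme", Tier.LOW),
        Edge("bob", "KNOWS", "carol", Tier.HIGH),
    )
    println(neighborhood(edges, "alice", Tier.MEDIUM).map { it.type }) // [KNOWS]
}
```

Filtering at read time leaves low-trust edges in the graph, so a caller with a lower bar can still see them.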
Resolves pre-existing design issues
(RetrievalRouter/RetrievalMode) plus SemanticLinkDiscoverer surface indirect, multi-hop connections the direct lookups missed.
SemanticLinkDiscoverer realizes "surprise" as non-obvious links the graph implies but nobody stated directly.
Related but not closed here: #7 (extraction context metadata).
Closes #40
Closes #41
Closes #42
Closes #43
Closes #44
Closes #18
Closes #1
Closes #32