Problem
DICE has a rich provenance model (ProvenanceEntry, SourceLocator) and Proposition.provenanceEntries, but the extraction pipeline never populates it. Every proposition extracted via PropositionPipeline ends up with provenanceEntries = []; only the legacy grounding: List<String> chunk-ID list is set.
Downstream consumers (REST API, Neo4j projection, MCP tooling, provenance export) cannot link propositions back to source material via URI, file path, content hash, or character offsets — even though the data model supports it.
Evidence
SuggestedProposition.toProposition() sets grounding only — no provenanceEntries
LlmPropositionExtractor.resolvePropositions() calls toProposition() without provenance wiring
LlmPropositionReviser.mergePropositions() / reinforceProposition() merge grounding but not provenanceEntries
PropositionDto (REST) exposes grounding but not provenanceEntries
Proposition KDoc describes provenanceEntries as complementing the legacy grounding list
Unit tests exist for ProvenanceEntry / SourceLocator and manual withProvenanceEntries() — but no integration test proves extraction populates them.
Specs/blog claim "full provenance tracking" — this closes the pipeline gap
Motivation
Building MCP tooling and knowledge-graph apps where users need to verify where a proposition came from, not just which chunk ID grounded it. Happy to contribute a PR if this direction looks right.
Proposed solution
ProvenanceFactory (or similar): build ProvenanceEntry from Chunk + optional SourceAnalysisContext fields:
  chunkId from chunk
  contentHash from ContentHasher (already used in processOnce)
  SourceLocator from chunk metadata / sourceId (e.g. ConnectorRef for email:<threadId>, UriLocator when URI known)
Wire into extraction: SuggestedProposition.toProposition() or resolvePropositions() attaches provenance entries alongside grounding
Wire into revision: mergePropositions / reinforceProposition union provenanceEntries (like grounding)
Extend SourceAnalysisContext (optional): carry a default SourceLocator or provenance template for the current analysis run
REST: add provenanceEntries to PropositionDto (or a slim DTO)
Tests: pipeline integration test asserting extracted propositions have non-empty provenanceEntries when chunk metadata is present
Acceptance criteria
Extracted propositions carry a ProvenanceEntry with chunkId when processed via processChunk
processOnce entries include contentHash when hasher is used
grounding list unchanged (backward compatible)
PropositionDto exposes provenance (or documented alternative)
Related
com.embabel.dice.provenance package (already implemented)
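To make the proposed ProvenanceFactory and the revision-time union more concrete, here is a minimal Kotlin sketch. Every type in it is a hypothetical stand-in, not the actual classes in com.embabel.dice.provenance: the field names, the "uri" and "sourceId" metadata keys, the sha256Hex helper (standing in for ContentHasher), and the mergeProvenance function are all assumptions made for illustration, and the real signatures may differ.

```kotlin
import java.security.MessageDigest

// Hypothetical stand-ins for the DICE types; the real classes may look different.
sealed interface SourceLocator
data class UriLocator(val uri: String) : SourceLocator
data class ConnectorRef(val ref: String) : SourceLocator

data class ProvenanceEntry(
    val chunkId: String,
    val contentHash: String? = null,
    val locator: SourceLocator? = null,
)

data class Chunk(
    val id: String,
    val text: String,
    val metadata: Map<String, String> = emptyMap(),
)

data class Proposition(
    val text: String,
    val grounding: List<String> = emptyList(),
    val provenanceEntries: List<ProvenanceEntry> = emptyList(),
)

// SHA-256 hex digest; an assumed substitute for whatever ContentHasher computes.
fun sha256Hex(text: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(text.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

object ProvenanceFactory {
    // Builds one entry per chunk. Prefers a URI locator when the (assumed)
    // "uri" metadata key is present, otherwise falls back to "sourceId".
    fun fromChunk(chunk: Chunk, hashContent: Boolean = true): ProvenanceEntry {
        val locator: SourceLocator? =
            chunk.metadata["uri"]?.let { UriLocator(it) }
                ?: chunk.metadata["sourceId"]?.let { ConnectorRef(it) }
        return ProvenanceEntry(
            chunkId = chunk.id,
            contentHash = if (hashContent) sha256Hex(chunk.text) else null,
            locator = locator,
        )
    }
}

// Unions provenance entries the same way grounding is unioned,
// dropping exact duplicates and keeping first-seen order.
fun mergeProvenance(existing: Proposition, incoming: Proposition): Proposition =
    existing.copy(
        grounding = (existing.grounding + incoming.grounding).distinct(),
        provenanceEntries =
            (existing.provenanceEntries + incoming.provenanceEntries).distinct(),
    )
```

Because the entries are data classes, distinct() deduplicates by value, so reinforcing a proposition from the same chunk twice would not grow the list. The grounding list is still populated alongside, which matches the backward-compatibility criterion.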