Populate provenanceEntries during proposition extraction and revision #32

Description

@LordKay-sudo

Problem

DICE has a rich provenance model (ProvenanceEntry, SourceLocator) and Proposition.provenanceEntries, but the extraction pipeline never populates it. Every proposition extracted via PropositionPipeline ends up with provenanceEntries = []; only the legacy grounding: List<String> chunk-ID list is set.

Downstream consumers (REST API, Neo4j projection, MCP tooling, provenance export) cannot link propositions back to source material via URI, file path, content hash, or character offsets — even though the data model supports it.

Evidence

  • SuggestedProposition.toProposition() sets grounding only — no provenanceEntries
  • LlmPropositionExtractor.resolvePropositions() calls toProposition() without provenance wiring
  • LlmPropositionReviser.mergePropositions() / reinforceProposition() merge grounding but not provenanceEntries
  • PropositionDto (REST) exposes grounding but not provenanceEntries
  • Proposition KDoc describes provenanceEntries as complementing the legacy grounding list

Unit tests exist for ProvenanceEntry / SourceLocator and manual withProvenanceEntries() — but no integration test proves extraction populates them.

Proposed solution

  1. ProvenanceFactory (or similar) — build ProvenanceEntry from Chunk + optional SourceAnalysisContext fields:

    • chunkId from chunk
    • contentHash from ContentHasher (already used in processOnce)
    • SourceLocator from chunk metadata / sourceId (e.g. ConnectorRef for email:<threadId>, UriLocator when URI known)
  2. Wire into extraction: SuggestedProposition.toProposition() or resolvePropositions() attaches provenance entries alongside grounding

  3. Wire into revision: mergePropositions / reinforceProposition union provenanceEntries (like grounding)

  4. Extend SourceAnalysisContext (optional) — carry a default SourceLocator or provenance template for the current analysis run

  5. REST — add provenanceEntries to PropositionDto (or a slim DTO)

  6. Tests — pipeline integration test asserting extracted propositions have non-empty provenanceEntries when chunk metadata is present
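To make step 1 concrete, here is a rough Kotlin sketch of what a factory along these lines could look like. Every type here (SketchChunk, SketchLocator, SketchProvenanceEntry, SketchProvenanceFactory) is a simplified stand-in invented for illustration, not the real classes in com.embabel.dice.provenance. The "uri" and "sourceId" metadata keys and the sha256Hex helper are also assumptions; the actual ContentHasher API may look quite different.

```kotlin
import java.security.MessageDigest

// Simplified stand-ins for illustration only; the real DICE types differ.
data class SketchChunk(
    val id: String,
    val text: String,
    val metadata: Map<String, String> = emptyMap(),
)

sealed interface SketchLocator
data class SketchUriLocator(val uri: String) : SketchLocator
data class SketchConnectorRef(val connector: String, val externalId: String) : SketchLocator

data class SketchProvenanceEntry(
    val chunkId: String,
    val contentHash: String?,
    val locator: SketchLocator?,
)

// Hypothetical hasher standing in for ContentHasher (its real API is not shown in this issue).
fun sha256Hex(text: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(text.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

object SketchProvenanceFactory {
    // Builds one entry per chunk. The hash is only set when a hasher is supplied,
    // mirroring the "contentHash when hasher is used" idea.
    fun fromChunk(
        chunk: SketchChunk,
        hash: ((String) -> String)? = null,
    ): SketchProvenanceEntry {
        val uri = chunk.metadata["uri"]            // assumed metadata key
        val sourceId = chunk.metadata["sourceId"]  // assumed metadata key, e.g. "email:<threadId>"
        val locator: SketchLocator? = when {
            uri != null -> SketchUriLocator(uri)
            sourceId != null && ':' in sourceId ->
                SketchConnectorRef(sourceId.substringBefore(':'), sourceId.substringAfter(':'))
            else -> null
        }
        return SketchProvenanceEntry(chunk.id, hash?.invoke(chunk.text), locator)
    }
}
```

In this sketch a known URI wins over a connector-style sourceId, and a chunk with neither still gets an entry carrying its chunkId, so the first acceptance criterion would hold even for chunks with sparse metadata. Whether that precedence is right for DICE is an open question.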

Acceptance criteria

  • Extracted propositions include at least one ProvenanceEntry with chunkId when processed via processChunk
  • processOnce entries include contentHash when hasher is used
  • Revision merge/reinforce accumulates provenance entries without duplicates
  • grounding list unchanged (backward compatible)
  • PropositionDto exposes provenance (or documented alternative)
  • Tests cover extraction + revision paths
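For the "accumulates provenance entries without duplicates" criterion, the revision side could be as small as the sketch below. SketchEntry, SketchProposition and mergeProvenance are hypothetical names used only for illustration, and the sketch assumes that whole-entry equality is an acceptable dedup key; the real reviser might prefer to key on chunkId plus contentHash.

```kotlin
// Simplified stand-ins for illustration only; not the real DICE classes.
data class SketchEntry(val chunkId: String, val contentHash: String? = null)

data class SketchProposition(
    val text: String,
    val grounding: List<String>,
    val provenanceEntries: List<SketchEntry> = emptyList(),
)

// Unions provenance the same way grounding is unioned: keep first-seen order,
// drop exact duplicates (data class equality).
fun mergeProvenance(
    existing: SketchProposition,
    incoming: SketchProposition,
): SketchProposition =
    existing.copy(
        grounding = (existing.grounding + incoming.grounding).distinct(),
        provenanceEntries =
            (existing.provenanceEntries + incoming.provenanceEntries).distinct(),
    )
```

Because grounding is still computed exactly as a plain union of chunk IDs, existing consumers of the legacy list would see no change in this sketch.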

Related

  • Proposition provenance metadata #7 — extraction context metadata (speaker, turn, mode) — complementary, not duplicate
  • com.embabel.dice.provenance package (already implemented)
  • Specs/blog claim "full provenance tracking" — this closes the pipeline gap

Motivation

Building MCP tooling and knowledge-graph apps where users need to verify where a proposition came from, not just which chunk ID grounded it. Happy to contribute a PR if this direction looks right.
