Populate provenanceEntries during proposition extraction and revision

## Problem

DICE has a rich provenance model (`ProvenanceEntry`, `SourceLocator`) and a `Proposition.provenanceEntries` field, but the extraction pipeline never populates that field. Every proposition extracted via `PropositionPipeline` ends up with `provenanceEntries = []`; only the legacy `grounding: List<String>` chunk-ID list is set.

Downstream consumers (REST API, Neo4j projection, MCP tooling, provenance export) cannot link propositions back to source material via URI, file path, content hash, or character offsets — even though the data model supports it.

## Evidence

- `SuggestedProposition.toProposition()` sets `grounding` only — no `provenanceEntries`
- `LlmPropositionExtractor.resolvePropositions()` calls `toProposition()` without provenance wiring
- `LlmPropositionReviser.mergePropositions()` / `reinforceProposition()` merge `grounding` but not `provenanceEntries`
- `PropositionDto` (REST) exposes `grounding` but not `provenanceEntries`
- `Proposition` KDoc describes `provenanceEntries` as complementing the legacy grounding list

Unit tests exist for `ProvenanceEntry` / `SourceLocator` and for manual use of `withProvenanceEntries()`, but no integration test proves that extraction populates them.

## Proposed solution

1. **`ProvenanceFactory` (or similar)** — build `ProvenanceEntry` from `Chunk` + optional `SourceAnalysisContext` fields:
   - `chunkId` from chunk
   - `contentHash` from `ContentHasher` (already used in `processOnce`)
   - `SourceLocator` from chunk metadata / `sourceId` (e.g. `ConnectorRef` for `email:<threadId>`, `UriLocator` when URI known)

2. **Wire into extraction** — `SuggestedProposition.toProposition()` or `resolvePropositions()` attaches provenance entries alongside grounding

3. **Wire into revision** — `mergePropositions` / `reinforceProposition` union `provenanceEntries` (like grounding)

4. **Extend `SourceAnalysisContext`** (optional) — carry a default `SourceLocator` or provenance template for the current analysis run

5. **REST** — add `provenanceEntries` to `PropositionDto` (or a slim DTO)

6. **Tests** — pipeline integration test asserting extracted propositions have non-empty `provenanceEntries` when chunk metadata is present
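
A minimal sketch of what step 1 could look like, to make the proposal concrete. Everything here is an assumption for illustration: the `Chunk`, `ProvenanceEntry`, `UriLocator`, and `ConnectorRef` declarations are simplified stand-ins rather than the real classes in `com.embabel.dice.provenance`, the `uri` and `sourceId` metadata keys are hypothetical, and `sha256Hex` stands in for whatever `ContentHasher` actually computes.

```kotlin
import java.security.MessageDigest

// Stand-in types for illustration only; the real DICE classes may have different shapes.
sealed interface SourceLocator
data class UriLocator(val uri: String) : SourceLocator
data class ConnectorRef(val connector: String, val externalId: String) : SourceLocator

data class ProvenanceEntry(
    val chunkId: String,
    val contentHash: String? = null,
    val locator: SourceLocator? = null,
)

data class Chunk(
    val id: String,
    val text: String,
    val metadata: Map<String, String> = emptyMap(),
)

// Hypothetical hasher: SHA-256 of the chunk text as lowercase hex.
fun sha256Hex(text: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(text.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

object ProvenanceFactory {

    // Builds one entry per chunk. Locator preference: explicit URI, then a
    // "<connector>:<id>" sourceId, then the caller-supplied default (may be null).
    fun fromChunk(
        chunk: Chunk,
        withHash: Boolean = true,
        defaultLocator: SourceLocator? = null,
    ): ProvenanceEntry {
        val locator: SourceLocator? =
            chunk.metadata["uri"]?.let { UriLocator(it) }
                ?: chunk.metadata["sourceId"]?.let { parseSourceId(it) }
                ?: defaultLocator
        return ProvenanceEntry(
            chunkId = chunk.id,
            contentHash = if (withHash) sha256Hex(chunk.text) else null,
            locator = locator,
        )
    }

    // Splits "email:<threadId>" style IDs; returns null when either side of ':' is empty.
    private fun parseSourceId(sourceId: String): SourceLocator? {
        val idx = sourceId.indexOf(':')
        if (idx <= 0 || idx == sourceId.length - 1) return null
        return ConnectorRef(sourceId.substring(0, idx), sourceId.substring(idx + 1))
    }
}
```

With this shape, `ProvenanceFactory.fromChunk(Chunk("c1", "hello", mapOf("sourceId" to "email:thread-42")))` would yield an entry carrying `chunkId = "c1"`, a 64-character hex hash, and a `ConnectorRef("email", "thread-42")` locator. The `defaultLocator` parameter is where a value carried on `SourceAnalysisContext` (step 4) could plug in.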

## Acceptance criteria

- [ ] Extracted propositions include at least one `ProvenanceEntry` with `chunkId` when processed via `processChunk`
- [ ] Entries produced via `processOnce` include `contentHash` when a hasher is used
- [ ] Revision merge/reinforce accumulates provenance entries without duplicates
- [ ] `grounding` list unchanged (backward compatible)
- [ ] `PropositionDto` exposes provenance (or documented alternative)
- [ ] Tests cover extraction + revision paths
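
As a rough illustration of the "accumulates without duplicates" criterion, the merge could union both lists and rely on data-class equality to drop repeats. The `Proposition` and `ProvenanceEntry` declarations below are simplified stand-ins, and `mergeProvenance` is a hypothetical helper name, not an existing DICE function.

```kotlin
// Stand-in types for illustration only; the real DICE classes have more fields.
data class ProvenanceEntry(val chunkId: String, val contentHash: String? = null)

data class Proposition(
    val text: String,
    val grounding: List<String>,
    val provenanceEntries: List<ProvenanceEntry> = emptyList(),
)

// Unions grounding and provenance from both propositions. distinct() keeps the
// first occurrence of each element, so existing entries retain their order.
fun mergeProvenance(existing: Proposition, incoming: Proposition): Proposition =
    existing.copy(
        grounding = (existing.grounding + incoming.grounding).distinct(),
        provenanceEntries =
            (existing.provenanceEntries + incoming.provenanceEntries).distinct(),
    )
```

Structural equality means two entries count as duplicates only when every field matches. If the real `ProvenanceEntry` carries timestamps or other per-run fields, deduplication would probably need an explicit key such as `chunkId` plus `contentHash` instead.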

## Related

- #7 — extraction *context* metadata (speaker, turn, mode) — complementary, not duplicate
- `com.embabel.dice.provenance` package (already implemented)
- Specs/blog claim "full provenance tracking" — this closes the pipeline gap

## Motivation

Building MCP tooling and knowledge-graph apps where users need to verify *where* a proposition came from, not just which chunk ID grounded it. Happy to contribute a PR if this direction looks right.

