Merged
22 commits
852f76e
feat(projection): graph projection observability, lineage, and edge a…
jimador Jun 11, 2026
7847839
feat(report): rationale, structured report, and surprising-link proje…
jimador Jun 11, 2026
0f7b6cd
feat(store): repository-backed reconciliation, Neo4j adapter, and ing…
jimador Jun 11, 2026
68ae29a
feat(query): graph query surface and agent tools
jimador Jun 11, 2026
8355933
feat(discovery): retrieval-mode router with discovery REST and agent …
jimador Jun 11, 2026
7bd0daf
refactor: extract dice-report, dice-ingestion, and dice-integration-t…
jimador Jun 19, 2026
7622dc7
build: add dice-ingestion and dice-report to dependencyManagement for…
jimador Jun 19, 2026
0e9e76e
feat(storage): durable Neo4j projection and collector record stores
jimador Jun 19, 2026
87eaab9
docs: design notes and AGENTS.md for graph projection and retrieval
jimador Jun 19, 2026
3dadf48
feat(observability): add debug/trace logging across graph, retrieval,…
jimador Jun 22, 2026
b2722f1
refactor(spi): point graph, retrieval, and ingestion code at com.emba…
jimador Jun 22, 2026
11a17d8
docs: list the graph and discovery query packages in the navigation g…
jimador Jun 22, 2026
0fce364
fix: import the sweep-policy cluster from com.embabel.dice.spi
jimador Jun 23, 2026
d4ebcbf
docs: durable-storage design note and the autoconfigure module guide
jimador Jun 23, 2026
8302532
fix: address adversarial-review findings on graph + retrieval
jimador Jun 24, 2026
caf1836
fix: scope lineage queries, correct projection lineage, and harden th…
jimador Jun 24, 2026
9c3f445
docs: dense, diagram-heavy architecture notes and navigation fixes
jimador Jun 24, 2026
dc0dff2
fix: scope projection-health to its context and never load-all for a …
jimador Jun 24, 2026
3cd8523
feat: populate provenanceEntries during extraction and revision (#32)
jimador Jun 25, 2026
c81ded7
fix(graph): resolve PR 49 review findings
jimador Jun 25, 2026
5d19071
chore: remove ignored codex issue log
jimador Jun 25, 2026
08265fe
fix(graph): harden persistence-failure attribution and close review t…
jimador Jun 26, 2026
9 changes: 6 additions & 3 deletions AGENTS.md
@@ -6,9 +6,12 @@ DICE (Domain-Integrated Context Engineering) is a proposition-first knowledge su

| Module | What it owns |
|---|---|
| `dice` | The entire domain: `Proposition` model, `PropositionStore`/`PropositionRepository` SPIs, extraction pipeline, revision/conflict detection, entity resolution, projectors (graph, Prolog, memory), incremental analysis, in-memory and file-backed stores, tuProlog integration, REST endpoints |
| `dice` | The entire domain: `Proposition` model, `PropositionStore`/`PropositionRepository` SPIs, extraction pipeline, revision/conflict detection, entity resolution, projectors (graph, Prolog, memory), graph and discovery query/retrieval, incremental analysis, in-memory and file-backed stores, tuProlog integration, REST endpoints |
| `dice-storage` | Drivine/Neo4j implementation of `PropositionRepository`, `ChunkHistoryStore`, and `DecayManager`; uses Kotlin 2.2 for the Drivine KSP-generated query DSL |
| `dice-storage-autoconfigure` | Spring Boot auto-configuration that wires the right backend based on `embabel.dice.store.type` and schedules the decay tick |
| `dice-report` | Output projectors over propositions: rationale (why a fact is believed, with evidence), structured report, and surprising-link discovery |
| `dice-ingestion` | Ingestion SPI (artifacts → chunks) with a content-hash dedup ledger so the same source isn't extracted twice |
| `dice-integration-tests` | Test-only: the cross-feature end-to-end canonical-flow harness |

## Build & test

@@ -63,7 +66,7 @@ The `dice` module is organized by responsibility:

## Conventions

**Composable store SPI.** `PropositionStore` is the base port: just CRUD and a composable query. `PropositionRepository` extends it with optional capability interfaces (`VectorSearchCapable`, `GraphTraversalCapable`, `TemporalQueryCapable`). A backend only has to implement what it genuinely supports. The default implementations on those interfaces express each operation over the primitives so every backend gets safe fallback behavior for free.
**Composable store SPI.** `PropositionStore` is the base port: just CRUD and a composable query. `PropositionRepository` extends it with optional capability interfaces (`VectorSearchCapable`, `GraphTraversalCapable`, `TemporalQueryCapable`, `GraphQueryCapable`). A backend only has to implement what it genuinely supports. `GraphQueryCapable` provides native neighbourhood, path, and lineage queries over the entity-relationship graph, plus the `honorsAuthorityFilter` opt-in that lets the portable graph facade route authority-filtered traversals down to the native backend. The default implementations on those interfaces express each operation over the primitives so every backend gets safe fallback behavior for free.
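
The capability pattern described here can be sketched roughly as follows. This is a minimal illustration, not the DICE interfaces: `Fact`, `BaseStore`, `TextSearchCapable`, and `InMemoryStore` are hypothetical names chosen for the example. It shows an optional capability whose default body is written over the base query primitive, so a backend that implements only the primitives still gets working fallback behavior.

```kotlin
// Hypothetical types for illustration only; these are not the DICE interfaces.
data class Fact(val id: String, val contextId: String, val text: String)

// Base port: save plus a composable query primitive.
interface BaseStore {
    fun save(fact: Fact)
    fun query(contextId: String, predicate: (Fact) -> Boolean = { true }): List<Fact>
}

// Optional capability: the default body is expressed over the base primitive,
// so a backend that cannot do better still behaves correctly.
interface TextSearchCapable : BaseStore {
    fun search(contextId: String, term: String): List<Fact> =
        query(contextId) { it.text.contains(term, ignoreCase = true) }
}

// A backend that implements only the primitives and inherits the fallback.
class InMemoryStore : TextSearchCapable {
    private val facts = mutableListOf<Fact>()

    override fun save(fact: Fact) {
        facts += fact
    }

    override fun query(contextId: String, predicate: (Fact) -> Boolean): List<Fact> =
        facts.filter { it.contextId == contextId && predicate(it) }
}
```

A backend with a native index would override `search` and leave callers unchanged, which is the point of keeping the capability separate from the base port.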

**`ContextId` is the primary scope.** Every proposition belongs to a `ContextId`. Always start queries with `PropositionQuery.forContextId(...)` or `PropositionQuery.againstContext(...)` — there is no `create()` factory by design, to prevent accidentally loading all propositions.
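
One way to get the "no unscoped query" guarantee is a private constructor with scoped factories only. The sketch below assumes that approach; `ScopedQuery` and its `minConfidence` field are hypothetical, and the real `PropositionQuery` may be built differently.

```kotlin
// Hypothetical sketch; the real PropositionQuery has more fields and factories.
data class ContextId(val value: String)

class ScopedQuery private constructor(
    val contextId: ContextId,
    val minConfidence: Double,
) {
    // Refinements return a new query and can never drop the scope.
    fun withMinConfidence(min: Double): ScopedQuery = ScopedQuery(contextId, min)

    companion object {
        // The only way in: the scope is required up front.
        fun forContextId(contextId: ContextId): ScopedQuery = ScopedQuery(contextId, 0.0)
    }
}
```

Because the constructor is private and no zero-argument factory exists, a caller cannot construct a query that matches every proposition in the store.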

@@ -85,4 +88,4 @@ The `dice` module is organized by responsibility:
- **Tuning what gets into the store** → admission gates in `com.embabel.dice.proposition.gate` (`ExtractionGatePipeline`, `StandardGates`); they run on pipeline output before the caller persists.
- **Running maintenance / consolidation** → `DefaultDreamLoopOrchestrator` (threshold-gated consolidation passes) or `DefaultMemoryMaintenanceOrchestrator` (the legacy four-step pipeline), both in `com.embabel.dice.projection.memory`.
- **Reclaiming stale or duplicate propositions** → `DefaultCollectorRunner` and its `CollectorStrategy` in `com.embabel.dice.projection.memory` (the `SweepPolicy` that decides each fate lives in `com.embabel.dice.spi`); runs are auditable via `CollectorRecordStore`.
- **Understanding *why* the system behaves as it does** → [`docs/design/`](docs/design/) holds the design-decision notes — the conceptual model and the reasoning you can't recover by reading a class: the extraction pipeline, the proposition lifecycle (trust, authority, supersession, decay), knowledge hygiene (gates, reclamation, consolidation), and the event model.
- **Understanding *why* the system behaves as it does** → [`docs/design/`](docs/design/) holds the design-decision notes — start with [`docs/design/architecture.md`](docs/design/architecture.md) for a system-level map and then follow to: the extraction pipeline, the proposition lifecycle (trust, authority, supersession, decay, pinning), knowledge hygiene (gates, reclamation, consolidation), graph projection, retrieval and discovery, durable storage (backends, dedup, the decay tick), and the event model.
6 changes: 6 additions & 0 deletions README.md
@@ -106,6 +106,12 @@ recover by reading a single class — see the design notes in [`docs/design/`](d
abstraction, the four consolidation passes, and how a cycle composes and is triggered.
- [Reclamation and the collector](docs/design/reclamation-and-collector.md) — the mark-and-sweep
internals: strategies, sweep policy, dry-run vs. live, and the audit trail.
- [Graph projection](docs/design/graph-projection.md) — projecting propositions into a typed graph:
edge lineage, projection outcomes, the stale-cascade on source change, and idempotent reconciliation.
- [Retrieval and discovery](docs/design/retrieval-and-discovery.md) — store-agnostic graph queries,
query-time authority filtering, the single retrieval router, and serendipitous link discovery.
- [Durable storage](docs/design/durable-storage.md) — backend selection, defense-in-depth dedup,
two-phase save, materialised effective confidence, schema-as-beans, and the decay tick.
- [Events](docs/design/events.md) — the domain-event model the store and pipeline emit.

## Real-World Example: Impromptu
47 changes: 47 additions & 0 deletions dice-ingestion/AGENTS.md
@@ -0,0 +1,47 @@
# dice-ingestion

This module is the front door for getting source material into DICE. It defines the SPI for turning normalized text into propositions and ships a content-hash deduplication ledger so the same source isn't extracted twice.

Core design decision: this module never parses. External adapters (document connectors, web scrapers, etc.) extract text into an `IngestedArtifact` before calling in — core receives pre-extracted text only.

## What's here

**The handoff types**

- `IngestedArtifact` — a normalized unit of source material: `sourceId` (stable dedup key, must not be blank), a `SourceLocator` for provenance, `text` (pre-extracted, must not be blank), an optional `contentHash` (caller-supplied dedup key; computed by the handler when absent), a `trust: AuthorityTier` (defaults to UNKNOWN), and optional timestamps. Has a Java-friendly fluent builder: `IngestedArtifact.withSourceId("…").withLocator(…).withText("…")`.
- `IngestionBatch` — a list of `IngestedArtifact`s submitted together. The primary handoff surface; single-artifact ingestion is a convenience that wraps in a one-element batch. Factory: `IngestionBatch.of(vararg artifacts)`.

**The SPI**

- `IngestionHandler` — interface with one abstract method: `ingest(batch: IngestionBatch, context: SourceAnalysisContext): IngestionResult`. The single-artifact overload is a default method that delegates to the batch path. Adapters implement this or delegate to `TextIngestionHandler`.

**The result types**

- `IngestionResult` — wraps `List<ArtifactOutcome>`. The `propositions` property collects the propositions from all `Ingested` outcomes into a single list (unsaved; persistence is the caller's concern).
- `ArtifactOutcome` — sealed interface with three variants:
  - `Ingested(sourceId, propositions)` — newly extracted; carries the unsaved propositions.
  - `Deduplicated(sourceId, contentHash)` — content hash already seen; no extraction ran.
  - `Failed(sourceId, cause)` — extraction failed; the rest of the batch is unaffected.
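
The shape of the result types can be sketched as below. These are simplified stand-ins: the variant names and fields follow the bullets above, but propositions are plain strings here, and the `outcomes` property name and the `describe` helper are assumptions made for the example.

```kotlin
// Simplified stand-ins: the real variants carry Proposition objects, not strings.
sealed interface ArtifactOutcome {
    val sourceId: String

    data class Ingested(override val sourceId: String, val propositions: List<String>) : ArtifactOutcome
    data class Deduplicated(override val sourceId: String, val contentHash: String) : ArtifactOutcome
    data class Failed(override val sourceId: String, val cause: Throwable) : ArtifactOutcome
}

data class IngestionResult(val outcomes: List<ArtifactOutcome>) {
    // Only Ingested outcomes contribute propositions.
    val propositions: List<String>
        get() = outcomes.filterIsInstance<ArtifactOutcome.Ingested>().flatMap { it.propositions }
}

// Exhaustive `when`: the compiler flags a missing branch if a variant is added.
fun describe(outcome: ArtifactOutcome): String = when (outcome) {
    is ArtifactOutcome.Ingested -> "${outcome.sourceId}: ${outcome.propositions.size} new"
    is ArtifactOutcome.Deduplicated -> "${outcome.sourceId}: duplicate of ${outcome.contentHash}"
    is ArtifactOutcome.Failed -> "${outcome.sourceId}: failed (${outcome.cause.message})"
}
```

Modelling the outcome as a sealed interface keeps per-artifact failures as data, so one bad artifact shows up as a `Failed` entry instead of aborting the batch.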

**Deduplication ledger**

- `IngestionLedger` — interface: `seen(hash)`, `record(hash)`, `forget(hash)`, and `recordIfAbsent(hash)` (atomic check-and-claim; the default is non-atomic, override for concurrent use).
- `InMemoryIngestionLedger` — ships as the default. Backed by a `ConcurrentHashMap` key set. `recordIfAbsent` is atomic because it is a single `add` on that key set. Survives only the process lifetime; supply a durable implementation for cross-session dedup.
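
A minimal sketch of the ledger contract, assuming `Boolean` return types (the bullets name the methods but not their signatures). `InMemoryLedger` is a hypothetical stand-in for the shipped class. It contrasts the racy check-then-act default with an atomic override on a concurrent key set.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Method names follow the bullets above; return types are assumptions.
interface IngestionLedger {
    fun seen(hash: String): Boolean
    fun record(hash: String)
    fun forget(hash: String)

    // Default is check-then-act and therefore racy under concurrency.
    // Returns true when this call claimed the hash.
    fun recordIfAbsent(hash: String): Boolean {
        if (seen(hash)) return false
        record(hash)
        return true
    }
}

class InMemoryLedger : IngestionLedger {
    private val hashes: MutableSet<String> = ConcurrentHashMap.newKeySet<String>()

    override fun seen(hash: String): Boolean = hash in hashes
    override fun record(hash: String) { hashes.add(hash) }
    override fun forget(hash: String) { hashes.remove(hash) }

    // add() on a concurrent key set is a single atomic check-and-insert.
    override fun recordIfAbsent(hash: String): Boolean = hashes.add(hash)
}
```

A durable ledger would need the same property from its backing store, for example a unique-key insert that reports whether the row was new.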

**Shipped handler**

- `TextIngestionHandler` (`support/` subpackage) — the one shipped `IngestionHandler`. For each artifact it: (1) resolves the content hash (caller-supplied or SHA-256 of text), (2) atomically claims it via `ledger.recordIfAbsent` — short-circuits to `Deduplicated` if already seen, (3) bridges text to a `Chunk`, runs the `PropositionPipeline`, (4) stamps each returned proposition with a `ProvenanceEntry` carrying the artifact's locator. Failures release the claimed hash via `ledger.forget` so retries are not wrongly deduplicated. Processes a batch sequentially — intra-batch dedup relies on that ordering.
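
The claim-then-extract flow can be sketched as follows. This is an illustration under stated assumptions, not the shipped handler: `Outcome`, `sha256Hex`, and `ingestOne` are hypothetical names, a plain `MutableSet` plays the role of the ledger, the `extract` lambda stands in for the proposition pipeline, and step (4), provenance stamping, is omitted.

```kotlin
import java.security.MessageDigest

// Hypothetical stand-ins; `extract` plays the role of the proposition pipeline.
sealed interface Outcome {
    data class Ingested(val sourceId: String, val propositions: List<String>) : Outcome
    data class Deduplicated(val sourceId: String, val contentHash: String) : Outcome
    data class Failed(val sourceId: String, val cause: Exception) : Outcome
}

// Lowercase hex SHA-256 of the UTF-8 bytes of the text.
fun sha256Hex(text: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(text.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

fun ingestOne(
    sourceId: String,
    text: String,
    suppliedHash: String?,
    claimed: MutableSet<String>, // plays the role of the ledger
    extract: (String) -> List<String>,
): Outcome {
    // (1) Resolve the hash: caller-supplied wins, otherwise hash the text.
    val hash = suppliedHash ?: sha256Hex(text)
    // (2) Claim it, or short-circuit if it was already claimed.
    if (!claimed.add(hash)) return Outcome.Deduplicated(sourceId, hash)
    return try {
        // (3) Run extraction on the text.
        Outcome.Ingested(sourceId, extract(text))
    } catch (e: Exception) {
        // Release the claim so a retry is not wrongly deduplicated.
        claimed.remove(hash)
        Outcome.Failed(sourceId, e)
    }
}
```

Claiming before extracting is what makes intra-batch dedup work when artifacts are processed in order; releasing on failure keeps a transient error from permanently blocking that content.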

## Dependencies

- `dice` (core) — `Proposition`, `PropositionPipeline`, `SourceAnalysisContext`, `ProvenanceEntry`, `SourceLocator`, `AuthorityTier`.
- `embabel-agent-api` (provided) — `Chunk`, agent API types.
- `embabel-agent-rag-core` (provided) — `Retrievable`, supertype of `Proposition`.

## Gotchas

- Adapters must extract text before constructing `IngestedArtifact` — core never parses native formats.
- The default `InMemoryIngestionLedger` is process-scoped. Restart the process and it forgets everything; any re-submitted content will be re-extracted. Wire a durable ledger to prevent that.
- `TextIngestionHandler` processes batches sequentially. A parallel handler must supply its own atomic deduplication rather than relying on processing order.
- Propositions returned in `IngestionResult.propositions` are not yet persisted. The caller is responsible for saving them.
- `contentHash` in `IngestedArtifact` is caller-asserted, not verified. Pass a stable, content-derived hash; an unstable or wrong hash defeats deduplication.
63 changes: 63 additions & 0 deletions dice-ingestion/pom.xml
@@ -0,0 +1,63 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>com.embabel.dice</groupId>
        <artifactId>dice-parent</artifactId>
        <version>0.1.0-SNAPSHOT</version>
    </parent>
    <artifactId>dice-ingestion</artifactId>
    <packaging>jar</packaging>
    <name>Dice Ingestion</name>
    <description>Artifact ingestion SPI and handlers for DICE knowledge ingestion</description>

    <dependencies>
        <!-- Dice core: Proposition model and repository SPI -->
        <dependency>
            <groupId>com.embabel.dice</groupId>
            <artifactId>dice</artifactId>
        </dependency>

        <!-- Embabel agent types; provided, supplied by the consuming app -->
        <dependency>
            <groupId>com.embabel.agent</groupId>
            <artifactId>embabel-agent-api</artifactId>
            <scope>provided</scope>
        </dependency>
        <!-- Retrievable (supertype of Proposition) lives in rag-core; provided, non-transitive from dice -->
        <dependency>
            <groupId>com.embabel.agent</groupId>
            <artifactId>embabel-agent-rag-core</artifactId>
            <scope>provided</scope>
        </dependency>

        <!-- Test -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.jetbrains.kotlin</groupId>
            <artifactId>kotlin-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <!-- Default interface methods available for Java consumers -->
                <groupId>org.jetbrains.kotlin</groupId>
                <artifactId>kotlin-maven-plugin</artifactId>
                <configuration>
                    <args>
                        <arg>-Xjvm-default=all</arg>
                    </args>
                </configuration>
            </plugin>
        </plugins>
    </build>

</project>
@@ -0,0 +1,116 @@
/*
 * Copyright 2024-2026 Embabel Pty Ltd.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.embabel.dice.ingestion

import com.embabel.dice.spi.AuthorityTier
import com.embabel.dice.provenance.SourceLocator
import java.time.Instant

/**
 * A normalized unit of source material handed to DICE at the front door.
 *
 * Adapters parse their native formats (documents, web pages, connector payloads)
 * into already-extracted [text] *before* constructing an artifact — core never
 * parses. The artifact carries the source identity, a [SourceLocator] for
 * provenance, an optional caller-supplied [contentHash] used as the
 * deduplication key, trust metadata, and optional timestamps.
 *
 * The [locator] and [trust] fields are caller-asserted claims about the source,
 * not proofs DICE can independently verify; downstream authority resolution
 * derives tiers structurally from the locator kind.
 *
 * @property sourceId Stable source key used as the chunk parent identity and the
 * per-artifact deduplication record key. Must not be blank.
 * @property locator Provenance reference describing where the material lives.
 * @property text Already-extracted text content. Must not be blank.
 * @property contentHash Optional caller-supplied deduplication key. When present
 * it is authoritative for dedup; when absent the consuming handler computes one.
 * @property trust Caller-asserted authority of the source; defaults to
 * [AuthorityTier.UNKNOWN], the fail-safe lowest authority.
 * @property createdAt Optional timestamp for when the source material was created.
 * @property ingestedAt Optional timestamp for when the material was ingested.
 */
data class IngestedArtifact @JvmOverloads constructor(
    val sourceId: String,
    val locator: SourceLocator,
    val text: String,
    val contentHash: String? = null,
    val trust: AuthorityTier = AuthorityTier.UNKNOWN,
    val createdAt: Instant? = null,
    val ingestedAt: Instant? = null,
) {

    init {
        require(sourceId.isNotBlank()) { "sourceId must not be blank" }
        require(text.isNotBlank()) { "text must not be blank" }
        require(contentHash == null || contentHash.isNotBlank()) { "contentHash must not be blank when present" }
    }

    companion object {
        /**
         * Start building an artifact with its source identity.
         * Entry point for the strongly-typed builder used from Java:
         *
         * ```java
         * IngestedArtifact artifact = IngestedArtifact
         *     .withSourceId("doc-1")
         *     .withLocator(new UriLocator("https://example.com/doc"))
         *     .withText("extracted text")
         *     .withTrust(AuthorityTier.SECONDARY); // optional
         * ```
         *
         * @param sourceId The stable source key for this artifact
         * @return Builder step requiring a locator
         */
        @JvmStatic
        fun withSourceId(sourceId: String): WithSourceId = WithSourceId(sourceId)
    }

    /** Builder step: has source id, needs a locator. */
    class WithSourceId internal constructor(private val sourceId: String) {
        /**
         * Set the provenance locator for the source material.
         * @param locator The locator referencing where the material lives
         * @return Builder step requiring text
         */
        fun withLocator(locator: SourceLocator): WithLocator = WithLocator(sourceId, locator)
    }

    /** Builder step: has source id and locator, needs text; yields a complete artifact. */
    class WithLocator internal constructor(
        private val sourceId: String,
        private val locator: SourceLocator,
    ) {
        /**
         * Set the already-extracted text, completing a minimal artifact.
         * @param text The extracted text content
         * @return A complete [IngestedArtifact]
         */
        fun withText(text: String): IngestedArtifact =
            IngestedArtifact(sourceId = sourceId, locator = locator, text = text)
    }

    /** Returns a copy with the deduplication content hash set. */
    fun withContentHash(contentHash: String): IngestedArtifact = copy(contentHash = contentHash)

    /** Returns a copy with the trust tier set. */
    fun withTrust(trust: AuthorityTier): IngestedArtifact = copy(trust = trust)

    /** Returns a copy with both timestamps replaced; an omitted argument resets that timestamp to null. */
    @JvmOverloads
    fun withTimestamps(createdAt: Instant? = null, ingestedAt: Instant? = null): IngestedArtifact =
        copy(createdAt = createdAt, ingestedAt = ingestedAt)
}
@@ -0,0 +1,41 @@
/*
 * Copyright 2024-2026 Embabel Pty Ltd.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.embabel.dice.ingestion

/**
 * A group of [IngestedArtifact]s submitted together through the ingestion handoff.
 *
 * The batch is the primary public handoff surface: handlers process artifacts as
 * a batch, isolating per-artifact failures rather than aborting the whole group.
 * Single-artifact ingestion is a convenience that wraps one artifact in a batch.
 *
 * @property artifacts The artifacts to ingest, in submission order.
 */
data class IngestionBatch @JvmOverloads constructor(
    val artifacts: List<IngestedArtifact> = emptyList(),
) {

    companion object {
        /**
         * Build a batch from the given artifacts.
         * @param artifacts The artifacts to include
         * @return An [IngestionBatch] over those artifacts
         */
        @JvmStatic
        fun of(vararg artifacts: IngestedArtifact): IngestionBatch =
            IngestionBatch(artifacts.toList())
    }
}