GroundedDoc Agent

GroundedDoc answers questions over a collection of documents and backs every answer with citations you can check. It is built for situations where a wrong or unsupported answer is worse than no answer: it surfaces disagreements between sources instead of papering over them, and it refuses to answer when the corpus does not contain the evidence. The included privacy-policy use case puts this to work, reviewing arbitrary web pages against regulations like GDPR and PIPEDA.

In short, GroundedDoc is about trustworthy document answers. Three properties follow from that goal and shape the whole system:

Grounding. Every factual statement maps to a cited source passage; nothing is asserted without evidence.
Conflict awareness. When two sources disagree (for example, on a data retention period), the disagreement is reported rather than silently resolved.
Honest refusal. When the corpus cannot support an answer, the agent says so instead of guessing.

Architecture

A document corpus is ingested into hierarchical indexes (parent summaries, child chunks, a keyword index, and an extracted-claims store). At query time a router decides whether the question is in scope, a planner decomposes it, an adaptive retriever pulls evidence from the most relevant indexes, a synthesizer drafts a cited answer, and a verifier checks every citation before the answer is returned. Every run is traced in MLflow so quality can be measured and gated in CI.

See docs/architecture.md for the component breakdown and diagrams.

Quickstart

python -m venv .venv
source .venv/bin/activate
pip install .

# Build indexes from sample corpus
python -m grounded_doc_agent.ingestion.cli

# Run a query locally
python -c "from grounded_doc_agent.agents.pipeline import DocumentPipeline; print(DocumentPipeline().run('Compare GDPR vs PIPEDA retention').answer)"

# Run MLflow evaluation
python -m grounded_doc_agent.eval.run_eval --variant full_pipeline

# Compare baseline vs full pipeline
python -m grounded_doc_agent.eval.run_eval --compare-ab

Optional: ADK web UI

For interactive exploration there is an optional Agent Development Kit (ADK) front end in agents/grounded_doc/agent.py. It is a dev/demo surface only -- the product itself is the Python library (grounded_doc_agent/) and the API below, neither of which requires ADK.

# Prerequisites: pip install ., indexes built (ingestion CLI), Google API key set
export GOOGLE_API_KEY=your-key
adk web agents --port 8000
# Open http://localhost:8000 and select "grounded_doc"

API / Cloud Run

Run the FastAPI service locally:

uvicorn api.main:app --reload --port 8080

The service exposes endpoints for querying the corpus, analyzing a page against regulations (used by the browser extension), rebuilding indexes, and listing detected conflicts. See docs/api.md for the full endpoint reference, request/response shapes, auth, and curl examples.

Browser Extension

Load extension/ as an unpacked extension in Chrome:

Build indexes (python -m grounded_doc_agent.ingestion.cli)
Start the API (uvicorn api.main:app --reload --port 8080)
Open chrome://extensions, enable Developer mode, load unpacked, select extension/
Open a privacy policy page and click the extension icon

The popup extracts page text, calls POST /analyze, and renders cited findings. Results are informational only, not legal advice.

Docker:

docker build -f infra/Dockerfile -t grounded-doc-agent .
docker run -p 8080:8080 grounded-doc-agent

Deploy script (requires gcloud auth + secrets):

GCP_PROJECT_ID=your-project ./scripts/deploy_cloud_run.sh

MLflow Scorers

Each query in the golden dataset is scored on the properties the project cares about, so quality regressions show up as numbers rather than vibes:

retrieval_recall -- fraction of the expected source documents that were actually retrieved. Measures whether the right evidence was found.
citation_fidelity -- fraction of an answer's citations that point to passages that were genuinely retrieved. Catches fabricated or mismatched citations.
retrieval_strategy_match -- whether the planner chose the expected retrieval strategy (or a compatible one) for a given query type.
refusal_correctness -- whether the agent refused exactly when it should have: refusing unanswerable questions and answering answerable ones.
conflict_surfaced -- whether known cross-document conflicts are explicitly reported in the answer instead of being hidden.

CI fails if core metrics drop below thresholds (0.85 recall, 0.90 citation fidelity, 1.0 refusal correctness).

Why this runs for ~$0

GroundedDoc deliberately avoids paid, managed services so it can run on a laptop or a free cloud tier with no recurring bill:

Embeddings run locally with open sentence-transformers models, so there is no embedding API to pay for.
The vector store is a plain JSON file on disk, so no managed vector database is needed.
Answer synthesis is optional. Gemini Flash can be used if you provide an API key, but an extractive fallback produces cited answers with no key and no cost.
Experiment tracking uses a local SQLite file (mlflow.db) rather than a hosted MLflow server.
Hosting fits free tiers. Cloud Run plus GCS free allowances comfortably cover portfolio-level traffic.

Sample corpus

A small example corpus ships with the repo so you can build indexes and run queries immediately, and so the evaluation suite has known-good expected answers. It lives under data/corpus/:

gdpr_summary.md, pipeda_summary.md -- plain-language summaries of two privacy regulations.
acme_privacy_policy.md -- a fictional company privacy policy to analyze against those regulations.
regulatory/ -- structured GDPR requirement "cards" used for claim-level retrieval.
fastapi_current.md, fastapi_migration.md -- a non-privacy topic, included to show retrieval works across unrelated subject matter.

The corpus intentionally contains conflicting data-retention statements (for example, between the regulations and the sample policy) so the conflict-detection features have something real to surface.

Optional GCS Persistence

For Cloud Run deployments with dynamic policy ingest:

pip install '.[gcp]'
export GROUNDED_STORAGE_BACKEND=gcs
export GROUNDED_GCS_BUCKET=your-bucket

On startup the API syncs index/base/ from GCS. Ingested policies are stored under corpus/policies/{hash}/.

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
.github/workflows		.github/workflows
agents/grounded_doc		agents/grounded_doc
api		api
data/corpus		data/corpus
docs		docs
extension		extension
grounded_doc_agent		grounded_doc_agent
infra		infra
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GroundedDoc Agent

Architecture

Quickstart

Optional: ADK web UI

API / Cloud Run

Browser Extension

MLflow Scorers

Why this runs for ~$0

Sample corpus

Optional GCS Persistence

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GroundedDoc Agent

Architecture

Quickstart

Optional: ADK web UI

API / Cloud Run

Browser Extension

MLflow Scorers

Why this runs for ~$0

Sample corpus

Optional GCS Persistence

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages