DocIntel — AI Document Intelligence System

title: DocIntel
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false

A RAG (Retrieval-Augmented Generation) pipeline that lets you upload documents and ask natural-language questions about them. Answers are grounded in your documents rather than the internet, with source citations down to the page number.

Demo

Live demo: https://huggingface.co/spaces/hejun123/docintel

Upload a PDF → Ask a question → Get a grounded answer with page citations.

How it works

Document → Extract text → Chunk (512 chars, 100 overlap)
        → Embed (all-MiniLM-L6-v2) → Store in ChromaDB

Question → Embed → Retrieve top 20 candidates from ChromaDB
        → Re-rank with cross-encoder (ms-marco-MiniLM-L-6-v2)
        → Keep top 3 → Generate grounded answer via LLM

The two-stage retrieval is the key engineering decision: a bi-encoder (fast, approximate) fetches 20 candidates, then a cross-encoder (slower, precise) re-ranks them by scoring the question and each chunk jointly. This promotes relevant chunks that vector similarity alone would have ranked outside the top 3.
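The two-stage flow described above can be sketched in plain Python. This is a minimal illustration, not the code in retriever.py: the function name `two_stage_retrieve` and the two toy scorers are hypothetical stand-ins, with word overlap playing the role of the bi-encoder and bigram overlap playing the role of the cross-encoder.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes the question and a batch of chunks and returns one score per chunk.
Scorer = Callable[[str, Sequence[str]], List[float]]


def two_stage_retrieve(
    question: str,
    chunks: Sequence[str],
    fast_score: Scorer,
    precise_score: Scorer,
    n_candidates: int = 20,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Fast-score every chunk, then re-rank only the best n_candidates."""
    # Stage 1: cheap score for all chunks, keep the n_candidates best.
    fast = fast_score(question, chunks)
    order = sorted(range(len(chunks)), key=lambda i: fast[i], reverse=True)
    candidates = [chunks[i] for i in order[:n_candidates]]

    # Stage 2: expensive score for the candidates only, keep the top_k best.
    precise = precise_score(question, candidates)
    reranked = sorted(zip(candidates, precise), key=lambda p: p[1], reverse=True)
    return reranked[:top_k]


def word_overlap(question: str, chunks: Sequence[str]) -> List[float]:
    """Toy stand-in for bi-encoder similarity: number of shared words."""
    q = set(question.lower().split())
    return [float(len(q & set(c.lower().split()))) for c in chunks]


def bigram_overlap(question: str, chunks: Sequence[str]) -> List[float]:
    """Toy stand-in for a cross-encoder: number of shared word pairs."""
    def bigrams(text: str) -> set:
        words = text.lower().split()
        return set(zip(words, words[1:]))

    q = bigrams(question)
    return [float(len(q & bigrams(c))) for c in chunks]


chunks = [
    "the size of the model is what matters",
    "chunk size is 512 characters",
    "flask serves the frontend",
]
question = "what is the chunk size"

# Using the fast scorer for both stages shows what stage 1 alone would return.
fast_only = two_stage_retrieve(question, chunks, word_overlap, word_overlap,
                               n_candidates=2, top_k=1)
reranked = two_stage_retrieve(question, chunks, word_overlap, bigram_overlap,
                              n_candidates=2, top_k=1)
print(fast_only[0][0])
print(reranked[0][0])
```

In the real pipeline, stage 1 would presumably be a ChromaDB similarity query and stage 2 a call such as `CrossEncoder.predict` from sentence-transformers on (question, chunk) pairs. The toy scorers only keep the sketch runnable without downloading models.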

Features

Upload PDF, DOCX, TXT, and Markdown files
Two-stage retrieval: bi-encoder + cross-encoder re-ranking
Grounded answers with page-level source citations
Persistent document library across server restarts
Delete documents (removes chunks from vector store)
Relevance threshold: answers "I don't know" when no retrieved chunk is relevant enough, rather than hallucinating
Clean two-panel UI: document manager + chat interface
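The relevance threshold in the feature list can be pictured as a gate in front of the LLM call. The sketch below is an assumption about how such a gate might look, not the project's code: the names `answer_or_abstain` and `FALLBACK_ANSWER`, the fallback wording, and the threshold value are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical wording; the README only says the system answers "I don't know".
FALLBACK_ANSWER = "I don't know."


def answer_or_abstain(
    scored_chunks: Sequence[Tuple[str, float]],
    threshold: float,
    generate: Callable[[List[str]], str],
) -> str:
    """Call the generator only if at least one chunk clears the threshold."""
    context = [chunk for chunk, score in scored_chunks if score >= threshold]
    if not context:
        # Nothing relevant enough was retrieved, so abstain instead of guessing.
        return FALLBACK_ANSWER
    return generate(context)


def fake_llm(context: List[str]) -> str:
    """Stand-in for the real LLM call; reports how much context it received."""
    return f"Answer grounded in {len(context)} chunk(s)"


relevant = [("chunk size is 512 characters", 4.2), ("flask serves the frontend", -3.0)]
irrelevant = [("flask serves the frontend", -3.0)]

# The threshold of 0.0 is illustrative; the real value would live in config.py.
print(answer_or_abstain(relevant, threshold=0.0, generate=fake_llm))
print(answer_or_abstain(irrelevant, threshold=0.0, generate=fake_llm))
```

Filtering before generation means the LLM never sees low-scoring chunks, so it has nothing misleading to build an answer from.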

Tech stack

| Layer | Technology | Why |
| --- | --- | --- |
| Backend | Python + Flask | Lightweight, fast to iterate |
| PDF parsing | PyMuPDF | Handles messy PDFs better than PyPDF2 |
| Text chunking | LangChain RecursiveCharacterTextSplitter | Respects paragraph/sentence boundaries |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) | Free, runs locally, 384-dim vectors |
| Re-ranking | sentence-transformers (ms-marco-MiniLM-L-6-v2) | Cross-encoder, significantly better precision |
| Vector database | ChromaDB | Local, persistent, no cloud account needed |
| LLM | OpenRouter (any free model) | Flexible model selection, free tier available |
| Frontend | HTML / CSS / Vanilla JS | No framework overhead for this scope |

Project structure

docintel/
├── app.py          # Flask routes: /upload, /ask, /documents, /document/<name>
├── ingest.py       # Extract → chunk → embed → store pipeline
├── retriever.py    # Two-stage retrieval: bi-encoder + cross-encoder re-ranking
├── generator.py    # Prompt construction + LLM answer generation via OpenRouter
├── config.py       # Model names, chunk parameters, thresholds
├── requirements.txt
├── templates/
│   └── index.html
└── static/
    ├── style.css
    └── app.js

Setup

1. Clone and install dependencies

git clone https://github.com/hejun789/docintel.git
cd docintel
pip install -r requirements.txt

2. Create a .env file

OPENROUTER_API_KEY=your_openrouter_key_here
OPENROUTER_MODEL=nvidia/nemotron-3-super-120b-a12b:free

Get a free API key at openrouter.ai. Any model listed as free works.

3. Run

python app.py

Open http://127.0.0.1:5000 in your browser.

API endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | `/` | Frontend UI |
| POST | `/upload` | Upload and ingest a document |
| POST | `/ask` | Ask a question, returns answer + sources |
| GET | `/documents` | List all ingested documents |
| DELETE | `/document/<filename>` | Remove a document and its chunks |
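As a rough guide to calling these endpoints from a script, the sketch below builds the requests with the standard library. The request formats are not documented here, so the JSON body and its `question` field are assumptions, and the helper names are hypothetical. `/upload` is left out because its form field name is not specified.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://127.0.0.1:5000"  # local address from the Setup section


def ask_request(question: str, base_url: str = BASE_URL) -> Request:
    """Build a POST /ask request. The JSON field name is an assumption."""
    body = json.dumps({"question": question}).encode("utf-8")
    return Request(
        f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def list_documents_request(base_url: str = BASE_URL) -> Request:
    """Build a GET /documents request."""
    return Request(f"{base_url}/documents", method="GET")


def delete_document_request(filename: str, base_url: str = BASE_URL) -> Request:
    """Build a DELETE /document/<filename> request with the name URL-encoded."""
    return Request(f"{base_url}/document/{quote(filename, safe='')}", method="DELETE")


req = delete_document_request("my report.pdf")
print(req.get_method(), req.full_url)
```

With the server running, each request can be sent with `urllib.request.urlopen(req)`. Check app.py for the actual field names before relying on this.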

Key design decisions

Why chunk overlap? If an answer spans a chunk boundary, the 100-character overlap makes it likely that the complete sentence still appears in at least one chunk. Without overlap, split sentences produce incomplete, confusing context for the LLM.
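The effect of overlap is easy to see with a fixed-size sliding window. The sketch below is a simplification: the project uses LangChain's RecursiveCharacterTextSplitter, which also respects paragraph and sentence boundaries, and the function name `chunk_with_overlap` is hypothetical. The small sizes in the demo are chosen only to keep the example readable.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 100) -> List[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sentence = "The answer is forty-two."
text = "A" * 40 + sentence + "B" * 40  # the sentence straddles position 50

without_overlap = chunk_with_overlap(text, chunk_size=50, overlap=0)
with_overlap = chunk_with_overlap(text, chunk_size=50, overlap=30)

print(any(sentence in c for c in without_overlap))  # False: the sentence is split
print(any(sentence in c for c in with_overlap))     # True: one window holds it whole
```

With fixed windows, any span no longer than the overlap is guaranteed to sit whole inside at least one chunk. Longer spans can still be split, which is one reason a boundary-aware splitter helps.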

Why a cross-encoder re-ranker? A bi-encoder embeds the question and each chunk independently and compares the resulting vectors, which is fast but imprecise. A cross-encoder reads the question and chunk together and scores their relevance jointly. The result is noticeably better precision, especially for specific technical questions.

Why all-MiniLM-L6-v2 for embeddings? It runs entirely locally at no cost, produces 384-dimensional vectors, and performs competitively with larger models on semantic similarity tasks. The cross-encoder re-ranker then corrects ranking errors among the retrieved candidates.

Evaluation

Retrieval is measured against a hand-labeled question set (eval/eval_set.json), where each question is tagged with a distinctive phrase that must appear in the retrieved chunk. eval/evaluate.py reports recall and quantifies the value of the re-ranking stage:

python eval/evaluate.py

Results on a 14-question set (sample research paper):

| Metric | Score | Meaning |
| --- | --- | --- |
| Recall@20 | 93% | Gold chunk retrieved among bi-encoder candidates |
| Hit@3 (bi-encoder only) | 79% | Gold chunk in top-3 without re-ranking |
| Hit@3 (with re-ranker) | 93% | Gold chunk in top-3 with cross-encoder re-ranking |
| MRR | 0.93 | Mean reciprocal rank after re-ranking |

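The cross-encoder re-ranker lifts Hit@3 from 79% to 93%, which matches Recall@20: every gold chunk the bi-encoder retrieved reaches the top 3 after re-ranking. This is concrete evidence that the second retrieval stage earns its cost by pulling the genuinely relevant chunk into the top 3 that reach the LLM.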
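The metrics in the table can all be computed from the rank of the gold chunk for each question. The sketch below shows one way to do that and is not the contents of eval/evaluate.py. The function names are hypothetical and the ranks are made-up illustration data, not the project's results.

```python
from typing import Optional, Sequence

# Each entry is the 1-based rank of the gold chunk, or None if it was not retrieved.
Ranks = Sequence[Optional[int]]


def hit_at_k(ranks: Ranks, k: int) -> float:
    """Fraction of questions whose gold chunk is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Ranks) -> float:
    """Average of 1/rank, counting unretrieved gold chunks as 0."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Illustrative ranks for four questions (not real evaluation data).
example_ranks = [1, 2, None, 5]

print(hit_at_k(example_ranks, 3))    # 0.5: two of four gold chunks are in the top 3
print(hit_at_k(example_ranks, 20))   # 0.75: same formula gives Recall@20
print(mean_reciprocal_rank(example_ranks))
```

For scale, on a 14-question set each question is worth about 7 percentage points, so 79% and 93% correspond to 11 and 13 hits out of 14.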

Planned improvements

Source passage highlighting (show exact text used, not just page number)
Table extraction (PyMuPDF skips tables in technical PDFs)
HyDE retrieval (embed a hypothetical answer for better candidate recall)
Semantic chunking (split at meaning boundaries instead of fixed character count)
Multi-language support (Bahasa Malaysia, Chinese)

License

MIT
