A lightweight, local pipeline for extracting biomedical entities from scientific documents and generating structured reports.
This project processes user-provided documents (e.g., PDFs for a single publication) and extracts biomedical entities such as genetic variants. It is inspired by PubTator-style pipelines but designed to run fully locally and remain modular and extensible.
At present, the pipeline supports:
- Variant extraction only
- Powered by TMVar3 (Java CRF-based model)
- Detection of:
  - DNA mutations (e.g., c.233A>G)
  - Protein mutations (e.g., p.Asn78Ser)
  - Deletions, substitutions, frameshifts, nonsense variants
- Extraction from full-text biomedical PDFs via text parsing + BioC conversion
Additional entity types (genes, diseases, chemicals) are planned but not yet enabled.
The system performs the following steps:
- Load documents from an input directory
- Extract text from PDFs using PyMuPDF
- Normalize and clean text for biomedical NLP
- Convert text to BioC format (TMVar-compatible)
- Run TMVar3 (Java CRF pipeline)
- Parse PubTator output into structured entities
- Aggregate results into a JSON report
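The parsing step above can be illustrated with a minimal sketch. It assumes TMVar3 emits standard PubTator annotation lines (tab-separated: document ID, start, end, mention, label, identifier) and that the subtype is the second `|`-separated field of the identifier, which matches the example report in this README. The `Entity` class and `parse_pubtator` function are hypothetical names, not the project's actual API.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Entity:
    # Field names mirror the JSON report shown in this README.
    type: str
    text: str
    start: int
    end: int
    subtype: str
    normalized_id: str


def parse_pubtator(pubtator_text: str) -> List[dict]:
    """Turn PubTator annotation lines into report-style entity dicts."""
    entities = []
    for line in pubtator_text.splitlines():
        fields = line.split("\t")
        # Title/abstract lines ("id|t|...", "id|a|...") have no tab-separated
        # annotation fields, so they are skipped here.
        if len(fields) < 6 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        _doc_id, start, end, mention, _label, norm_id = fields[:6]
        # Assumption: identifiers look like "c|SUB|A|233|G", subtype is field 2.
        parts = norm_id.split("|")
        subtype = parts[1] if len(parts) > 1 else ""
        entities.append(
            asdict(Entity("variant", mention, int(start), int(end), subtype, norm_id))
        )
    return entities


# Illustrative input only; not real TMVar3 output.
sample = (
    "doc1|t|Example title\n"
    "doc1|a|Example body text\n"
    "doc1\t19563\t19571\tc.233A>G\tDNAMutation\tc|SUB|A|233|G\n"
)
entities = parse_pubtator(sample)
```

Keeping the parser independent of the Java step makes it easy to test against saved PubTator files (for example, those preserved with --keep-temp).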
Install Python dependencies:
pip install -r requirements.txt
TMVar3 requires:
- Java 8+ (Java 11 or 17 recommended)
- Sufficient heap memory (the default uses -Xmx5G)
Run the pipeline using:
python3 -m bio_text_annotator.cli \
--input-dir tests/test_data/ \
--source-id nihms-1892649 \
  --verbose
Available options:
- --input-dir: Directory containing documents for a single source (PDFs only currently)
- --source-id: Identifier for the dataset/publication being processed
- --output: Output path for the JSON report. Default: ./outputs/report.json
- --recursive: Recursively search the input directory for documents
- --formats: File types to include (PDFs only currently)
- --verbose: Enable debug logging
- --keep-temp: Preserve intermediate TMVar files for debugging
- --output-mode: Output structure format:
  - document (default): grouped per document
  - flat: fully aggregated entity list
- --heap-size: Java heap size for TMVar3 execution (e.g., 2G, 5G, 8G). Default: 5G (recommended). Adjust based on available system memory and document size.
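As a rough illustration of how a heap-size value relates to the JVM, the sketch below validates a string such as 5G and turns it into the standard -Xmx flag. The `heap_flag` helper is hypothetical and is not taken from this project's code; only the -Xmx flag itself is a real JVM option.

```python
import re


def heap_flag(heap_size: str = "5G") -> str:
    """Validate a heap size like '2G' or '512M' and return the JVM -Xmx flag."""
    value = heap_size.strip().upper()
    # Accept a positive integer followed by K, M, or G, as the JVM does.
    if not re.fullmatch(r"[1-9]\d*[KMG]", value):
        raise ValueError(f"invalid heap size: {heap_size!r}")
    return f"-Xmx{value}"


default_flag = heap_flag()      # uses the documented default of 5G
large_flag = heap_flag("8g")    # lowercase input is normalized
```

Validating the value before launching Java gives a clear error message instead of a JVM startup failure.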
The pipeline produces a structured JSON report:
{
  "source_id": "nihms-1892649",
  "documents": [
    {
      "doc_id": "...",
      "entities": [
        {
          "type": "variant",
          "text": "c.233A>G",
          "start": 19563,
          "end": 19571,
          "subtype": "SUB",
          "normalized_id": "c|SUB|A|233|G"
        }
      ]
    }
  ]
}
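A report in this shape can be consumed with a few lines of Python. The sketch below collapses the per-document structure into a single entity list, tagging each entity with its doc_id. The `flatten_report` helper and the `example-doc` identifier are illustrative assumptions; the exact layout written by the flat output mode is not specified here and may differ.

```python
import json
from typing import List


def flatten_report(report: dict) -> List[dict]:
    """Collapse a per-document report into one list of entities."""
    flat = []
    for doc in report.get("documents", []):
        for entity in doc.get("entities", []):
            # Carry the document ID along so each entity stays traceable.
            flat.append({"doc_id": doc.get("doc_id"), **entity})
    return flat


# Inline example mirroring the report above; "example-doc" is a placeholder ID.
report = json.loads("""
{
  "source_id": "nihms-1892649",
  "documents": [
    {
      "doc_id": "example-doc",
      "entities": [
        {"type": "variant", "text": "c.233A>G", "start": 19563, "end": 19571,
         "subtype": "SUB", "normalized_id": "c|SUB|A|233|G"}
      ]
    }
  ]
}
""")
flat_entities = flatten_report(report)
```

To use it on a real run, load the file with json.load from the path given to --output (./outputs/report.json by default).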