Model-Aware Parameterization from Literature Evidence
QSP models have many biological parameters, and most can't be measured directly in the clinical context being modeled. The relevant data is usually scattered across papers, often from different species or indications entirely. Turning that into informative priors is tedious, error-prone, and rarely done systematically.
MAPLE provides a structured pipeline for this. It uses LLMs to extract measurements from papers into validated YAML schemas (with anti-hallucination checks against source text), and scores how well each data source translates to the model context across eight axes (species, indication, TME compatibility, etc.). Each axis contributes a component to a translation sigma that widens the likelihood for that target during joint inference — so a mouse in vitro measurement constraining the same parameter as a human clinical measurement will naturally contribute less. The output is a set of marginal distributions plus a Gaussian copula that preserves posterior correlations for downstream SBI calibration.
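The weighting idea can be sketched numerically. In this sketch the axis names, penalty values, and the quadrature combination rule are illustrative assumptions, not MAPLE's exact rubric:

```python
import numpy as np

# Illustrative per-axis penalties (log-scale sigmas); a perfect match contributes 0.
axis_sigmas = {
    "species": 0.5,        # mouse -> human
    "indication": 0.3,     # different tumor type
    "assay_context": 0.4,  # in vitro -> in vivo
}
sigma_translation = np.sqrt(sum(s**2 for s in axis_sigmas.values()))

# The likelihood for this target then uses a widened total sigma,
# so poorly-translating sources constrain the parameter less.
sigma_obs = 0.2
sigma_total = np.sqrt(sigma_obs**2 + sigma_translation**2)
print(round(sigma_total, 3))  # 0.735
```

A source that matches the model context on every axis keeps `sigma_total` close to `sigma_obs` and so dominates the joint fit.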
MAPLE fits into a two-stage calibration pipeline:
| Stage | Input | Method | Output |
|---|---|---|---|
| 1a (this repo) | Scientific literature | LLM extraction + Pydantic validation | SubmodelTarget / CalibrationTarget YAMLs |
| 1b (qsp-inference) | SubmodelTarget YAMLs + priors CSV | Joint MCMC (NumPyro/NUTS) | submodel_priors.yaml (marginals + copula) |
| 2 (qsp-inference) | Copula priors + clinical data + full QSP simulator | SBI (SNPE-C) | Final posterior |
```shell
pip install maple-qsp
```

MAPLE works with any AI tool that can access your files and run Python — coding agents (Claude Code, Codex, Cursor) via MCP, or chat UIs with code execution (Claude Cowork, ChatGPT with Code Interpreter) via the Python API. From your model repo, ask the agent to extract a parameter:
"Use the MAPLE tool to help me extract the k_IL6_sec parameter"
The agent loads the extraction guide, investigates the parameter in your model code (units, mechanistic role, Hill function inputs), searches literature for constraining data, verifies DOIs, fetches PDFs from Zotero, and then extracts the SubmodelTarget YAML with validation at each step. The agent drives the workflow and tells you what to do at each step (e.g., add a paper to Zotero, digitize a figure). Your job is to verify that extracted inputs match the paper, that the agent isn't making up assumptions, and that the forward and error models make sense for the parameter and data source.
Once you have targets, run joint inference via qsp-inference:
```python
from qsp_inference.submodel.prior import process_targets

result = process_targets(
    priors_csv="pdac_priors.csv",
    yaml_paths=["target1.yaml", "target2.yaml"],
)
```

Each YAML file is a structured extraction from one paper. It connects a literature measurement to one or more model parameters through a self-contained forward model:
Example: IL-2 degradation rate from half-life data
```yaml
target_id: k_IL2_deg_deriv001
inputs:
  - name: t_half_alpha
    value: 6.0
    units: minute
    source_ref: Lotze1985
    value_snippet: "a half-life of approximately 5 to 7 min"
calibration:
  parameters:
    - name: k_IL2_deg
      units: 1/minute
  forward_model:
    type: algebraic
    formula: "t_half = ln(2) / k"
    code: |
      def compute(params, inputs):
          import numpy as np
          return np.log(2) / params['k_IL2_deg']
  error_model:
    - name: halflife_obs
      units: minute
      uses_inputs: [t_half_alpha, t_half_beta]
      sample_size_input: n_patients
      observation_code: |
        def derive_observation(inputs, sample_size, rng, n_bootstrap):
            import numpy as np
            vals = [inputs['t_half_alpha'], inputs['t_half_beta']]
            mu, sigma = np.mean(np.log(vals)), np.std(np.log(vals), ddof=1)
            return rng.lognormal(mu, sigma, n_bootstrap)
source_relevance:
  indication_match: related
  species_source: human
  species_target: human
  source_quality: primary_human_clinical
  # ... (8-axis rubric → translation sigma applied in likelihood)
```

Forward model types include algebraic formulas, dose-response curves (`direct_fit`), power laws, and ODE systems — both structured types with analytical solutions (`exponential_growth`, `first_order_decay`, `logistic`, etc.) and arbitrary user-provided ODEs (`custom_ode`) integrated numerically via diffrax. The source relevance assessment maps to a translation sigma that widens the likelihood during inference, so mouse data naturally gets less weight than human data constraining the same parameter.
Nuisance parameters can be marked nuisance: true when needed by the forward model but not part of the QSP model (e.g., a proliferation rate that helps constrain an activation rate). They carry their own inline prior, are sampled during MCMC, but are excluded from the output priors.
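A sketch of what a nuisance entry might look like. Only `nuisance: true` is stated above; the parameter name and the `prior` block's field names are assumptions for illustration:

```yaml
parameters:
  - name: k_prolif_invitro   # needed by the forward model, not in the QSP model
    nuisance: true
    prior:
      distribution: lognormal
      median: 0.7
      sigma: 0.3
```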
There's also a CalibrationTarget schema for clinical/in vivo observables (biopsies, blood draws) that require full model simulation — these feed into Stage 2.
For extracting many parameters at once, MAPLE supports a staged batch pipeline that automates the multi-step workflow across a set of targets. Each stage caches its results per-target, so you can rerun any stage for any subset without redoing work.
| Stage | Step | What it does |
|---|---|---|
| 1 | Lit search | Web search for papers per target (parallel) |
| 1b | PDF collection | Zotero DOI lookup + interactive fetch loop |
| 2 | Paper assessment | Read PDFs, assess data quality (parallel) |
| 2b | Plan review | Single LLM call reviewing all plans together |
|  | Digitization summary | Prioritized list of figures to digitize |
|  | *Human digitization step (WebPlotDigitizer)* |  |
| 3 | Extract | Assemble SubmodelTarget YAMLs (parallel) |
| 3b | Derivation review | Single LLM call checking scientific soundness |
| 3c | Validate | MCMC prior derivation + unit checks + snippet matching |
Input: a CSV listing target parameters:
```csv
target_id,parameters,cancer_type,notes
k_IL2_sec,k_IL2_sec,PDAC,"Per-cell IL-2 secretion rate. Search for: ELISA, single-cell secretion rates."
k_vas_growth,k_vas_growth,PDAC,"Rate law: dK/dt = k_vas_growth * C_total * VEGF/(VEGF+VEGF_50). Search for: MVD growth kinetics."
```

The `notes` field guides the lit search agent. Include rate laws, search terms, and context about what kind of data would constrain the parameter. Richer notes produce better search results.
Per-target caching: each target gets a directory (`work/staged_extraction/{target_id}/`) with independently cached files:

- `lit_search_results.json` (stage 1)
- `assessment.json` (stage 2)
- `{target_id}_*_deriv001.yaml` (stage 3)
To rerun a specific stage for a specific target, delete its cache file. Other targets and stages are untouched.
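For example, to redo only the paper assessment (stage 2) for one target (the target_id here is illustrative):

```shell
# Delete the stage-2 cache for target k_IL2_sec; the next pipeline run
# re-assesses this target while all other caches stay valid.
rm -f work/staged_extraction/k_IL2_sec/assessment.json
```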
An LLM agent with web search finds 3-5 papers per target with quantitative data matching the parameter's model role. Each candidate includes:
- DOI (validated against CrossRef)
- Relevance summary and mapping concerns
- Jointly constrainable parameters (other QSP params the paper could also constrain)
Notes in the targets CSV matter. A terse note like "angiogenesis rate" may return nothing, while a note including the rate law and specific search terms ("MVD growth kinetics, vascular doubling times") finds relevant papers.
PDFs are fetched from Zotero's local SQLite database by exact DOI lookup (case-insensitive). An interactive loop handles missing papers:
- Auto-fetch from Zotero storage
- Copy missing DOIs to clipboard for Zotero "Add by Identifier"
- Press Enter to re-fetch, 'b' to open in browser for manual download, 's' to skip
- Final summary of still-missing papers with clickable DOI links
Each paper is read (PDF attached to the LLM) and assessed for:
- Data availability and location (table, text, or figure)
- Mapping quality to the model parameter
- Digitization need and priority (`critical`/`helpful`/`optional`/`not_needed`)
- Paper role: `standalone`, `required_for_derivation`, `alternative`, or `validation_only`
- Forward model suggestion and jointly constrainable parameters
The output is an extraction plan: the minimal set of papers (and specific figures/tables) needed for one complete derivation, plus alternative plans.
A single LLM call reviews all extraction plans together, checking for:
- Proxy measurements when direct data exists in an alternative
- Small sample sizes when larger datasets are available
- Excessive digitization burden when simpler alternatives exist
- Empty plans that need lit search reruns
Verdicts: `proceed`, `switch_to_alt` (swaps the plan in `assessment.json`), `rerun_lit_search` (deletes caches, appends search guidance to the targets CSV), or `defer`.
After plan review, a prioritized digitization summary shows which figures need WebPlotDigitizer treatment. Items are ranked:
- `[REQUIRED]`: in the extraction plan and `critical` priority
- By priority: `critical` > `helpful` > `optional`
- Extraction plan items before alternative/validation-only items
Place WPD CSV exports in `work/staged_extraction/{target_id}/digitized/{source_tag}/`. The pipeline reads these automatically during extraction.
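A concrete layout might look like this (the target_id `k_vas_growth` and source tag `Vermeulen1996` are illustrative):

```shell
# Create the per-source digitized-data directory for one target:
mkdir -p work/staged_extraction/k_vas_growth/digitized/Vermeulen1996
# Then export each digitized series from WebPlotDigitizer as CSV into it, e.g.:
#   work/staged_extraction/k_vas_growth/digitized/Vermeulen1996/fig2_mvd.csv
```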
The LLM assembles a SubmodelTarget YAML following the extraction plan. It sees:
- The extraction plan with explicit "FOLLOW THIS" instructions
- Plan review reasoning for why this plan was chosen
- Paper PDFs (filtered to plan papers only to avoid context overflow)
- Digitized data CSVs
- Parameter context from model_structure.json
- Prior sanity check (current median/sigma)
Output is validated against the SubmodelTarget schema before writing.
A single LLM call reviews all completed derivations for scientific soundness:
- Forward model appropriateness
- Input data fidelity and unit conversions
- Biological plausibility
- Derivation logic (circular reasoning, proxy assumptions)
- Cross-target consistency (contradictory assumptions, redundant constraints)
Mechanical validation per target:
- SubmodelTarget schema validation against model_structure.json
- Snippet-in-paper verification
- Passing targets are copied to `calibration_targets/submodel_targets/`
Copy `examples/staged_extraction.py` into your model repo and edit the config section at the top (paths, model name, target range). The script is designed to be run interactively in a Jupyter/IPython notebook or copy-pasted into a REPL stage by stage.
This mode extracts one parameter at a time with an AI coding agent; it works with any AI tool that can run Python and access your files.
For Claude Code, Codex, Cursor, and other MCP-compatible agents, add to `.claude/settings.json`:

```json
{
  "mcpServers": {
    "maple": {
      "command": "python",
      "args": ["-m", "maple.mcp_server"]
    }
  }
}
```

For Claude Cowork, ChatGPT with Code Interpreter, or any environment that can `pip install` and run Python, the same tools are available as plain functions:
```python
from maple.mcp_server import extract_target, validate_target

# Load the extraction guide
guide = extract_target("submodel_target")

# Validate a target YAML
report = validate_target("path/to/target.yaml", "pdac_priors.csv")
```

| Tool | Purpose |
|---|---|
| `extract_target(target_type)` | Load the full extraction guide (schema, workflow, enum values, hard rules) |
| `validate_target(yaml_path, priors_csv)` | Schema validation + snippet verification |
| `verify_dois(dois)` | Verify DOIs resolve via CrossRef, return metadata |
| `fetch_papers_from_zotero(dois)` | Copy PDFs from Zotero's local storage into paper directories |

Inference tools (`run_joint_inference`, `compare_inference`) have moved to qsp-inference.
"Agent" and "You" labels indicate who drives each step.

1. You — Ask the agent to extract a parameter (e.g., "use the MAPLE tool to extract k_IL6_sec")
2. Agent — Loads the extraction guide (`extract_target`)
3. Agent — Investigates the parameter in your model code: units, mechanistic role, Hill function inputs, interactions with other parameters
4. Agent — Searches literature for quantitative data that constrains the parameter; verifies DOIs via CrossRef
5. You — Add papers to Zotero; the agent calls `fetch_papers_from_zotero` to pull PDFs
6. Agent — Reads the paper. If figures contain richer data (scatter plots, dose-response curves), asks you to digitize with WebPlotDigitizer
7. Agent — Builds the SubmodelTarget YAML incrementally, validating at each step
8. You — Review inputs, forward model, and error model. Check that values match the paper, assumptions are justified, and the model makes sense for the data
9. Agent — Runs `validate_target`; fixes schema, MCMC, or snippet errors
10. Iterate steps 7-9 until validation passes
Pydantic validators catch common extraction failures automatically:
- Anti-hallucination — extracted values must appear in `value_snippet`; snippets are verified against paper text via Europe PMC / Unpaywall
- Code validation — forward model and observation code syntax and execution checks
- DOI verification — CrossRef resolution and metadata matching
- Invisible characters — catches zero-width spaces and other PDF copy-paste artifacts
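The anti-hallucination idea can be illustrated with a toy check (this is a sketch, not MAPLE's actual validator, which also verifies snippets against the paper's full text):

```python
import re

def snippet_supports_value(snippet: str, value: float) -> bool:
    """Toy anti-hallucination check: the extracted value must appear in the
    quoted snippet, either verbatim or inside a stated numeric range."""
    nums = [float(tok) for tok in re.findall(r"\d+(?:\.\d+)?", snippet)]
    in_range = any(lo <= value <= hi for lo, hi in zip(nums, nums[1:]))
    return value in nums or in_range

# The extracted midpoint 6.0 is supported by the quoted range "5 to 7 min":
print(snippet_supports_value("a half-life of approximately 5 to 7 min", 6.0))  # True
```

A value the agent invented (say 12.0) would fail this check and be rejected before the YAML is written.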
Schema details, validator reference, and inference pipeline internals are in CLAUDE.md.
MIT