chore(notebooks): drop vendored _epinformerseq_v2/, rewire to HF mirror by JasonLinjc · Pull Request #89 · pinellolab/chorus

JasonLinjc · 2026-05-28T06:43:31Z

Summary

Removes the 11 MB of vendored EPInformer-seq per-cell training artifacts under examples/notebooks/_epinformerseq_v2/. Now that PR #87 lands the weights on lucapinello/chorus-epinformerseq-v2 and the oracle auto-fetches them on first use, the local copy was pure duplication. Local-only training artifacts (summary.json + test_preds.csv, ~3.5 MB) are reproducible from the cluster run at `/lustre/grp/zyjlab/linjc/epinformer/results/` if ever needed.

Changes

Deletions

`examples/notebooks/_epinformerseq_v2/` — 44 files, ~11 MB (22 `.pt`, 11 `summary.json`, 11 `test_preds.csv`, 1 `model.py`).
`examples/notebooks/epinformerseq_v2_percell_performance.ipynb` — pure viewer for the deleted `test_preds.csv`; per-cell test r values are preserved in the cluster's `summary.json`.

Rewired notebooks

`klf1_validated_enhancer_profiles.ipynb` — `s3-epi-slide` cell now loads weights via `EPInformerSeqOracle.load_pretrained_model()` (HF auto-fetch). Re-executed end-to-end against all 5 oracles.

Script update

`scripts/build_backgrounds_epinformerseq_v2_percell.py` — defaults now `~/.chorus/downloads/epinformerseq/{per_cell,bias}`. Imports `PerCellProfileNet`/`BiasNet` from the canonical `chorus.oracles.epinformerseq_source.model`. Auto-fetches via the oracle on cache miss.
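The cache-miss behaviour described above can be sketched roughly as follows. This is an illustration only, not the script's actual code: `resolve_weights_dir` and the `fetch` callback are hypothetical names, and `fetch` merely stands in for whatever download the oracle performs. Only the default cache root and the `per_cell` / `bias` subdirectory names come from this PR.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Default cache root named in this PR; the helper below is hypothetical.
DEFAULT_ROOT = Path.home() / ".chorus" / "downloads" / "epinformerseq"

def resolve_weights_dir(cache_root: Path, subdir: str,
                        fetch: Callable[[Path], None]) -> Path:
    """Return cache_root/subdir, calling fetch(target) only on a cache miss."""
    target = cache_root / subdir
    # A "hit" here means the directory exists and holds at least one .pt file.
    has_weights = target.is_dir() and any(target.glob("*.pt"))
    if not has_weights:
        fetch(target)  # placeholder for the oracle's auto-fetch
    return target

# Usage with a fake fetcher, so nothing is downloaded.
fetched = []

def fake_fetch(target: Path) -> None:
    target.mkdir(parents=True, exist_ok=True)
    (target / "dummy.pt").write_bytes(b"")
    fetched.append(target.name)

with tempfile.TemporaryDirectory() as tmp:
    for sub in ("per_cell", "bias"):
        resolve_weights_dir(Path(tmp), sub, fake_fetch)  # miss: fetches
        resolve_weights_dir(Path(tmp), sub, fake_fetch)  # hit: no fetch
```

Passing the fetcher in as a callable keeps the sketch runnable without the `chorus` package installed; in the real script the oracle itself plays that role.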

README

`examples/notebooks/README.md` — added "Topic-focused notebooks" table for the KLF1 + testing notebooks (they were undocumented).

Drive-by fixes in `epinformerseq_testing.ipynb`

The wild-type + ISM cells still used `HALF=128` / `assert len(seq)==256` left over from the v1 256-bp model. The 1024-bp v2 model silently auto-pads, so it didn't crash but produced wrong values (smoke test: 0.62 vs the correct 5.23 — different inputs entirely). Fixed:

cell 7: `HALF=128` → `HALF=512`
cell 17: `assert len(ref_seq) == 1024`
cell 21: `xs` now spans the central 256 bp in genome coordinates (`REGION_START+384..REGION_START+639`); variant-rank computed against that slice.
cells 0, 6, 16, 29: markdown updated to reflect 1024-bp context + central-256-bp scalar aggregation, and the unsigned `enhancer_activity` LayerConfig.
Re-executed end-to-end.
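For reference, the window geometry behind these fixes works out as below. This is a minimal sketch, not the notebook's code: the helper names and the example center coordinate are made up for illustration, while the 1024-bp window, `HALF=512`, and the `REGION_START+384..REGION_START+639` slice are taken from the list above.

```python
WINDOW = 1024        # v2 model context length (bp)
HALF = WINDOW // 2   # 512, the corrected value (v1 used 128 for a 256-bp window)
CENTRAL = 256        # width of the central slice used for the scalar

def window_bounds(center: int) -> tuple[int, int]:
    """Half-open [start, end) bounds of the 1024-bp window around center."""
    return center - HALF, center + HALF

def central_slice(region_start: int) -> range:
    """Genome coordinates of the central 256 bp inside the 1024-bp window."""
    offset = (WINDOW - CENTRAL) // 2   # 384
    return range(region_start + offset, region_start + offset + CENTRAL)

# Hypothetical center position, for illustration only.
region_start, region_end = window_bounds(center=1_000_000)
xs = central_slice(region_start)

assert region_end - region_start == 1024
assert xs[0] == region_start + 384
assert xs[-1] == region_start + 639
assert len(xs) == 256
```

With the stale `HALF=128`, the same `window_bounds` call would have produced a 256-bp sequence, which the v2 model pads rather than rejects; asserting the length up front (as cell 17 now does) turns that silent mismatch into an immediate failure.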

Test plan

`tests/test_epinformerseq.py` 20/20 pass (`chorus-epinformerseq` env)
KLF1 notebook re-executed end-to-end without errors
EPInformer-seq testing notebook re-executed end-to-end without errors
Smoke test: HF auto-fetch + central-256-bp aggregation matches build_backgrounds geometry
`scripts/build_backgrounds_epinformerseq_v2_percell.py --help` parses with new defaults

🤖 Generated with Claude Code

PR #87 added the EPInformer-seq oracle but README still advertised six. Update hero pitch, oracle picker table, disk-usage breakdown, per-oracle setup block, mirror map, weight-size table, and oracle-list code blocks to include EPInformer-seq (per-cell PerCellProfileNet + frozen BiasNet, 11 Roadmap cells, 1024-bp scalar enhancer activity). Per-oracle footprint added to disk math: ~2 GB env + ~11 MB weights + ~770 KB CDF; total default install moves from ~28 GB to ~31 GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Removes the 11 MB of per-cell EPInformer-seq training artifacts that were vendored under examples/notebooks/_epinformerseq_v2/. The weights are now pulled on-demand from the HF mirror lucapinello/chorus-epinformerseq-v2 (PR #87), so the local copy was pure duplication; the local-only summary.json + test_preds.csv training artifacts (~3.5 MB) are reproducible from the cluster training run at /lustre/grp/zyjlab/linjc/epinformer/results/.

Changes

- examples/notebooks/_epinformerseq_v2/: deleted (44 files, ~11 MB).
- examples/notebooks/epinformerseq_v2_percell_performance.ipynb: deleted (pure viewer for the deleted test_preds.csv; per-cell test r values are preserved on the cluster's summary.json).
- examples/notebooks/klf1_validated_enhancer_profiles.ipynb: rewired cell s3-epi-slide to load weights via EPInformerSeqOracle (auto-downloads from HF on first use). s3-md updated to drop the vendored-ckpts mention. Re-executed end-to-end (5 oracles).
- scripts/build_backgrounds_epinformerseq_v2_percell.py: defaults switched to ~/.chorus/downloads/epinformerseq/{per_cell,bias}. Import PerCellProfileNet/BiasNet from chorus.oracles.epinformerseq_source instead of the vendored model.py. Auto-fetches via the oracle on cache miss.
- examples/notebooks/README.md: added topic-focused notebooks table for klf1_validated_enhancer_profiles.ipynb + epinformerseq_testing.ipynb.

Drive-by fixes in epinformerseq_testing.ipynb

The wild-type + ISM cells still used HALF=128 and asserted len(seq)==256 left over from the v1 256-bp model. The 1024-bp v2 model silently auto-pads, so it didn't crash but produced wrong values (0.62 vs the correct 5.23 in the smoke test). Fixed:

- cell 7: HALF=128 -> HALF=512 (1024-bp window).
- cell 17: assert len(ref_seq) == 1024.
- cell 21: per-position xs now spans the central 256 bp in genome coordinates (REGION_START+384 .. REGION_START+639). Variant-rank computed against that slice.
- cells 0, 6, 16, 29: docstring/markdown rewritten to reflect 1024-bp context + central-256-bp scalar aggregation, and the unsigned enhancer_activity LayerConfig (vs the old signed promoter_activity).
- Re-executed end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolve overlap with main's #88 (EPInformer-seq catalog) and #89 (drop vendored _epinformerseq_v2/). Both landed the same logical changes this branch already carried; took this branch's newer roadmap-retrain versions for the EPInformer-seq descriptions (README), the executed notebooks, and the per_cell_widewin / PerCellProfileNetWide background builder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

JasonLinjc and others added 2 commits May 27, 2026 22:41

Copilot AI review requested due to automatic review settings May 28, 2026 06:43

Copilot started reviewing on behalf of JasonLinjc May 28, 2026 06:43 View session

JasonLinjc merged commit 03c6302 into main May 28, 2026
1 of 2 checks passed

JasonLinjc removed the request for review from Copilot May 28, 2026 07:04

JasonLinjc merged 2 commits into main from chore/cleanup-epinformerseq-vendored
