
chore(notebooks): drop vendored _epinformerseq_v2/, rewire to HF mirror #89

Merged
JasonLinjc merged 2 commits into main from chore/cleanup-epinformerseq-vendored on May 28, 2026


@JasonLinjc

Collaborator

Summary

Removes the 11 MB of vendored EPInformer-seq per-cell training artifacts under `examples/notebooks/_epinformerseq_v2/`. PR #87 landed the weights on `lucapinello/chorus-epinformerseq-v2`, and the oracle auto-fetches them on first use, so the local copy was pure duplication. The local-only training artifacts (`summary.json` + `test_preds.csv`, ~3.5 MB) can be regenerated from the cluster run at `/lustre/grp/zyjlab/linjc/epinformer/results/` if ever needed.

Changes

Deletions

  • `examples/notebooks/_epinformerseq_v2/` — 44 files, ~11 MB (22 `.pt`, 11 `summary.json`, 11 `test_preds.csv`, 1 `model.py`).
  • `examples/notebooks/epinformerseq_v2_percell_performance.ipynb` — pure viewer for the deleted `test_preds.csv`; per-cell test r values are preserved in the cluster's `summary.json`.

Rewired notebooks

  • `klf1_validated_enhancer_profiles.ipynb` — `s3-epi-slide` cell now loads weights via `EPInformerSeqOracle.load_pretrained_model()` (HF auto-fetch). Re-executed end-to-end against all 5 oracles.

Script update

  • `scripts/build_backgrounds_epinformerseq_v2_percell.py` — defaults now `~/.chorus/downloads/epinformerseq/{per_cell,bias}`. Imports `PerCellProfileNet`/`BiasNet` from the canonical `chorus.oracles.epinformerseq_source.model`. Auto-fetches via the oracle on cache miss.
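
The cache-miss behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `resolve_weight_dirs` and the `fetch` hook are hypothetical names, and the real download step goes through the oracle, whose API is not shown in this PR description. Only the default directory layout (`~/.chorus/downloads/epinformerseq/{per_cell,bias}`) is taken from the text.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable, Optional

# Default cache root named in the PR description.
CACHE_ROOT = Path("~/.chorus/downloads/epinformerseq").expanduser()


def resolve_weight_dirs(
    root: Path = CACHE_ROOT,
    fetch: Optional[Callable[[Path], None]] = None,
) -> dict[str, Path]:
    """Return the per_cell and bias directories, fetching on a cache miss.

    `fetch` is a hypothetical hook standing in for the oracle's download
    step; it receives the cache root and is expected to populate it.
    """
    dirs = {name: root / name for name in ("per_cell", "bias")}

    # A directory counts as a miss if it is absent or empty.
    missing = [
        name
        for name, path in dirs.items()
        if not (path.is_dir() and any(path.iterdir()))
    ]
    if missing:
        if fetch is None:
            raise FileNotFoundError(
                f"missing weight directories under {root}: {missing}"
            )
        fetch(root)
    return dirs
```

Passing the download step in as a callable keeps the path logic testable without network access; the second call with a populated cache never invokes `fetch`.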

README

  • `examples/notebooks/README.md` — added "Topic-focused notebooks" table for the KLF1 + testing notebooks (they were undocumented).

Drive-by fixes in `epinformerseq_testing.ipynb`

The wild-type + ISM cells still used `HALF=128` / `assert len(seq)==256`, left over from the v1 256-bp model. The 1024-bp v2 model silently auto-pads, so the notebook didn't crash but produced wrong values (smoke test: 0.62 vs the correct 5.23; the inputs differ entirely). Fixed:

  • cell 7: `HALF=128` → `HALF=512`
  • cell 17: `assert len(ref_seq) == 1024`
  • cell 21: `xs` now spans the central 256 bp in genome coordinates (`REGION_START+384..REGION_START+639`); variant-rank computed against that slice.
  • cells 0, 6, 16, 29: markdown updated to reflect 1024-bp context + central-256-bp scalar aggregation, and the unsigned `enhancer_activity` LayerConfig.
  • Re-executed end-to-end.
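
The window geometry behind these fixes can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names and the `REGION_START` value are hypothetical, and the centring convention (`start = center - HALF`) is an assumption. The constants (1024-bp context, central 256 bp, offsets 384 and 639) come from the description above.

```python
WINDOW = 1024       # v2 model context length in bp
CENTER = 256        # central span used for the scalar aggregation
HALF = WINDOW // 2  # 512, the corrected value for cell 7


def window_bounds(center_pos: int) -> tuple[int, int]:
    """Half-open [start, end) window of WINDOW bp around center_pos.

    Assumes the window starts HALF bp upstream of center_pos.
    """
    return center_pos - HALF, center_pos + HALF


def central_slice() -> slice:
    """Slice selecting the central CENTER bp of a WINDOW-bp sequence."""
    offset = (WINDOW - CENTER) // 2  # (1024 - 256) // 2 = 384
    return slice(offset, offset + CENTER)  # positions 384..639 inclusive


def check_length(seq: str) -> str:
    """Fail loudly instead of relying on the model's silent auto-padding."""
    if len(seq) != WINDOW:
        raise ValueError(f"expected {WINDOW} bp, got {len(seq)} bp")
    return seq


# Hypothetical region start, for illustration only.
REGION_START = 100_000
sl = central_slice()
# Genome coordinates of the central 256 bp, as in the cell 21 fix.
xs = range(REGION_START + sl.start, REGION_START + sl.stop)
```

An explicit length check like `check_length` turns the failure mode described above (a 256-bp input padded without warning) into an immediate error rather than a plausible-looking wrong score.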

Test plan

  • `tests/test_epinformerseq.py` 20/20 pass (`chorus-epinformerseq` env)
  • KLF1 notebook re-executed end-to-end without errors
  • EPInformer-seq testing notebook re-executed end-to-end without errors
  • Smoke test: HF auto-fetch + central-256-bp aggregation matches build_backgrounds geometry
  • `scripts/build_backgrounds_epinformerseq_v2_percell.py --help` parses with new defaults

🤖 Generated with Claude Code

JasonLinjc and others added 2 commits May 27, 2026 22:41
PR #87 added the EPInformer-seq oracle but README still advertised six.
Update hero pitch, oracle picker table, disk-usage breakdown, per-oracle
setup block, mirror map, weight-size table, and oracle-list code blocks
to include EPInformer-seq (per-cell PerCellProfileNet + frozen BiasNet,
11 Roadmap cells, 1024-bp scalar enhancer activity).

Per-oracle footprint added to disk math: ~2 GB env + ~11 MB weights +
~770 KB CDF; total default install moves from ~28 GB to ~31 GB.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Removes the 11 MB of per-cell EPInformer-seq training artifacts that
were vendored under examples/notebooks/_epinformerseq_v2/. The weights
are now pulled on-demand from the HF mirror
lucapinello/chorus-epinformerseq-v2 (PR #87), so the local copy was
pure duplication; the local-only summary.json + test_preds.csv training
artifacts (~3.5 MB) are reproducible from the cluster training run at
/lustre/grp/zyjlab/linjc/epinformer/results/.

Changes
- examples/notebooks/_epinformerseq_v2/: deleted (44 files, ~11 MB).
- examples/notebooks/epinformerseq_v2_percell_performance.ipynb: deleted
  (pure viewer for the deleted test_preds.csv; per-cell test r values
  are preserved on the cluster's summary.json).
- examples/notebooks/klf1_validated_enhancer_profiles.ipynb: rewired
  cell s3-epi-slide to load weights via EPInformerSeqOracle (auto-
  downloads from HF on first use). s3-md updated to drop the
  vendored-ckpts mention. Re-executed end-to-end (5 oracles).
- scripts/build_backgrounds_epinformerseq_v2_percell.py: defaults
  switched to ~/.chorus/downloads/epinformerseq/{per_cell,bias}. Import
  PerCellProfileNet/BiasNet from chorus.oracles.epinformerseq_source
  instead of the vendored model.py. Auto-fetches via the oracle on
  cache miss.
- examples/notebooks/README.md: added topic-focused notebooks table for
  klf1_validated_enhancer_profiles.ipynb + epinformerseq_testing.ipynb.

Drive-by fixes in epinformerseq_testing.ipynb
The wild-type + ISM cells still used HALF=128 and asserted len(seq)==256
left over from the v1 256-bp model. The 1024-bp v2 model silently
auto-pads, so it didn't crash but produced wrong values (0.62 vs the
correct 5.23 in the smoke test). Fixed:
- cell 7: HALF=128 -> HALF=512 (1024-bp window).
- cell 17: assert len(ref_seq) == 1024.
- cell 21: per-position xs now spans the central 256 bp in genome
  coordinates (REGION_START+384 .. REGION_START+639). Variant-rank
  computed against that slice.
- cells 0, 6, 16, 29: docstring/markdown rewritten to reflect 1024-bp
  context + central-256-bp scalar aggregation, and the unsigned
  enhancer_activity LayerConfig (vs the old signed promoter_activity).
- Re-executed end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 28, 2026 06:43
@JasonLinjc JasonLinjc merged commit 03c6302 into main May 28, 2026
1 of 2 checks passed
@JasonLinjc JasonLinjc removed the request for review from Copilot May 28, 2026 07:04
JasonLinjc added a commit that referenced this pull request Jun 4, 2026
Resolve overlap with main's #88 (EPInformer-seq catalog) and #89 (drop
vendored _epinformerseq_v2/). Both landed the same logical changes this
branch already carried; took this branch's newer roadmap-retrain versions
for the EPInformer-seq descriptions (README), the executed notebooks, and
the per_cell_widewin / PerCellProfileNetWide background builder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>