UnifiedMatrixBuilder.build_matrix_chunked() (introduced in #753) resolved the ~49.6 GB RSS OOM of the non-chunked path on full-CPS national runs, but it does not scale to production. At default settings a full run produces ~207 chunks of 25,000 clone-household columns each and executes them serially on one 8-CPU / 64 GB Modal container, which cannot finish inside the 14 h timeout (remote_calibration_runner.py:440). The final-assembly step at unified_matrix_builder.py:3489–3506 also reintroduces a memory peak by loading every per-chunk COO shard into three Python lists before np.concatenate, partially undoing the gain from chunking.
We want the chunked builder to finish full-scale production runs in a fraction of current wall time and under a memory budget that doesn't regress vs. per-chunk peak.
Context

Production scale (confirmed in-repo):

| Quantity | Value | Source |
| --- | --- | --- |
| Stratified CPS household target | 12,000 | docs/methodology.md:422; calibration/create_stratified_cps.py:365 (__main__ default target = 12_000) |
| Clones per household | DEFAULT_N_CLONES | docs/methodology.md:422; calibration/unified_calibration.py:65 |
| Records entering calibration | ~5.2 million | docs/methodology.md:422 ("approximately 5.2 million total records entering calibration") |
| Calibration targets (rows) | 3,842 | docs/internals/README.md:130 |
| Default chunk size | 25,000 cols | calibration/unified_matrix_builder.py:3071 |
| Chunks per production run | ~207 | derived |
| Per-worker Modal ceiling | 8 CPU / 64 GB / 14 h | modal_app/remote_calibration_runner.py:440 |
| Existing H5 worker concurrency | max_containers=50 | modal_app/local_area.py:452 |
Note on prior confusion: the create_stratified_cps_dataset function has target_households=30_000 as its signature default, but that path is only taken by programmatic callers with no args. The production __main__ entry point (what data_build.py invokes) hard-codes target = 12_000.
Current build_matrix_chunked() (unified_matrix_builder.py:3065–3519) is a ~450-line method that owns partition, per-chunk materialization, per-chunk Microsimulation construction, COO shard writing, resume manifest validation, progress logging, and final CSR assembly. This matches the refactor doc's observation that UnifiedMatrixBuilder is overloaded (refactor §6: target-repo work, uprating, simulation, geography indexing, constraint evaluation, take-up handling, sparse assembly, and now orchestration).
Per-chunk work is structurally independent: the only cross-chunk state is constraint_cache (built once at line 3136, immutable thereafter) and the geography arrays (read-only). This is embarrassingly parallel.
Goals
Full-CPS chunked matrix build finishes end-to-end on Modal within the current 14 h ceiling, with substantial wall-time headroom for future growth.
Peak memory during final assembly bounded by final-CSR size, not ~2× via list-then-concatenate.
Parallelization introduced in a way that is consistent with the refactor direction — behind a small, testable class — not as a new 500-line branch inside build_matrix_chunked.
No new Modal apps, volumes, or concurrency budget (see "Infrastructure impact" below).
Non-goals
Full Phase 4 extraction of MatrixAssembler / SimulationBatchEvaluator / TargetRepository / ConstraintEvaluator. That is a larger effort tracked by the refactor plan. This issue carves out the coordination layer only, which the refactor doc treats as a natural seam.
Changes to the non-chunked build_matrix() path. It will continue to exist for small/local callers.
Changes to L0 fitting, target querying, or take-up logic.
Schema/contract changes to calibration_package.pkl — that's Phase 1 of the refactor.
Proposed direction
Extract a coordinator class — proposed name ChunkedMatrixAssembler — in policyengine_us_data/calibration/. The class owns:
chunk partitioning and dispatch of per-chunk work through a ChunkDispatcher;
resume/skip validation against the existing manifest (chunk_manifest.json + lineage signature, already implemented);
per-chunk progress / ETA logging (already implemented);
streaming final assembly into CSR (new — replaces the current list-then-concat).
The per-chunk kernel (materialize subset H5 → build Microsimulation → compute variables → write COO shard) is factored into a pure callable so the same function runs in-process today and inside a Modal worker tomorrow. The refactor's design principle applies: the kernel knows arrays, the dispatcher knows where they execute.
LocalChunkDispatcher runs kernels in the current process — the existing serial path, preserved for small runs, tests, and environments without Modal.
ModalChunkDispatcher spawns build_matrix_chunk_worker Modal functions batched by chunks_per_worker, following the build_areas_worker pattern (modal_app/local_area.py:442–454, max_containers=50, modal.Volume for shard coordination, .spawn() / .get() idiom). Shards land at /staging/{run_id}/matrix_chunks/chunk_XXXXXX.npz; the existing chunk_manifest.json machinery (lines 3197–3218) works unchanged.
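The kernel/dispatcher seam can be sketched as follows. ChunkSpec's fields, the kernel signature, and everything beyond the class names already used in this issue are illustrative assumptions, not the repo's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Protocol

@dataclass(frozen=True)
class ChunkSpec:
    index: int      # chunk ordinal, used in chunk_XXXXXX.npz shard names
    col_start: int  # first clone-household column in this chunk
    col_stop: int   # one past the last column

# A pure kernel: takes a chunk description, writes one COO shard, returns
# the shard path. It holds no builder state, so the same callable can run
# in-process today and inside a Modal worker tomorrow.
ChunkKernel = Callable[[ChunkSpec], str]

class ChunkDispatcher(Protocol):
    def run(self, kernel: ChunkKernel, chunks: Iterable[ChunkSpec]) -> List[str]: ...

class LocalChunkDispatcher:
    """Executes kernels serially in the current process (today's path)."""

    def run(self, kernel: ChunkKernel, chunks: Iterable[ChunkSpec]) -> List[str]:
        return [kernel(spec) for spec in chunks]
```

A ModalChunkDispatcher would implement the same run() shape but batch specs and hand them to .spawn()ed workers instead of calling the kernel directly.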
CLI: add --parallel / --no-parallel and --num-matrix-workers to unified_calibration. These flags are only meaningful when the chunked path is active — on the single-pass build_matrix path there's nothing to fan out. When --parallel is passed without --chunked-matrix, silently ignore rather than error. Default --no-parallel during initial rollout; flip the default after the benchmark (see "Validation" below) confirms the parallel path is correct and stable.
Fan-out default: start at num_workers=50 to match the existing H5-worker convention (modal_app/local_area.py:452) and the num_workers input already wired through pipeline.yaml. At ~207 chunks that's ~4 chunks/worker — slightly finer-grained than H5 building but consistent. The per-chunk benchmark (Open Questions §1) can push this down if cold-start cost turns out to dominate or up if per-chunk wall time is large; 50 is the operating default, not a hard target.
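For the batching itself, a balanced contiguous split keeps every worker within one chunk of the others. A minimal sketch (plan_batches is a hypothetical helper, not existing code):

```python
def plan_batches(num_chunks: int, num_workers: int) -> list[range]:
    """Split chunk indices into contiguous per-worker batches whose sizes
    differ by at most one, so ~207 chunks over 50 workers gives 4-5 chunks
    per worker. Illustrative helper, not the repo's batching code."""
    base, extra = divmod(num_chunks, num_workers)
    batches, start = [], 0
    for i in range(num_workers):
        size = base + (1 if i < extra else 0)
        batches.append(range(start, start + size))
        start += size
    # Drop empty batches when there are fewer chunks than workers.
    return [b for b in batches if len(b) > 0]
```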
Streaming final assembly replaces the current three-list concat. Two-pass approach: (1) walk shard headers to sum nnz, (2) preallocate data, indices, indptr arrays once and fill by iterating shards. Peak memory becomes the final CSR, not 2× it.
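The two-pass idea, sketched on COO shard files. The shard schema ('row' / 'col' / 'data' arrays per .npz) and the downstream CSR conversion are assumptions for illustration:

```python
import numpy as np

def assemble_coo_streaming(shard_paths):
    """Two-pass shard assembly: size first, then fill preallocated arrays.
    Sketch only — assumes each .npz shard stores 'row', 'col', 'data'."""
    # Pass 1: sum nnz. Only one shard is resident at a time; the real
    # builder could read these counts from chunk_manifest.json instead.
    nnz = 0
    for path in shard_paths:
        with np.load(path) as shard:
            nnz += int(shard["data"].shape[0])

    # Pass 2: preallocate the final arrays once and fill slice-by-slice,
    # so peak memory is the output plus one shard, never two full copies.
    data = np.empty(nnz, dtype=np.float64)
    rows = np.empty(nnz, dtype=np.int64)
    cols = np.empty(nnz, dtype=np.int64)
    out = 0
    for path in shard_paths:
        with np.load(path) as shard:
            n = shard["data"].shape[0]
            data[out : out + n] = shard["data"]
            rows[out : out + n] = shard["row"]
            cols[out : out + n] = shard["col"]
            out += n
    return data, rows, cols
```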
Relationship to the refactor plan
This issue implements a small, deliberate subset of refactor Phase 4. Specifically:
ChunkedMatrixAssembler is a precursor to the full MatrixAssembler extraction. It takes on orchestration but does not yet absorb target-repo / uprater / constraint-evaluator / simulation-batch-evaluator responsibilities from UnifiedMatrixBuilder. Those remain inside the builder for now.
ChunkDispatcher matches the refactor's push to make execution boundaries (local vs Modal) replaceable rather than hard-coded in orchestration code.
Per-chunk kernel purification aligns with refactor §376–378 ("side effects are too deeply embedded"). It does not finish the job, but it stops making it worse.
This keeps the refactor migration forward-compatible: when Phase 4 starts extracting MatrixAssembler proper, ChunkedMatrixAssembler becomes the orchestration caller for it, not a competitor.
Infrastructure impact
Zero net change to Modal footprint:
Apps: modal_app/pipeline.py registers one modal.App("policyengine-us-data-pipeline"). New @app.function decorators register inside it — no new apps.
Volumes: new shards live in the existing pipeline-artifacts (or local-area-staging) volume under a new /staging/{run_id}/matrix_chunks/ path. Ephemeral per-run; no durable growth.
Concurrency budget: max_containers=50 reused from the H5-worker ceiling, not added on top.
Billing: CPU-seconds × memory-GB-seconds are invariant to container count for the same total work. 1 worker × 64 GB × 14 h ≈ 50 workers × 64 GB × ~17 min ≈ 896 GB-hours. Cold-start overhead ≈ 20 s × 50 × 8 CPU ≈ 2 CPU-hours (rounding error). Net expected effect is neutral-to-favorable because today's sequential run can burn the full 14 h allocation on incomplete work that has to be re-run.
Prerequisite: make Modal pipeline boot again
Independent of parallelization, the deployed policyengine-us-data-pipeline Modal app has not successfully executed a single function call since #765 due to a container-boot import failure. Evidence: all six fc-... spawns for commits merged after #765 (fc-01KPF4SQNGRS82ZYM81FZ0DWCX, fc-01KPNAJGTJC6W1CHD6W3G0BS8T, fc-01KPPWKY5HJ07C9WEEZHGBN5VX, fc-01KPPZ6V3W2VZ9BJNT949CZKWA, fc-01KPQVWZQZ1ZKWCJ0RCBB35FMH, fc-01KPS3XADWAFFBVE01NRWACFR2) log the same trace in modal app logs ap-m5QsBgnRqhJtRMRKfA8myq --env main:
```
  File "/root/policyengine-us-data/modal_app/local_area.py", line 32, ...
  File "/root/policyengine-us-data/policyengine_us_data/__init__.py", line 3, in <module>
    from .geography import ZIP_CODE_DATASET
  File "/root/policyengine-us-data/policyengine_us_data/geography/__init__.py", line 2, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'
```
Root cause: modal_app/images.py::_base_image runs uv sync --frozen which installs deps into /root/policyengine-us-data/.venv/, but Modal boots the container with system Python, which has only uv.
Two aligned fixes, both consistent with the refactor doc:
Short-term (same issue / same PR): add .env({"VIRTUAL_ENV": ..., "PATH": "/root/.../.venv/bin:...", "PYTHONPATH": "/root/.../.venv/lib/python3.14/site-packages"}) after .run_commands(... uv sync ...) in modal_app/images.py. Unblocks boot without touching import semantics.
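A sketch of what that could look like in modal_app/images.py. The base image, exact paths, and the elided build steps are assumptions — only the .env() placement after uv sync matters:

```python
import modal

VENV = "/root/policyengine-us-data/.venv"

# Sketch: point runtime env vars at the uv-created venv so container boot
# resolves `python` (and pandas) from .venv instead of the system
# interpreter that only has uv installed.
_base_image = (
    modal.Image.debian_slim()
    # ... copy the repo, install uv, etc. (elided) ...
    .run_commands("cd /root/policyengine-us-data && uv sync --frozen")
    .env({
        "VIRTUAL_ENV": VENV,
        "PATH": f"{VENV}/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
    })
)
```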
Long-term (tracked by the refactor plan): make policyengine_us_data/__init__.py and policyengine_us_data/geography/__init__.py import-safe in minimal environments — no module-scope pandas, no dataset reads at package load. This is refactor §414–430 ("import and dependency boundaries"). Out of scope here, referenced for continuity.
The short-term fix must land in the same PR as (or before) the parallelization work, because the parallelization cannot be verified end-to-end on Modal without it.
Open questions (resolve before implementation)
Per-chunk benchmark to sanity-check the 50-worker default. Production chunk count is ~207. The test fixture is 50 households / 100 cols (tests/integration/test_chunked_matrix_builder.py:180) — ~52,000× smaller than production, so we cannot extrapolate per-chunk wall time from tests. Proposed: after the venv fix lands and the first build_matrix_chunk_worker is deployable, run one worker on one production-scale 25K-col chunk on Modal, record wall time + peak RSS + shard nnz. With those numbers, verify that num_workers=50 (i.e. ~4 chunks per worker) gives acceptable wall time without cold-start overhead swamping useful work. Only adjust if the benchmark clearly says to.
Where does ChunkedMatrixAssembler live? Candidates: policyengine_us_data/calibration/chunked_matrix_assembler.py (new file) or keep it inside unified_matrix_builder.py below the existing class. Refactor doc prefers separate files per responsibility; I'd follow that.
Should the coordinator run outside the Modal container entirely? Currently the whole pipeline (including the matrix-builder step) runs inside one Modal function. If the coordinator spawns workers, it still needs to be inside some container to call .spawn(). Simplest is to keep the coordinator inside the existing matrix-building Modal function and let it fan out to workers — no architectural change. Flagging for confirmation.
Test strategy for ModalChunkDispatcher. Refactor doc prescribes unit tests with fakes. Proposed: a FakeChunkDispatcher used in unit tests to exercise ChunkedMatrixAssembler partition/collect/assemble logic without Modal; one real Modal smoke integration test gated behind an env flag.
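A FakeChunkDispatcher along these lines — the run(kernel, chunks) interface is assumed, mirroring whatever shape the real dispatchers get — would let unit tests assert on dispatch behavior without Modal:

```python
class FakeChunkDispatcher:
    """Test double: records dispatched chunk specs and returns canned
    shard paths without executing any kernel. Sketch only."""

    def __init__(self, staging_root="/staging/test/matrix_chunks"):
        self.staging_root = staging_root
        self.dispatched = []

    def run(self, kernel, chunks):
        specs = list(chunks)
        self.dispatched.extend(specs)
        # Pretend each chunk produced its shard; the kernel is never called,
        # so tests stay fast and Modal-free. Paths use positional indices.
        return [
            f"{self.staging_root}/chunk_{i:06d}.npz" for i in range(len(specs))
        ]
```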
Acceptance criteria
ChunkedMatrixAssembler class exists in policyengine_us_data/calibration/ with unit tests covering: partition correctness, resume/skip correctness with fake completed shards, streaming CSR assembly correctness on hand-built COO fixtures.
ChunkDispatcher interface with LocalChunkDispatcher and ModalChunkDispatcher implementations. FakeChunkDispatcher used in unit tests.
UnifiedMatrixBuilder.build_matrix_chunked() is a facade that delegates to ChunkedMatrixAssembler. Public signature unchanged. All existing build_matrix_chunked unit and integration tests pass against the facade with no edits.
Full-CPS parallel run completes on Modal via workflow_dispatch on pipeline.yaml with the new --parallel flag (in combination with --chunked-matrix). Wall time documented in the PR description.
--parallel without --chunked-matrix is a silent no-op (no behavior change vs. the non-chunked path).
Output CSR on a small fixture (existing tests/integration/test_chunked_matrix_builder.py fixture) is byte-identical (or nnz/row-sum/column-sum identical modulo stable COO ordering) between LocalChunkDispatcher and ModalChunkDispatcher.
Final-assembly peak memory profiled on a medium fixture is within ~1.1× of final-CSR size (streaming works).
Modal venv fix merged so the deployed app can boot (part of same PR unless a standalone fix lands first).
Towncrier changelog fragment added under changelog.d/ (one .changed.md for the parallelization, one .fixed.md for the venv activation).
Validation plan
Same as the "Verification" block of the earlier plan draft:
Unit: new tests + existing test_chunked_matrix_builder.py green against the facade-over-coordinator layout.
Modal deploy smoke: modal deploy modal_app/pipeline.py succeeds; workflow_dispatch run boots past module imports; modal app logs ap-m5QsBgnRqhJtRMRKfA8myq --env main shows the pipeline progressing into real work.
Single-chunk benchmark: one production-scale 25K-col chunk on Modal → record wall time, peak RSS, nnz → pick chunks_per_worker.
End-to-end parallel run: full ~207-chunk build across num_workers containers → CSR shape / nnz / row-sums match a serial baseline on a reduced fixture.
Resume: interrupt mid-flight, re-dispatch with same run_id, confirm completed chunks skip.
Memory profile: peak RSS during assembly ≤ ~1.1× final-CSR size on a medium fixture.
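For the memory-profile step, peak RSS can be sampled with a stdlib-only helper (Unix-only sketch; note the platform-dependent ru_maxrss units):

```python
import resource
import sys

def peak_rss_bytes() -> int:
    """Return this process's peak resident set size in bytes.
    ru_maxrss is reported in KiB on Linux but bytes on macOS, so
    normalize before comparing against the final-CSR size."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == "darwin" else peak * 1024
```

Sampling this after assembly and dividing by the final CSR's byte size gives the ≤ ~1.1× check directly.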
Related

US Data Pipeline Refactor.md (Phase 4 extraction of UnifiedMatrixBuilder; §414–430 on import-time dependency boundaries).
modal_app/local_area.py:442–454 — reference pattern (build_areas_worker + max_containers=50).
modal_app/pipeline.py — existing .spawn() / .get() usages and run_id / volume conventions.
Failed function calls cited above: fc-01KPF4SQNGRS82ZYM81FZ0DWCX, fc-01KPNAJGTJC6W1CHD6W3G0BS8T, fc-01KPPWKY5HJ07C9WEEZHGBN5VX, fc-01KPPZ6V3W2VZ9BJNT949CZKWA, fc-01KPQVWZQZ1ZKWCJ0RCBB35FMH, fc-01KPS3XADWAFFBVE01NRWACFR2.