A deterministic discrete-event simulator for blob-aggregation strategies in the fil.one storage layer. It answers one question with numbers instead of intuition:
Given how clients actually write and delete data, which packing / flush / delete-batching strategy minimizes cost without blowing the delete-latency SLA — and by how much?
It models how a storage provider (SP) aggregates client blobs into PDP pieces, commits them to chain, and times deletes, and it accounts for gas, storage cost, infra, client revenue, and delete-enactment latency (including the upper-bound SLA the team cares about) over a simulated horizon.
This README explains what is modeled, how, and why the outputs are the right things to measure.
For the underlying math see docs/MODELING.md; for the architecture see
docs/DESIGN.md; for the Go contract see docs/INTERFACES.md.
fil.one sits between storage clients and Filecoin storage providers:

    clients ──$/TB/month──▶ fil.one ──$/TB/month──▶ SP ──gas per chain op──▶ Filecoin (PDP)
    (S3 PUT/DELETE)                   (storage)     + disk/power (infra)

Clients PUT arbitrarily-sized blobs. The SP can't put every tiny blob on chain individually — gas would dwarf the data — so it aggregates many blobs into one PDP piece (a committed root CID) and commits pieces to a data set that is proven on chain every proving period. Three decisions drive the whole cost structure, and they fight each other:
| Lever | Wait longer / pack bigger ⇒ | Cost of doing so |
|---|---|---|
| How to pack (aggregate size) | fewer adds per byte; gas amortized | a delete inside a big piece forces a bigger rewrite |
| When to flush (to chain) | bigger, cheaper batches | data sits off-chain longer (held but unbilled/at-risk) |
| How long to batch deletes | fewer remove txs; gas saved | higher delete-enact latency (SLA risk) + longer paying-for-deleted-data |
The punchline the team cares about: batching is not free. It trades gas for latency and for a window where the SP stores (and proves, and pays for) data the client already asked to delete. There is no closed-form answer because it depends on the workload — the size distribution and the churn (how fast blobs die). So we simulate.
Why a simulator and not a spreadsheet: the costs are time-integrals over a changing system (bytes on chain rise and fall as pieces are added, proven each period, and removed) and the deletes are path-dependent (which blobs share a piece at write time determines the rewrite cost at delete time). A spreadsheet can't capture "this blob died while still buffered so it never hit chain" or "these two deletes landed in the same batch window so they shared one remove tx." A discrete-event model can, exactly.
Every output maps to a real dollar or a real SLA commitment. A run prints (see §6 for a full table):
| Metric | What it is | Why it's the metric that matters |
|---|---|---|
| gas: add / remove / proving / create | per PDP op class, in FVM units + FIL (price-independent) and USD | gas is the SP's per-op chain cost; broken out so you see which op dominates, and reported in FIL so it survives FIL-price changes |
| gas: proving (recurring) | per-data-set per-period proving cost | the cost floor at low churn — it runs forever regardless of activity |
| write amplification | chain bytes written ÷ unique bytes stored | how much survivor data gets rewritten on deletes — the hidden cost of packing |
| zombie storage % | deleted-but-not-yet-removed byte-seconds ÷ total | the share of storage-time spent on data the client already deleted (pure loss) |
| delete latency mean/p95/p99/MAX | request → on-chain removal | the SLA metric; MAX is the headline upper bound, not the average |
| SLA violations | per-client deletes past `sla_max_delete` | direct contract-breach count |
| time-to-chain | arrival → proven on chain | how long data is at-risk / off-chain before it's committed |
| revenue / storage / infra / margins | the money | split across the two balance sheets (see §3.5) |
The design principle: report the upper bound, not just the average. Delete latency is reported as a full distribution with MAX first-class, because an SLA is a promise about the worst case.
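To make the "upper bound, not just the average" point concrete, here is a minimal Go sketch of a latency summary. It is an illustration, not the code in internal/metrics: the `LatencySummary` type, the `Summarize` function, and the nearest-rank percentile rule are assumptions made for this example.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// LatencySummary mirrors the delete-latency row of the report:
// mean, tail percentiles, the worst case, and the SLA-violation count.
type LatencySummary struct {
	Mean, P95, P99, Max float64
	Violations          int
}

// percentile returns the nearest-rank percentile of an ascending-sorted,
// non-empty slice; p is in (0, 100].
func percentile(sorted []float64, p float64) float64 {
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// Summarize computes the summary for latencies (seconds) against slaMax (seconds).
func Summarize(latencies []float64, slaMax float64) LatencySummary {
	if len(latencies) == 0 {
		return LatencySummary{}
	}
	s := append([]float64(nil), latencies...) // copy so the caller's slice is untouched
	sort.Float64s(s)
	var sum float64
	violations := 0
	for _, v := range s {
		sum += v
		if v > slaMax {
			violations++
		}
	}
	return LatencySummary{
		Mean:       sum / float64(len(s)),
		P95:        percentile(s, 95),
		P99:        percentile(s, 99),
		Max:        s[len(s)-1],
		Violations: violations,
	}
}

func main() {
	// Nine fast deletes and one straggler: the mean hides what MAX exposes.
	lat := []float64{60, 60, 60, 60, 60, 60, 60, 60, 60, 7200}
	s := Summarize(lat, 3600)
	fmt.Printf("mean=%.0fs p95=%.0fs p99=%.0fs max=%.0fs violations=%d\n",
		s.Mean, s.P95, s.P99, s.Max, s.Violations)
}
```

With one slow delete out of ten, the mean stays well under the one-hour SLA used here while the maximum breaches it, which is why MAX is reported first-class.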
- Blob: one client PUT, with `id`, `clientID`, `sizeBytes`, `arrivalTime`, `deleteTime`, `state`.
- Root / piece: a committed PDP root CID holding one or more blobs; the unit added to / removed from chain. Tracks `generation` (how many rewrites produced it).
- DataSet: a PDP proof set, i.e. a collection of pieces proven together every proving period.
- ChainOp / confirm effect: a queued add/remove that takes effect after a confirmation delay.
A blob walks a fixed state machine (internal/model): Pending (arrived, buffered) → OnChain
(its piece is proven) → MarkedDeleted (delete requested, still on chain) → Removed (removal
enacted on chain). Every cost is attached to a specific transition, so nothing is double-counted and
nothing is free that shouldn't be.
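The state machine above can be sketched as a small Go type. This is a hedged illustration of the four states named in the text, not the actual internal/model code: the `BlobState` type and the `Transition` helper are names invented here, and the sketch only allows the linear path described above.

```go
package main

import "fmt"

// BlobState enumerates the four lifecycle states named in the README.
type BlobState int

const (
	Pending       BlobState = iota // arrived, buffered
	OnChain                        // its piece is proven
	MarkedDeleted                  // delete requested, still on chain
	Removed                        // removal enacted on chain
)

var stateNames = [...]string{"Pending", "OnChain", "MarkedDeleted", "Removed"}

func (s BlobState) String() string {
	if s < Pending || s > Removed {
		return "Unknown"
	}
	return stateNames[s]
}

// Transition accepts only a single step forward along the fixed path.
// Costs would be attached here, once per accepted transition.
func Transition(from, to BlobState) error {
	if from >= Pending && from < Removed && to == from+1 {
		return nil
	}
	return fmt.Errorf("illegal blob transition %s -> %s", from, to)
}

func main() {
	s := Pending
	for _, next := range []BlobState{OnChain, MarkedDeleted, Removed} {
		if err := Transition(s, next); err != nil {
			fmt.Println(err)
			return
		}
		fmt.Printf("%s -> %s\n", s, next)
		s = next
	}
	// Going backwards is rejected.
	fmt.Println(Transition(Removed, Pending))
}
```

Routing every state change through one checked function is what lets each cost hang off exactly one transition.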
internal/engine is a min-heap of events ordered by (time, sequence). The loop pops the earliest
event, advances the clock, dispatches to a handler, and handlers schedule new events. Event kinds:
BlobArrival, BlobDelete, AggregateTick, BatcherTick, DeleteTick, ChainConfirm, ProvingPeriod, WarmupEnd.
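A minimal sketch of such an event queue, using Go's standard `container/heap`. The `Event`, `Queue`, `Schedule`, and `Drain` names are assumptions for illustration and may not match internal/engine; the point is the `(time, sequence)` ordering, where the sequence number breaks ties deterministically.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Event is a scheduled occurrence; Seq breaks ties between equal times.
type Event struct {
	Time float64
	Seq  uint64
	Kind string
}

// eventHeap implements heap.Interface ordered by (Time, Seq).
type eventHeap []Event

func (h eventHeap) Len() int { return len(h) }
func (h eventHeap) Less(i, j int) bool {
	if h[i].Time != h[j].Time {
		return h[i].Time < h[j].Time
	}
	return h[i].Seq < h[j].Seq
}
func (h eventHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *eventHeap) Push(x any)   { *h = append(*h, x.(Event)) }
func (h *eventHeap) Pop() any {
	old := *h
	n := len(old)
	e := old[n-1]
	*h = old[:n-1]
	return e
}

// Queue hands out monotonically increasing sequence numbers on Schedule.
type Queue struct {
	h   eventHeap
	seq uint64
}

func (q *Queue) Schedule(t float64, kind string) {
	q.seq++
	heap.Push(&q.h, Event{Time: t, Seq: q.seq, Kind: kind})
}

// Next pops the earliest event, or reports false when the queue is empty.
func (q *Queue) Next() (Event, bool) {
	if q.h.Len() == 0 {
		return Event{}, false
	}
	return heap.Pop(&q.h).(Event), true
}

// Drain pops everything and returns the event kinds in dispatch order.
func Drain(q *Queue) []string {
	var kinds []string
	for {
		e, ok := q.Next()
		if !ok {
			return kinds
		}
		kinds = append(kinds, e.Kind)
	}
}

func main() {
	q := &Queue{}
	q.Schedule(10, "BlobDelete")
	q.Schedule(5, "BlobArrival")
	q.Schedule(5, "AggregateTick") // same time as BlobArrival, scheduled later
	fmt.Println(Drain(q))
}
```

Events at the same timestamp come out in the order they were scheduled, so the dispatch order never depends on heap internals.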
Determinism is a first-class property. Same config + same seed ⇒ byte-identical results. RNG is
a registry of independent streams keyed by (clientID, purpose) (internal/engine/rng.go), so
adding a client or a new random draw never perturbs the existing streams — every strategy in a
comparison sees the exact same workload. This is what makes a compare or sweep apples-to-apples:
the only thing that changes between columns is the strategy, never the traffic.
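One way to get streams that do not perturb each other is to derive each stream's seed from the run seed plus its key. The sketch below assumes an FNV-1a hash and Go's `math/rand`; the `Registry` type and `Stream` method are hypothetical names, and the real internal/engine/rng.go may derive seeds differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

type streamKey struct {
	clientID int
	purpose  string
}

// Registry lazily creates one independent RNG stream per (clientID, purpose).
type Registry struct {
	seed    int64
	streams map[streamKey]*rand.Rand
}

func NewRegistry(seed int64) *Registry {
	return &Registry{seed: seed, streams: make(map[streamKey]*rand.Rand)}
}

// Stream returns the stream for the key, creating it on first use.
// The stream's seed depends only on (run seed, clientID, purpose),
// never on which other streams exist or how much they have been used.
func (r *Registry) Stream(clientID int, purpose string) *rand.Rand {
	k := streamKey{clientID: clientID, purpose: purpose}
	if s, ok := r.streams[k]; ok {
		return s
	}
	h := fnv.New64a()
	fmt.Fprintf(h, "%d/%d/%s", r.seed, clientID, purpose)
	s := rand.New(rand.NewSource(int64(h.Sum64())))
	r.streams[k] = s
	return s
}

func main() {
	a := NewRegistry(42)
	first := a.Stream(1, "size").Int63()

	// A second run adds another client and draws from it first.
	b := NewRegistry(42)
	b.Stream(2, "size").Int63()
	second := b.Stream(1, "size").Int63()

	fmt.Println("client 1 unaffected by client 2:", first == second)
}
```

Because no stream shares state with another, adding a client or a new draw purpose leaves every existing client's workload unchanged.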
Clients are defined as archetypes with a count: one entry spawns N statistically-identical,
independently-seeded clients, so you scale from 1 to thousands by changing one number. Each archetype
has a Poisson arrival rate and configurable size and lifetime (churn) distributions:
const, uniform, normal, lognormal, exponential, pareto, and a histogram type that replays
measured telemetry (sampling log-uniformly within a bucket so values vary inside ranges, not just at
the named edges). The point: you can drive the model with real measured write-size and churn
distributions, not just toy parametric ones.
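The histogram replay can be sketched as two draws: pick a bucket by weight, then sample log-uniformly between its edges. The `Bucket` type, the bucket edges, and the weights below are made up for illustration; they are not the shipped default histogram.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// Bucket is one histogram bin: edges with 0 < Lo < Hi, and a relative weight.
type Bucket struct {
	Lo, Hi float64
	Weight float64
}

// NewSeeded returns a deterministic RNG for the given seed.
func NewSeeded(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// SampleHistogram picks a bucket in proportion to its weight, then draws
// log-uniformly inside it: exp(U(ln Lo, ln Hi)). buckets must be non-empty.
func SampleHistogram(rng *rand.Rand, buckets []Bucket) float64 {
	var total float64
	for _, b := range buckets {
		total += b.Weight
	}
	u := rng.Float64() * total
	chosen := buckets[len(buckets)-1] // fallback guards against rounding at the top edge
	for _, b := range buckets {
		if u < b.Weight {
			chosen = b
			break
		}
		u -= b.Weight
	}
	logLo, logHi := math.Log(chosen.Lo), math.Log(chosen.Hi)
	v := math.Exp(logLo + rng.Float64()*(logHi-logLo))
	// Clamp away floating-point overshoot at the edges.
	return math.Min(math.Max(v, chosen.Lo), chosen.Hi)
}

func main() {
	// Hypothetical telemetry: 70% of blobs in 4 KiB..64 KiB, 30% in 64 KiB..1 MiB.
	buckets := []Bucket{
		{Lo: 4 << 10, Hi: 64 << 10, Weight: 0.7},
		{Lo: 64 << 10, Hi: 1 << 20, Weight: 0.3},
	}
	rng := NewSeeded(1)
	for i := 0; i < 5; i++ {
		fmt.Printf("%.0f bytes\n", SampleHistogram(rng, buckets))
	}
}
```

Sampling in log space spreads values evenly across orders of magnitude inside a bucket, which suits size ranges whose edges are powers of two.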
Everything under test is a pluggable interface (internal/strategy), so a study is a config change,
not a code change:
- Aggregator (how blobs pack, when a piece seals): `none`, `fixed_size`, `time_window`, `size_or_time`, `churn_aware` (bucket blobs by predicted lifetime so whole pieces die together).
- Batcher (when queued chain ops submit): `immediate`, `fixed_interval`, `size_threshold`.
- DeletePolicy (when marked-deleted blobs are compacted): `immediate`, `batched`, `sla_bounded`, and `garbage_collected` (the compaction-timing lever: tombstone deletes and compact an aggregate only once its garbage fraction crosses a `threshold`, the design knob, with an optional `max_age` cap and SLA force).
- Rewrite model: `full` (the faithful PDP model) vs `partial` (a counterfactual lower bound, not buildable); see §3.6.
Gas is contract-grounded and not flat per tx. AddPieces and ProvePossession scale
logarithmically with the data set's current piece count (and proving also with byte size), fit
to calibnet PDP measurements; the remove/create-dataset ops and the FilecoinWarmStorageService
service-contract surcharge are grounded against the PDPVerifier.sol/FilecoinWarmStorageService.sol
source — see docs/GAS_GROUNDING.md for the full methodology, the EVM↔FVM unit problem, and what
is measured vs. projected (internal/cost/gas.go, anchors in docs/MODELING.md §2):

    G_add(n)     = add.base + add.per_ln_piece · ln(n)                                   # one-time, on add
    G_prove(n,B) = prove.base + prove.per_ln_piece · ln(n) + prove.per_ln_byte · ln(B)   # recurring
    G_next       = next.base                                                             # recurring, ~constant
    G_rem(k)     = rem.base + rem.per_piece · k                                          # one tx drops k pieces
    gas in FIL   = units · gas_price · 1e-18
    gas in USD   = gas_FIL · fil_usd                                                     # post-hoc

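The formulas above translate directly into Go. This is a sketch, not internal/cost/gas.go: the `GasParams` struct and its field names are assumptions, the coefficients in `main` are placeholders rather than the calibnet-fitted values, and `gas_price` is taken to be in attoFIL per gas unit, as the `1e-18` factor suggests.

```go
package main

import (
	"fmt"
	"math"
)

// GasParams holds the model coefficients, all in FVM gas units.
type GasParams struct {
	AddBase, AddPerLnPiece                     float64
	ProveBase, ProvePerLnPiece, ProvePerLnByte float64
	NextBase                                   float64
	RemBase, RemPerPiece                       float64
}

// ln returns the natural log, treating inputs below 1 as 1 (ln = 0)
// so an empty data set never yields a negative or infinite term.
func ln(x float64) float64 {
	if x < 1 {
		return 0
	}
	return math.Log(x)
}

// Add is G_add(n): one-time gas for an add at piece count n.
func (p GasParams) Add(n int) float64 {
	return p.AddBase + p.AddPerLnPiece*ln(float64(n))
}

// Prove is G_prove(n, B): recurring gas at piece count n and byte size B.
func (p GasParams) Prove(n int, bytes float64) float64 {
	return p.ProveBase + p.ProvePerLnPiece*ln(float64(n)) + p.ProvePerLnByte*ln(bytes)
}

// Next is G_next: the roughly constant per-period cost.
func (p GasParams) Next() float64 { return p.NextBase }

// Remove is G_rem(k): one transaction dropping k pieces.
func (p GasParams) Remove(k int) float64 {
	return p.RemBase + p.RemPerPiece*float64(k)
}

// GasFIL converts gas units to FIL given a price in attoFIL per unit.
func GasFIL(units, gasPriceAttoFIL float64) float64 {
	return units * gasPriceAttoFIL * 1e-18
}

// GasUSD applies the FIL/USD price as a final, post-hoc scalar.
func GasUSD(gasFIL, filUSD float64) float64 { return gasFIL * filUSD }

func main() {
	// Placeholder coefficients for illustration only.
	p := GasParams{
		AddBase: 5e7, AddPerLnPiece: 1e6,
		ProveBase: 8e7, ProvePerLnPiece: 2e6, ProvePerLnByte: 5e5,
		NextBase: 3e7,
		RemBase:  4e7, RemPerPiece: 1e6,
	}
	perPeriod := p.Prove(1000, 1<<40) + p.Next() // proving floor for one data set
	fil := GasFIL(perPeriod, 1e9)
	fmt.Printf("per-period floor: %.3e gas = %.6f FIL = $%.4f at $0.70/FIL\n",
		perPeriod, fil, GasUSD(fil, 0.70))
}
```

Keeping the unit conversion in separate functions mirrors the post-hoc pricing idea: the gas units are computed once, and any price can be applied afterwards.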
Prices are applied post-hoc, so you can re-price a run instead of re-running it. The sim
accumulates only price-independent invariants — FVM gas units and TB-months — and reports gas in
both FVM gas and FIL, storage in TB-months. Pricing is a final scalar layer, matched to
each flow's real settlement token: gas settles in FIL (so FIL/USD, which is volatile — ~$0.70
today — applies to gas only); storage/revenue settle in USDFC (USD-pegged, so the $/TB/month
rate applies and they don't move with FIL). pdp-sim reprice -i summary.json --fil-usd 0.70
re-prices a saved run in milliseconds — making gas-price/storage-price sensitivity a reprice, not
a re-run. See docs/GAS_GROUNDING.md.
The recurring proving cost is the floor: for every active data set, every proving period, the SP
pays G_prove + G_next whether or not anything changed. At low churn this dominates total gas — so
minimizing the number of data sets matters more than minimizing adds, a non-obvious result the
model surfaces directly. A ProvingPeriod event fires per data set and re-reads its current
(piece count, bytes), so the floor tracks the data set as it grows and shrinks.
The ledger is an exact time-integral, not a sampler (internal/cost/ledger.go). It advances the
clock and accrues value · Δt on every byte-count change, so storage/revenue integrals are exact
to the event, not approximated by periodic sampling. It tracks three distinct byte counts because
they have different lifetimes, and conflating them is the usual way these models go wrong:
| Quantity | Window | Drives | Whose money |
|---|---|---|---|
| proven bytes | piece on chain → removed | storage payment S_sp | fil.one pays SP |
| held bytes | arrival (PUT) → removal enacted | infra S_infra (disk/power) | SP's own cost |
| billed bytes | arrival → delete request (default) | revenue R_c | clients pay fil.one |
Held ⊇ proven: the extra is the off-chain holding window ([arrival, on-chain]) where the SP is
already burning disk but not yet being paid — modeling infra on held (not proven) bytes is what
makes lazy flush show its true holding cost.
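The exact-integral idea can be sketched with one small accumulator per byte count. The `Integral` and `Ledger` types below are illustrative names, not the internal/cost/ledger.go API, and the timeline in `main` is a made-up single blob.

```go
package main

import "fmt"

// Integral accrues level * dt every time the level changes, so the
// byte-seconds total is exact to the event rather than sampled.
type Integral struct {
	now, level, area float64
}

func (i *Integral) advance(t float64) {
	if t > i.now {
		i.area += i.level * (t - i.now)
		i.now = t
	}
}

// Change moves the clock to t, then applies the byte delta.
func (i *Integral) Change(t, delta float64) {
	i.advance(t)
	i.level += delta
}

// At returns the accumulated byte-seconds up to time t.
func (i *Integral) At(t float64) float64 {
	i.advance(t)
	return i.area
}

// Ledger tracks the three byte counts separately because their windows differ.
type Ledger struct {
	Proven, Held, Billed Integral
}

func main() {
	var l Ledger
	const size = 1000.0 // bytes

	l.Held.Change(0, size)      // t=0:   PUT arrives, SP starts holding it
	l.Billed.Change(0, size)    //        and the client starts paying
	l.Proven.Change(100, size)  // t=100: piece is on chain
	l.Billed.Change(500, -size) // t=500: delete requested, billing stops
	l.Proven.Change(600, -size) // t=600: removal enacted
	l.Held.Change(600, -size)

	fmt.Printf("held=%.0f proven=%.0f billed=%.0f byte-seconds\n",
		l.Held.At(600), l.Proven.At(600), l.Billed.At(600))
}
```

In this timeline the held integral exceeds the proven one by the off-chain window (t=0 to t=100), and billing ends 100 seconds before the bytes actually leave the chain.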
Margin is split across the two balance sheets, because the storage payment is an internal transfer (fil.one's cost, the SP's revenue) and cancels in the consolidated view — subtracting it once on each side, as a naive single "margin" does, is double-counting:

    fil.one margin = Revenue − StoragePayment
    SP margin      = StoragePayment − Infra − Gas
    system margin  = Revenue − Infra − Gas          (= fil.one + SP; the transfer cancels)

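A short worked example of the split, with made-up round numbers. The `Flows` type is hypothetical; the arithmetic is exactly the three formulas above.

```go
package main

import "fmt"

// Flows holds the four money flows of a run, all in USD.
type Flows struct {
	Revenue        float64 // clients -> fil.one
	StoragePayment float64 // fil.one -> SP (internal transfer)
	Infra          float64 // SP's disk/power cost
	Gas            float64 // SP's chain cost
}

func (f Flows) FilOneMargin() float64 { return f.Revenue - f.StoragePayment }
func (f Flows) SPMargin() float64     { return f.StoragePayment - f.Infra - f.Gas }
func (f Flows) SystemMargin() float64 { return f.Revenue - f.Infra - f.Gas }

func main() {
	// Illustrative values only, not simulator output.
	f := Flows{Revenue: 100, StoragePayment: 60, Infra: 20, Gas: 15}
	fmt.Printf("fil.one=%.0f SP=%.0f system=%.0f\n",
		f.FilOneMargin(), f.SPMargin(), f.SystemMargin())
	// The storage payment appears with opposite signs on the two balance
	// sheets, so it drops out of the sum.
	fmt.Println("transfer cancels:", f.FilOneMargin()+f.SPMargin() == f.SystemMargin())
}
```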
When blob b in aggregate r is deleted, b is marked deleted at t_d (billing stops, the zombie
clock starts) and the DeletePolicy decides when to compact. deleteLatency = enactTime − t_d.
PDP cannot remove a member from a committed aggregate — removePieces drops whole aggregates,
with no operation to excise one blob. So to reclaim a deleted blob's space you must remove the whole
aggregate and re-add a new one without it (strategy.rewrite.type: full, the faithful model;
survivors are re-committed → write amplification, up to 241× on 1 GiB aggregates with eager deletes).
partial (shrink-in-place / sub-piece removal) is kept only as a counterfactual lower bound to
quantify what that constraint costs — it is not a buildable option, so don't read full-vs-partial
as a design choice. The real levers under forced full-rewrite are aggregate size, churn-aware
co-location, and compaction timing.
Removal is scheduled, not immediate. A removePieces only takes effect when the proving period
it was scheduled in passes — at the next proving boundary for that data set. So delete latency has a
floor of up to one proving period, the deleted bytes stay proven and paid-for until then, and
during a rewrite the survivors are double-proven (old + new aggregate) over the lag. These are
real costs the model now charges; sla_bounded submits confirm_delay + proving_period earlier to
still meet the SLA.
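The timing rule can be sketched as follows. This is a simplified reading of the paragraph above, not the simulator's handler code: it assumes proving boundaries fall at multiples of the period from time zero, that a removal lands `confirm_delay` after submission, and that it is enacted at the first boundary strictly after it lands. `EnactTime` and `LatestSubmit` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"math"
)

// EnactTime returns when a removal submitted at `submit` takes effect:
// the first proving boundary strictly after it lands on chain.
// Boundaries are assumed to sit at multiples of `period` (an assumption).
func EnactTime(submit, confirmDelay, period float64) float64 {
	landed := submit + confirmDelay
	return (math.Floor(landed/period) + 1) * period
}

// LatestSubmit returns the latest submission time that still guarantees
// enactment within slaMax of the delete request, by budgeting the confirm
// delay plus a full proving period. A result earlier than requestTime
// means the SLA cannot be guaranteed under these parameters.
func LatestSubmit(requestTime, slaMax, confirmDelay, period float64) float64 {
	return requestTime + slaMax - confirmDelay - period
}

func main() {
	// Hypothetical parameters, in seconds.
	const (
		request = 1000.0
		sla     = 86400.0 // 24 h
		confirm = 1800.0
		period  = 3600.0
	)
	submit := LatestSubmit(request, sla, confirm, period)
	enact := EnactTime(submit, confirm, period)
	fmt.Printf("submit by t=%.0f, enacted at t=%.0f, latency=%.0f (SLA %.0f)\n",
		submit, enact, enact-request, sla)
}
```

Budgeting a full proving period is conservative: the wait after landing is at most one period, so the enacted latency never exceeds the SLA under these assumptions.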
Compaction timing is the real lever (the garbage_collected policy). Since the rewrite is
unavoidable, the SP decides when to pay it: tombstone deletes and rewrite an aggregate only once
its garbage fraction (dead ÷ total bytes) crosses a threshold. One knob sweeps the whole trade-off
— at threshold = 1.0 (the default) an aggregate is dropped only when fully dead (write amp ~1.0,
but max zombie storage and latency); lower thresholds rewrite eagerly (higher write amp/gas, less
zombie). An optional max_age cap bounds tombstone debt, and an SLA force overrides a lazy
threshold. See docs/MODELING.md §5.3 for the measured trade-off curve.
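The compaction decision reduces to a small predicate. The sketch below is an assumption about its shape, not the shipped policy: the `Aggregate` fields and `ShouldCompact` are invented names, "crosses" is read as "greater than or equal to" so that a threshold of 1.0 fires exactly when the aggregate is fully dead, and the SLA force is left out.

```go
package main

import "fmt"

// Aggregate carries just what the decision needs.
type Aggregate struct {
	TotalBytes      int64
	DeadBytes       int64   // bytes of tombstoned blobs
	OldestTombstone float64 // time of the earliest outstanding delete request
}

// ShouldCompact reports whether the aggregate should be rewritten now.
// threshold is the garbage fraction (dead / total) that triggers a rewrite;
// maxAge > 0 additionally caps how long a tombstone may wait.
func ShouldCompact(a Aggregate, now, threshold, maxAge float64) bool {
	if a.DeadBytes == 0 || a.TotalBytes == 0 {
		return false // nothing to reclaim
	}
	garbage := float64(a.DeadBytes) / float64(a.TotalBytes)
	if garbage >= threshold {
		return true
	}
	if maxAge > 0 && now-a.OldestTombstone >= maxAge {
		return true
	}
	return false
}

func main() {
	half := Aggregate{TotalBytes: 1 << 30, DeadBytes: 1 << 29, OldestTombstone: 0}
	fmt.Println("threshold 1.0, half dead:  ", ShouldCompact(half, 100, 1.0, 0))
	fmt.Println("threshold 0.25, half dead: ", ShouldCompact(half, 100, 0.25, 0))
	fmt.Println("threshold 1.0, max_age hit:", ShouldCompact(half, 7200, 1.0, 3600))
}
```

Sweeping `threshold` from 1.0 downward moves the policy from lazy (drop only fully dead aggregates) to eager (rewrite on the first few dead bytes).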
- Confirmation delay: a fixed latency from submit to on-chain (no reorgs modeled). It propagates into the SLA logic: `sla_bounded` submits early enough to cover `confirm_delay` (plus the proving-period lag described above), so the enactment, not the submission, meets the deadline.
- Warmup: `simulation.warmup` excludes a cold-start window from the reported totals (the ledger and metrics reset at the boundary while system state carries forward), so you measure steady state.
This is the part to show a skeptical reviewer. Confidence comes from four places, not from "the code looks right":
- Determinism. Same config + seed ⇒ byte-identical totals (`TestDeterminism`). Results are reproducible and comparisons are controlled: strategy is the only variable.
- Analytic invariants hold. `no-aggregation` ⇒ pieces == blobs and write amplification == exactly 1.0; `immediate` delete with zero confirm delay ⇒ every delete enacted instantly (`TestNoAggregationInvariants`). These have known closed-form answers and the simulator hits them.
- Conservation is enforced. A regression test asserts the ledger's proven-byte count always equals the bytes of pieces actually on chain (`TestQueuedAddDeleteRaceConserved`, `TestPartialRewrite`): bytes can't be created, lost, or double-counted as pieces are added, rewritten, partially shrunk, and removed across the confirm delay. Each of these tests was confirmed to fail on the pre-fix code, so they're guarding real bugs, not tautologies.
- SLA guarantees are tested under adversarial timing. `sla_bounded` ⇒ zero violations and `max latency ≤ SLA`, verified with both zero and 12-hour confirm delays (`TestSLABounded`).
And the model behaves sensibly under sweeps — e.g. tightening the delete-batch interval trades a monotonic decrease in zombie storage and latency for a monotonic increase in write-amp and gas, a coherent five-metric trade-off curve rather than noise. Coherence across independent metrics is itself evidence the model is internally consistent.
Intellectual honesty about the model's edges is part of the argument. The relative ranking of strategies is trustworthy earlier than the absolute USD figures.
| Grounded | Placeholder / assumption |
|---|---|
| AddPieces, ProvePossession, NextProvingPeriod gas (calibnet anchors) | RemovePieces & CreateDataSet gas — not yet in the benchmark set |
| Billing structure (per-TB-month, paid on proven bytes) | Gas price & FIL→USD — set per experiment |
| Log-scaling of add/prove with piece count | Size histogram default has overlapping buckets (needs a clean disjoint set) |
| Recurring proving floor exists and dominates at low churn | Confirmation modeled as fixed delay; no reorgs |
| Full-rewrite forced by PDP + scheduled (proving-period) removal | Compaction timing (when to do the unavoidable rewrite) not yet a tunable lever |
All of these are config inputs or flagged open questions (docs/MODELING.md §7-8), not hidden
fudge factors. Treat absolute USD as indicative until the placeholder gas is measured on the target
network; treat comparisons between strategies as valid now.

    go build ./...
    go test ./...          # runs the validation suite in §4

    # Single run from a config (wired with uber-go/fx)
    go run . run -c config.example.yaml

    # Compare strategies side by side: same workload/seed, apples-to-apples
    go run . compare -c configs/fixed-size.yaml -c configs/fixed-size-partial.yaml

    # Sweep one key across values (numeric OR categorical), with a progress bar
    go run . sweep -c configs/fixed-size.yaml --key strategy.delete_policy.params.interval --values 600,21600,86400
    go run . sweep -c configs/fixed-size.yaml --key strategy.rewrite.type --values full,partial

    # Re-price a saved run under different prices. NO re-run. Gas re-prices via FIL/USD; the
    # USDFC storage rate re-prices storage independently (prices are post-hoc, §3.5).
    go run . run -c config.example.yaml -o /tmp/run
    go run . reprice -i /tmp/run/summary.json --fil-usd 0.70 --sp-tb-month 2.27

Output formats are set per-config under `output.formats` (`table`, `json`, `csv`); `-o <dir>` overrides the output directory. The output reports gas in FVM units and FIL (price-independent) plus USD, and storage in TB-months, so a saved run can be re-priced. A single-run table:

    write amplification               1.790
    roots created                     96
      of which rewrites               57 (max gen 3)
    gas: TOTAL                        $10.51
    gas: TOTAL (FIL)                  2.102121 FIL    # price-independent SP gas burden
    gas: TOTAL (FVM gas)              2.102e+09
    proven storage (TB-months)        0.00
    revenue (clients→fil.one)         $0.01
    storage payment (fil.one→SP)      $0.01
    fil.one margin (rev−storage)      $0.00
    SP margin (storage−infra−gas)     $-10.51
    system margin (rev−infra−gas)     $-10.51
    zombie storage %                  41.96%

(Small absolute dollars here are just a toy run; scale duration and volume up for realistic magnitudes. Gas dominates at this scale because the proving floor and per-op costs don't shrink with a tiny dataset.)

    cmd/                 cobra CLI (run / compare / sweep / reprice), fx wiring
    internal/engine/     deterministic event heap + seeded RNG registry
    internal/model/      Blob, Root (piece), DataSet, ChainOp, state machine
    internal/workload/   client archetypes + distributions (incl. histogram replay)
    internal/strategy/   Aggregator / Batcher / DeletePolicy + registry (the experiment surface)
    internal/cost/       PDP gas model (FVM units) + physical ledger + post-hoc PriceSet (FIL/USD)
    internal/metrics/    latency & time-to-chain distributions, write-amp, zombie, generations, SLA
    internal/report/     terminal tables + JSON/CSV export + reprice
    internal/sim/        orchestrator: event handlers tying it all together + validation tests
    configs/             example strategy variants for compare/sweep
    bench/               PiB-scale reference workload + golden-output regression guard
    docs/                DESIGN.md, MODELING.md (the math), INTERFACES.md, GAS_GROUNDING.md
    analysis/            sensitivity scan + FINDINGS.md, SUMMARY.md (team-facing)

Dual-licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option. Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.