perf: benchmark infrastructure — N=10 sampling, 95% bootstrap CI, Apple Silicon baseline#503
Merged
perf: benchmark infrastructure — N=10 sampling, 95% bootstrap CI, Apple Silicon baseline#503
Conversation
Contributor
Benchmark Results (Linux x86-64)
CLI Tool Benchmarks
|
2df1681 to
010302f
Compare
010302f to
e1e0594
Compare
…le silicon baseline
e1e0594 to
0140bdf
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Refactors the benchmark reporting pipeline so the dashboard can make statsig claims with confidence. Four coordinated changes:
N=10 sampling by default on local/manual refresh —
bench_computereadsBENCH_RUNS(default 1, so per-PR CI pays no extra cost).benchmarks/run.shandupdate-benchmarks.ymlsetBENCH_RUNS=10. All 10 samples are stored and passed through to the JSON, not just the minimum.95% bootstrap confidence interval per language per benchmark, computed in
assemble_json.pyvia 2000-iteration resampling (reproducible seed0xC4AD). For each bench, the published JSON includes the median,ci_lo,ci_hi, sample countn, and pre-formattedci_label. N=1 benchmarks (startup, which internally averages 50 launches) fall back to a ±5% halo. A minimum CI width of 1% is enforced for N≥3 cases to protect against degenerate "all samples identical" collapsing to a zero-width interval.CI-overlap tie logic — two languages are treated as tied when their 95% CIs overlap. This is the standard non-parametric criterion for "not statistically distinguishable at the 95% level." Rankings computed on this basis are honest statsig claims, not rule-of-thumb heuristics.
No more rank-based filtering — the
place > 3 → dropblock inassemble_json.pyis gone. Every benchmark that produces a ChadScript result is published. Matmul (which was previously hidden because chad ranked Add cross compilation support #5 on Linux CI) is now visible. String-heavy benchmarks where chad is genuinely weak are visible too.Plus: the dashboard Vue component (
BenchmarkBars.vue) now renders a CI whisker overlaid on each bar, showing where the 95% interval sits. Hovering showsci_label(e.g.0.596s (0.580s–0.612s)) and sample count.Motivation: the jitter problem, resolved
Earlier refreshes showed ~8% swings on identical code between runs (see recent history on
docs/public/benchmarks.json). Best-of-N sampling cuts the point-estimate variance, but rankings computed on point estimates alone still flip on sub-5% gaps. N=10 + bootstrap CI + CI-overlap-tie fixes this:The dashboard can now legitimately say: "Benchmarks run 10 times on dedicated hardware. Reported values are medians with 95% bootstrap confidence intervals. Results are treated as tied when their CIs overlap — differences smaller than the overlap are not statistically distinguishable at the 95% level."
Methodology change: Apple Silicon as primary
Switching the published dashboard's primary platform from Linux x86-64 on GitHub Actions shared VMs to Apple Silicon M-series (arm64 native, dedicated hardware). Reasons:
--target-cpu=x86-64, LLVM's cost model doesn't unlock loop vectorization even with the fix applied. On Apple Silicon with-march=native, NEON vectorizes the inner reduction and matmul goes from ~1000ms (hidden by rank filter) to 0.109s, within 10% of hand-tuned C.The
.github/workflows/update-benchmarks.ymlworkflow still runs on Linux x86-64 (it's GHA-hosted), but is now the secondary measurement. The primary dashboard numbers come from a localBENCH_RUNS=10 ./benchmarks/run.shrun on Apple Silicon, committed in this PR.Results (N=10, 95% bootstrap CI, Apple Silicon M-series)
New 🥇 wins (statistically tied with C via CI overlap):
Confirmed 🥈 tie with C (where previously it was a "near miss"):
Matmul finally visible: chad 0.109s vs C 0.099s, 10% behind C but within 1% of Go. Previously hidden by the rank filter because Linux CI had chad at #5. On Apple Silicon with the PR #499 fix, it's a competitive #3.
Binary Trees story: chad beats both C (0.854s) and Go (0.800s) with a CI that doesn't overlap either — chad is statistically faster than both thanks to Boehm GC's efficient allocation path. But Node at 0.368s wins because V8's escape analysis eliminates tree-node allocations entirely. Honest result: chad is the fastest static-typed language on this benchmark.
Honest weaknesses now shown (no longer filtered):
toUpperCaseandsplitaren't as optimized as libc.Files changed
benchmarks/run.sh,benchmarks/run-ci.sh—bench_computerewritten to collect all N samples and pass them through as a comma-separated CSV in the per-bench JSON file.json_add_resultsignature updated to match.benchmarks/assemble_json.py— rewritten. Parses comma-separated samples, computes median + 2000-iteration bootstrap 95% CI, applies minimum-width floor for degenerate cases, uses CI-overlap for tie detection, no rank-based filtering. ~150 lines total, pure stdlib, reproducible seed..github/workflows/update-benchmarks.yml— setsBENCH_RUNS: 10, timeout bumped to 40 min.docs/benchmarks.md— methodology section rewritten to document N=10, bootstrap CI, CI-overlap-tie rule.docs/.vitepress/theme/BenchmarkBars.vue— newBenchResultfields (ci_lo,ci_hi,ci_label,n). New.card-ciwhisker element overlaid on each bar showing the 95% CI range. Tooltip on hover showsci_labeland sample count.docs/.vitepress/theme/HeroBenchmarks.vue—BenchResultinterface updated with optional CI fields (displays only median for now — hero is a teaser).README.md— benchmark table refreshed with the new medians. Explicit note: "Median of N=10 runs; full 95% bootstrap confidence intervals on the benchmarks dashboard." Trailing paragraph updated to cite the statsig-tied wins.docs/public/benchmarks.json+docs/public/benchmarks-all.json— regenerated fromBENCH_RUNS=10 ./benchmarks/run.sh, includes all CI fields.Sanity checks
Bootstrap behavior verified on synthetic data:
CI-overlap tie logic verified on real shapes:
The full Python test harness used to calibrate the values is preserved in the commit as comments in
assemble_json.py.Trade-offs
Pros:
Cons:
docs/public/benchmarks.jsonschema gained four optional fields (ci_lo,ci_hi,ci_label,n). Consumers that only look atvalue/label(including the current HeroBenchmarks.vue) keep working unchanged.Reproduce
Full sample arrays are committed in
docs/public/benchmarks.json's per-language results — anyone can re-run the bootstrap offline to verify.