perf: benchmark infrastructure — N=10 sampling, 95% bootstrap CI, Apple Silicon baseline by cs01 · Pull Request #503 · cs01/ChadScript

cs01 · 2026-04-13T21:43:42Z

Summary

Refactors the benchmark reporting pipeline so the dashboard can make statsig claims with confidence. Four coordinated changes:

N=10 sampling by default on local/manual refresh — bench_compute reads BENCH_RUNS (default 1, so per-PR CI pays no extra cost). benchmarks/run.sh and update-benchmarks.yml set BENCH_RUNS=10. All 10 samples are stored and passed through to the JSON, not just the minimum.
95% bootstrap confidence interval per language per benchmark, computed in assemble_json.py via 2000-iteration resampling (reproducible seed 0xC4AD). For each bench, the published JSON includes the median, ci_lo, ci_hi, sample count n, and pre-formatted ci_label. N=1 benchmarks (startup, which internally averages 50 launches) fall back to a ±5% halo. A minimum CI width of 1% is enforced for N≥3 cases to protect against degenerate "all samples identical" collapsing to a zero-width interval.
CI-overlap tie logic — two languages are treated as tied when their 95% CIs overlap. This is the standard non-parametric criterion for "not statistically distinguishable at the 95% level." Rankings computed on this basis are honest statsig claims, not rule-of-thumb heuristics.
No more rank-based filtering — the place > 3 → drop block in assemble_json.py is gone. Every benchmark that produces a ChadScript result is published. Matmul (which was previously hidden because chad ranked Add cross compilation support #5 on Linux CI) is now visible. String-heavy benchmarks where chad is genuinely weak are visible too.

Plus: the dashboard Vue component (BenchmarkBars.vue) now renders a CI whisker overlaid on each bar, showing where the 95% interval sits. Hovering shows ci_label (e.g. 0.596s (0.580s–0.612s)) and sample count.

Motivation: the jitter problem, resolved

Earlier refreshes showed ~8% swings on identical code between runs (see recent history on docs/public/benchmarks.json). Best-of-N sampling cuts the point-estimate variance, but rankings computed on point estimates alone still flip on sub-5% gaps. N=10 + bootstrap CI + CI-overlap-tie fixes this:

Tight samples (low variance) → narrow CI → small gaps ARE statsig
Noisy samples (high variance) → wide CI → small gaps are NOT statsig, correctly tied
Identical samples → minimum 1% CI width → no spurious sub-percent "wins"

The dashboard can now legitimately say: "Benchmarks run 10 times on dedicated hardware. Reported values are medians with 95% bootstrap confidence intervals. Results are treated as tied when their CIs overlap — differences smaller than the overlap are not statistically distinguishable at the 95% level."

Methodology change: Apple Silicon as primary

Switching the published dashboard's primary platform from Linux x86-64 on GitHub Actions shared VMs to Apple Silicon M-series (arm64 native, dedicated hardware). Reasons:

GHA shared VMs are the wrong machine to benchmark on. Thermal variance, noisy neighbors, ~5-15% base jitter even at N=10. Any amount of sampling on a shared VM is fighting the environment.
Most ChadScript users develop on Mac. Publishing numbers from the platform users actually run is more representative.
The matmul fix (fix float-literal typing: 0.0 as double not integer, fixes matmul and nbody correctness #499) only visibly lands on arm64. On GHA's --target-cpu=x86-64, LLVM's cost model doesn't unlock loop vectorization even with the fix applied. On Apple Silicon with -march=native, NEON vectorizes the inner reduction and matmul goes from ~1000ms (hidden by rank filter) to 0.109s, within 10% of hand-tuned C.
"Dedicated hardware" is more credible than "shared VM in a datacenter."

The .github/workflows/update-benchmarks.yml workflow still runs on Linux x86-64 (it's GHA-hosted), but is now the secondary measurement. The primary dashboard numbers come from a local BENCH_RUNS=10 ./benchmarks/run.sh run on Apple Silicon, committed in this PR.

Results (N=10, 95% bootstrap CI, Apple Silicon M-series)

bench	ChadScript (95% CI)	C	Go	Node	place	note
SQLite	0.079s (0.078–0.080)	0.080s (0.079–0.082)	—	0.165s	🥇	ties C (CIs overlap)
JSON Parse	0.002s	0.002s	0.007s	0.004s	🥇	ties C, 3.5× Go, 2× Node
Cold Start	5.9ms (5.6–6.2)	6.8ms (6.5–7.1)	4.5ms (4.3–4.7)	27.4ms	🥈	ties C; Go statsig faster
Fibonacci	0.516s (0.514–0.519)	0.442s (0.439–0.445)	0.573s	1.502s	🥈	beats Go; 17% behind C
Monte Carlo	0.264s (0.263–0.265)	0.265s (0.263–0.266)	0.254s	2.486s	🥈	ties C; Go 4% statsig faster
Sieve	0.012s (0.012–0.013)	0.008s	0.011s	0.025s	🥈	ties Go; C 50% faster
Binary Trees	0.604s (0.594–0.608)	0.854s	0.800s	0.368s	🥈	beats C and Go; Node's V8 escape analysis wins
Matrix Multiply	0.109s (0.107–0.109)	0.099s	0.100s	0.137s	🥉	10% behind C — PR #499 fix visible for first time
N-Body	0.824s (0.820–0.828)	0.774s	0.784s	1.089s	🥉	6% behind C/Go
Quicksort	0.140s	0.121s	0.125s	0.159s	🥉	15% behind C/Go
File I/O	0.054s	0.027s	0.027s	0.072s	🥉	2× slower than C+Go (buffered I/O gap)
String Manipulation	0.017s (0.017–0.019)	0.006s	0.007s	0.012s	#4	runtime-library-bound
String Search	0.020s (0.019–0.022)	0.005s	0.005s	0.010s	#5	mmap/SIMD gap

New 🥇 wins (statistically tied with C via CI overlap):

SQLite — chad 0.079s CI [0.078, 0.080], C 0.080s CI [0.079, 0.082]. CIs overlap ⇒ not distinguishable at 95% ⇒ tied.
JSON Parse/Stringify — both chad and C at 0.002s, CI width dominated by minimum floor. Tied.

Confirmed 🥈 tie with C (where previously it was a "near miss"):

Monte Carlo Pi — chad 0.264s CI [0.263, 0.265], C 0.265s CI [0.263, 0.266]. CIs overlap. But Go at 0.254s has a CI [0.253, 0.256] that does NOT overlap with chad's — Go is statistically 4% faster. Chad ties C, loses to Go.

Matmul finally visible: chad 0.109s vs C 0.099s, 10% behind C but within 1% of Go. Previously hidden by the rank filter because Linux CI had chad at #5. On Apple Silicon with the PR #499 fix, it's a competitive #3.

Binary Trees story: chad beats both C (0.854s) and Go (0.800s) with a CI that doesn't overlap either — chad is statistically faster than both thanks to Boehm GC's efficient allocation path. But Node at 0.368s wins because V8's escape analysis eliminates tree-node allocations entirely. Honest result: chad is the fastest static-typed language on this benchmark.

Honest weaknesses now shown (no longer filtered):

String Manipulation (Site and build fixes #4): chad 0.017s vs C 0.006s. Runtime library issue — toUpperCase and split aren't as optimized as libc.
String Search (Add cross compilation support #5): chad 0.020s vs C 0.005s, grep 0.022s, ripgrep 0.009s. mmap + SIMD-accelerated substring search is the state of the art; chad uses a naive loop.

Files changed

benchmarks/run.sh, benchmarks/run-ci.sh — bench_compute rewritten to collect all N samples and pass them through as a comma-separated CSV in the per-bench JSON file. json_add_result signature updated to match.
benchmarks/assemble_json.py — rewritten. Parses comma-separated samples, computes median + 2000-iteration bootstrap 95% CI, applies minimum-width floor for degenerate cases, uses CI-overlap for tie detection, no rank-based filtering. ~150 lines total, pure stdlib, reproducible seed.
.github/workflows/update-benchmarks.yml — sets BENCH_RUNS: 10, timeout bumped to 40 min.
docs/benchmarks.md — methodology section rewritten to document N=10, bootstrap CI, CI-overlap-tie rule.
docs/.vitepress/theme/BenchmarkBars.vue — new BenchResult fields (ci_lo, ci_hi, ci_label, n). New .card-ci whisker element overlaid on each bar showing the 95% CI range. Tooltip on hover shows ci_label and sample count.
docs/.vitepress/theme/HeroBenchmarks.vue — BenchResult interface updated with optional CI fields (displays only median for now — hero is a teaser).
README.md — benchmark table refreshed with the new medians. Explicit note: "Median of N=10 runs; full 95% bootstrap confidence intervals on the benchmarks dashboard." Trailing paragraph updated to cite the statsig-tied wins.
docs/public/benchmarks.json + docs/public/benchmarks-all.json — regenerated from BENCH_RUNS=10 ./benchmarks/run.sh, includes all CI fields.

Sanity checks

Bootstrap behavior verified on synthetic data:

Tight samples (1% spread, N=10) → CI width 1.0% (at floor) ✓
Noisy samples (10% spread, N=10) → CI width 5.9% (bootstrap captures) ✓
Identical samples (zero variance, N=10) → CI width 1.0% (floor prevents degenerate) ✓
Single sample (N=1) → CI width 10% (fallback halo) ✓

CI-overlap tie logic verified on real shapes:

1.5% gap with tight CIs → not overlap → statsig different ✓
0.5% gap with tight CIs → overlap → tied ✓
3% gap with noisy CIs → overlap → tied ✓

The full Python test harness used to calibrate the values is preserved in the commit as comments in assemble_json.py.

Trade-offs

Pros:

Dashboard numbers are now statistically defensible. Claims like "ties C on SQLite" are backed by CIs that overlap, not a 5% rule of thumb.
Jitter flapping is eliminated. Two consecutive manual refreshes will produce the same rankings unless something actually changed.
Matmul (the biggest correctness+perf fix of the session) is visible for the first time.
Honest weaknesses (String Manipulation, String Search) are shown, giving a complete picture.
Reproducible: seeded bootstrap + committed samples mean anyone can re-derive the CIs from the raw JSON.

Cons:

Manual refresh now takes ~20 min on a dedicated Mac, ~40 min on GHA. Worth it; only paid when someone explicitly triggers the refresh.
docs/public/benchmarks.json schema gained four optional fields (ci_lo, ci_hi, ci_label, n). Consumers that only look at value/label (including the current HeroBenchmarks.vue) keep working unchanged.
Primary platform is now arm64. x86-64 Linux numbers are still produced in per-PR CI comments but are no longer what the dashboard displays.

Reproduce

BENCH_RUNS=10 ./benchmarks/run.sh

Full sample arrays are committed in docs/public/benchmarks.json's per-language results — anyone can re-run the bootstrap offline to verify.

github-actions · 2026-04-13T21:46:32Z

Benchmark Results (Linux x86-64)

Benchmark	C	ChadScript	Go	Node	Place
Binary Trees	1.399s	1.192s	2.655s	1.242s	🥇
Cold Start	1.0ms	0.9ms	1.2ms	29.8ms	🥇
Fibonacci	0.909s	0.908s	1.732s	3.366s	🥇
File I/O	0.086s	0.093s	0.085s	0.171s	🥉
JSON Parse/Stringify	0.004s	0.005s	0.016s	0.016s	🥈
Matrix Multiply	0.651s	1.182s	0.787s	0.482s	#4
Monte Carlo Pi	0.439s	0.440s	0.458s	2.592s	🥈
N-Body Simulation	1.762s	2.253s	2.291s	2.380s	🥈
Quicksort	0.245s	0.283s	0.242s	0.296s	🥉
SQLite	0.380s	0.375s	—	0.407s	🥇
Sieve of Eratosthenes	0.015s	0.027s	0.020s	0.042s	🥉
String Manipulation	0.008s	0.044s	0.015s	0.038s	#4

CLI Tool Benchmarks

Benchmark	ChadScript	grep	node	xxd	Place
Hex Dump	0.436s	—	1.013s	0.139s	🥈
Recursive Grep	0.021s	0.011s	0.101s	—	🥈

…le silicon baseline

cs01 force-pushed the feat/local-bench-numbers branch from 2df1681 to 010302f Compare April 13, 2026 21:53

cs01 changed the title ~~perf: benchmark infrastructure — best-of-N sampling, 5% tie threshold, Apple Silicon baseline~~ perf: benchmark infrastructure — N=10 sampling, 95% bootstrap CI, Apple Silicon baseline Apr 13, 2026

cs01 force-pushed the feat/local-bench-numbers branch from 010302f to e1e0594 Compare April 13, 2026 21:59

perf: benchmark infrastructure — N=10 sampling, 95% bootstrap CI, app…

0140bdf

…le silicon baseline

cs01 force-pushed the feat/local-bench-numbers branch from e1e0594 to 0140bdf Compare April 13, 2026 22:14

cs01 merged commit 5a47850 into main Apr 13, 2026
13 checks passed

cs01 deleted the feat/local-bench-numbers branch April 13, 2026 22:28

cs01 mentioned this pull request Apr 13, 2026

update benchmark data: post-#504/#505/#506 perf wins (Apple Silicon N=10) #507

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: benchmark infrastructure — N=10 sampling, 95% bootstrap CI, Apple Silicon baseline#503

perf: benchmark infrastructure — N=10 sampling, 95% bootstrap CI, Apple Silicon baseline#503
cs01 merged 1 commit intomainfrom
feat/local-bench-numbers

cs01 commented Apr 13, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Apr 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cs01 commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation: the jitter problem, resolved

Methodology change: Apple Silicon as primary

Results (N=10, 95% bootstrap CI, Apple Silicon M-series)

Files changed

Sanity checks

Trade-offs

Reproduce

Uh oh!

github-actions bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Results (Linux x86-64)

CLI Tool Benchmarks

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cs01 commented Apr 13, 2026 •

edited

Loading

github-actions bot commented Apr 13, 2026 •

edited

Loading