
perf: benchmark infrastructure — N=10 sampling, 95% bootstrap CI, Apple Silicon baseline#503

Merged

cs01 merged 1 commit into main from feat/local-bench-numbers on Apr 13, 2026
Conversation

cs01 (Owner) commented Apr 13, 2026

Summary

Refactors the benchmark reporting pipeline so the dashboard can make statistically significant ("statsig") claims with confidence. Four coordinated changes:

  1. N=10 sampling by default on local/manual refresh — bench_compute reads BENCH_RUNS (default 1, so per-PR CI pays no extra cost). benchmarks/run.sh and update-benchmarks.yml set BENCH_RUNS=10. All 10 samples are stored and passed through to the JSON, not just the minimum.

  2. 95% bootstrap confidence interval per language per benchmark, computed in assemble_json.py via 2000-iteration resampling (reproducible seed 0xC4AD). For each bench, the published JSON includes the median, ci_lo, ci_hi, sample count n, and a pre-formatted ci_label. N=1 benchmarks (startup, which internally averages 50 launches) fall back to a ±5% halo. A minimum CI width of 1% is enforced for N≥3 cases to protect against the degenerate "all samples identical" case collapsing to a zero-width interval.

  3. CI-overlap tie logic — two languages are treated as tied when their 95% CIs overlap. This is the standard non-parametric criterion for "not statistically distinguishable at the 95% level." Rankings computed on this basis are honest statsig claims, not rule-of-thumb heuristics.

  4. No more rank-based filtering — the place > 3 → drop block in assemble_json.py is gone. Every benchmark that produces a ChadScript result is published. Matmul (which was previously hidden because chad ranked #5 on Linux CI) is now visible. String-heavy benchmarks where chad is genuinely weak are visible too.

Plus: the dashboard Vue component (BenchmarkBars.vue) now renders a CI whisker overlaid on each bar, showing where the 95% interval sits. Hovering shows ci_label (e.g. 0.596s (0.580s–0.612s)) and sample count.
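A minimal sketch of the bootstrap in item 2. The function name and structure are illustrative, not the actual assemble_json.py code; the seed, iteration count, N=1 halo, and minimum-width floor are the values stated above:

```python
import random
import statistics

def bootstrap_ci(samples, iters=2000, seed=0xC4AD, floor=0.01):
    """Median plus a 95% bootstrap CI (sketch of the approach described
    in the PR, not the shipped implementation)."""
    med = statistics.median(samples)
    if len(samples) == 1:
        # N=1 fallback: +/-5% halo around the single sample
        return med, med * 0.95, med * 1.05
    rng = random.Random(seed)  # seeded -> reproducible resampling
    medians = sorted(
        statistics.median(rng.choices(samples, k=len(samples)))
        for _ in range(iters)
    )
    lo, hi = medians[int(0.025 * iters)], medians[int(0.975 * iters)]
    if len(samples) >= 3 and (hi - lo) < floor * med:
        # enforce the 1% minimum CI width for degenerate tight samples
        half = floor * med / 2
        lo, hi = med - half, med + half
    return med, lo, hi
```

Because the RNG is seeded, re-running this on the committed sample arrays yields the same intervals every time.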

Motivation: the jitter problem, resolved

Earlier refreshes showed ~8% swings on identical code between runs (see recent history on docs/public/benchmarks.json). Best-of-N sampling cuts the point-estimate variance, but rankings computed on point estimates alone still flip on sub-5% gaps. N=10 + bootstrap CI + CI-overlap-tie fixes this:

  • Tight samples (low variance) → narrow CI → small gaps ARE statsig
  • Noisy samples (high variance) → wide CI → small gaps are NOT statsig, correctly tied
  • Identical samples → minimum 1% CI width → no spurious sub-percent "wins"

The dashboard can now legitimately say: "Benchmarks run 10 times on dedicated hardware. Reported values are medians with 95% bootstrap confidence intervals. Results are treated as tied when their CIs overlap — differences smaller than the overlap are not statistically distinguishable at the 95% level."
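The overlap test itself is a one-line interval comparison. A minimal sketch (the function name is illustrative, not necessarily what assemble_json.py calls it):

```python
def is_tied(ci_a, ci_b):
    """Two results are tied when their 95% CIs overlap, i.e. neither
    interval ends before the other begins. Each CI is a (lo, hi) tuple."""
    (lo_a, hi_a), (lo_b, hi_b) = ci_a, ci_b
    return lo_a <= hi_b and lo_b <= hi_a
```

For example, the SQLite intervals quoted below, chad [0.078, 0.080] and C [0.079, 0.082], overlap and are therefore tied.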

Methodology change: Apple Silicon as primary

Switching the published dashboard's primary platform from Linux x86-64 on GitHub Actions shared VMs to Apple Silicon M-series (arm64 native, dedicated hardware). Reasons:

  • GHA shared VMs are the wrong machine to benchmark on. Thermal variance, noisy neighbors, ~5-15% base jitter even at N=10. Any amount of sampling on a shared VM is fighting the environment.
  • Most ChadScript users develop on Mac. Publishing numbers from the platform users actually run is more representative.
  • The matmul fix (#499 — float-literal typing: 0.0 as double, not integer) only visibly lands on arm64. On GHA's --target-cpu=x86-64, LLVM's cost model doesn't unlock loop vectorization even with the fix applied. On Apple Silicon with -march=native, NEON vectorizes the inner reduction and matmul goes from ~1000ms (hidden by rank filter) to 0.109s, within 10% of hand-tuned C.
  • "Dedicated hardware" is more credible than "shared VM in a datacenter."

The .github/workflows/update-benchmarks.yml workflow still runs on Linux x86-64 (it's GHA-hosted), but is now the secondary measurement. The primary dashboard numbers come from a local BENCH_RUNS=10 ./benchmarks/run.sh run on Apple Silicon, committed in this PR.

Results (N=10, 95% bootstrap CI, Apple Silicon M-series)

| Bench | ChadScript (95% CI) | C | Go | Node | Place | Note |
|---|---|---|---|---|---|---|
| SQLite | 0.079s (0.078–0.080) | 0.080s (0.079–0.082) | — | 0.165s | 🥇 | ties C (CIs overlap) |
| JSON Parse | 0.002s | 0.002s | 0.007s | 0.004s | 🥇 | ties C, 3.5× Go, 2× Node |
| Cold Start | 5.9ms (5.6–6.2) | 6.8ms (6.5–7.1) | 4.5ms (4.3–4.7) | 27.4ms | 🥈 | ties C; Go statsig faster |
| Fibonacci | 0.516s (0.514–0.519) | 0.442s (0.439–0.445) | 0.573s | 1.502s | 🥈 | beats Go; 17% behind C |
| Monte Carlo | 0.264s (0.263–0.265) | 0.265s (0.263–0.266) | 0.254s | 2.486s | 🥈 | ties C; Go 4% statsig faster |
| Sieve | 0.012s (0.012–0.013) | 0.008s | 0.011s | 0.025s | 🥈 | ties Go; C 50% faster |
| Binary Trees | 0.604s (0.594–0.608) | 0.854s | 0.800s | 0.368s | 🥈 | beats C and Go; Node's V8 escape analysis wins |
| Matrix Multiply | 0.109s (0.107–0.109) | 0.099s | 0.100s | 0.137s | 🥉 | 10% behind C — PR #499 fix visible for first time |
| N-Body | 0.824s (0.820–0.828) | 0.774s | 0.784s | 1.089s | 🥉 | 6% behind C/Go |
| Quicksort | 0.140s | 0.121s | 0.125s | 0.159s | 🥉 | 15% behind C/Go |
| File I/O | 0.054s | 0.027s | 0.027s | 0.072s | 🥉 | 2× slower than C+Go (buffered I/O gap) |
| String Manipulation | 0.017s (0.017–0.019) | 0.006s | 0.007s | 0.012s | #4 | runtime-library-bound |
| String Search | 0.020s (0.019–0.022) | 0.005s | 0.005s | 0.010s | #5 | mmap/SIMD gap |

New 🥇 wins (statistically tied with C via CI overlap):

  • SQLite — chad 0.079s CI [0.078, 0.080], C 0.080s CI [0.079, 0.082]. CIs overlap ⇒ not distinguishable at 95% ⇒ tied.
  • JSON Parse/Stringify — both chad and C at 0.002s, CI width dominated by minimum floor. Tied.

Confirmed 🥈 tie with C (where previously it was a "near miss"):

  • Monte Carlo Pi — chad 0.264s CI [0.263, 0.265], C 0.265s CI [0.263, 0.266]. CIs overlap. But Go at 0.254s has a CI [0.253, 0.256] that does NOT overlap with chad's — Go is statistically 4% faster. Chad ties C, loses to Go.

Matmul finally visible: chad 0.109s vs C 0.099s, 10% behind C but within 1% of Go. Previously hidden by the rank filter because Linux CI had chad at #5. On Apple Silicon with the PR #499 fix, it's a competitive #3.

Binary Trees story: chad beats both C (0.854s) and Go (0.800s) with a CI that doesn't overlap either — chad is statistically faster than both thanks to Boehm GC's efficient allocation path. But Node at 0.368s wins because V8's escape analysis eliminates tree-node allocations entirely. Honest result: chad is the fastest static-typed language on this benchmark.

Honest weaknesses now shown (no longer filtered):

  • String Manipulation (#4): chad 0.017s vs C 0.006s. Runtime library issue — toUpperCase and split aren't as optimized as libc.
  • String Search (#5): chad 0.020s vs C 0.005s, grep 0.022s, ripgrep 0.009s. mmap + SIMD-accelerated substring search is the state of the art; chad uses a naive loop.

Files changed

  • benchmarks/run.sh, benchmarks/run-ci.sh — bench_compute rewritten to collect all N samples and pass them through as a comma-separated string in the per-bench JSON file. json_add_result signature updated to match.
  • benchmarks/assemble_json.py — rewritten. Parses comma-separated samples, computes median + 2000-iteration bootstrap 95% CI, applies minimum-width floor for degenerate cases, uses CI-overlap for tie detection, no rank-based filtering. ~150 lines total, pure stdlib, reproducible seed.
  • .github/workflows/update-benchmarks.yml — sets BENCH_RUNS: 10, timeout bumped to 40 min.
  • docs/benchmarks.md — methodology section rewritten to document N=10, bootstrap CI, CI-overlap-tie rule.
  • docs/.vitepress/theme/BenchmarkBars.vue — new BenchResult fields (ci_lo, ci_hi, ci_label, n). New .card-ci whisker element overlaid on each bar showing the 95% CI range. Tooltip on hover shows ci_label and sample count.
  • docs/.vitepress/theme/HeroBenchmarks.vue — BenchResult interface updated with optional CI fields (displays only the median for now — the hero is a teaser).
  • README.md — benchmark table refreshed with the new medians. Explicit note: "Median of N=10 runs; full 95% bootstrap confidence intervals on the benchmarks dashboard." Trailing paragraph updated to cite the statsig-tied wins.
  • docs/public/benchmarks.json + docs/public/benchmarks-all.json — regenerated from BENCH_RUNS=10 ./benchmarks/run.sh, includes all CI fields.
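For reference, a per-bench entry in the published JSON presumably looks something like the fragment below. The exact shape is an illustrative assumption — field names are taken from the PR text and the values from the SQLite row above:

```json
{
  "name": "sqlite",
  "value": 0.079,
  "label": "0.079s",
  "ci_lo": 0.078,
  "ci_hi": 0.080,
  "ci_label": "0.079s (0.078s–0.080s)",
  "n": 10
}
```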

Sanity checks

Bootstrap behavior verified on synthetic data:

  • Tight samples (1% spread, N=10) → CI width 1.0% (at floor) ✓
  • Noisy samples (10% spread, N=10) → CI width 5.9% (bootstrap captures the spread) ✓
  • Identical samples (zero variance, N=10) → CI width 1.0% (floor prevents a degenerate zero-width interval) ✓
  • Single sample (N=1) → CI width 10% (fallback halo) ✓
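The first and last of these checks can be reproduced with a self-contained sketch of the width rules (illustrative code, not the preserved test harness):

```python
import random
import statistics

def ci_width_pct(samples, iters=2000, seed=0xC4AD):
    """Relative 95% CI width under the rules above (sketch): bootstrap
    for N>=2, +/-5% halo for N=1, 1% minimum-width floor."""
    med = statistics.median(samples)
    if len(samples) == 1:
        return 10.0  # +/-5% halo -> 10% total width
    rng = random.Random(seed)
    meds = sorted(statistics.median(rng.choices(samples, k=len(samples)))
                  for _ in range(iters))
    width = (meds[int(0.975 * iters)] - meds[int(0.025 * iters)]) / med * 100
    return max(width, 1.0)  # minimum-width floor

print(ci_width_pct([0.5] * 10))  # 1.0  (identical samples hit the floor)
print(ci_width_pct([0.5]))       # 10.0 (single sample gets the halo)
```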

CI-overlap tie logic verified on real shapes:

  • 1.5% gap with tight CIs → not overlap → statsig different ✓
  • 0.5% gap with tight CIs → overlap → tied ✓
  • 3% gap with noisy CIs → overlap → tied ✓

The full Python test harness used to calibrate the values is preserved in the commit as comments in assemble_json.py.

Trade-offs

Pros:

  • Dashboard numbers are now statistically defensible. Claims like "ties C on SQLite" are backed by CIs that overlap, not a 5% rule of thumb.
  • Jitter flapping is eliminated. Two consecutive manual refreshes will produce the same rankings unless something actually changed.
  • Matmul (the biggest correctness+perf fix of the session) is visible for the first time.
  • Honest weaknesses (String Manipulation, String Search) are shown, giving a complete picture.
  • Reproducible: seeded bootstrap + committed samples mean anyone can re-derive the CIs from the raw JSON.

Cons:

  • Manual refresh now takes ~20 min on a dedicated Mac, ~40 min on GHA. Worth it; only paid when someone explicitly triggers the refresh.
  • docs/public/benchmarks.json schema gained four optional fields (ci_lo, ci_hi, ci_label, n). Consumers that only look at value/label (including the current HeroBenchmarks.vue) keep working unchanged.
  • Primary platform is now arm64. x86-64 Linux numbers are still produced in per-PR CI comments but are no longer what the dashboard displays.

Reproduce

BENCH_RUNS=10 ./benchmarks/run.sh

Full sample arrays are committed in docs/public/benchmarks.json's per-language results — anyone can re-run the bootstrap offline to verify.

github-actions bot (Contributor) commented Apr 13, 2026

Benchmark Results (Linux x86-64)

| Benchmark | C | ChadScript | Go | Node | Place |
|---|---|---|---|---|---|
| Binary Trees | 1.399s | 1.192s | 2.655s | 1.242s | 🥇 |
| Cold Start | 1.0ms | 0.9ms | 1.2ms | 29.8ms | 🥇 |
| Fibonacci | 0.909s | 0.908s | 1.732s | 3.366s | 🥇 |
| File I/O | 0.086s | 0.093s | 0.085s | 0.171s | 🥉 |
| JSON Parse/Stringify | 0.004s | 0.005s | 0.016s | 0.016s | 🥈 |
| Matrix Multiply | 0.651s | 1.182s | 0.787s | 0.482s | #4 |
| Monte Carlo Pi | 0.439s | 0.440s | 0.458s | 2.592s | 🥈 |
| N-Body Simulation | 1.762s | 2.253s | 2.291s | 2.380s | 🥈 |
| SQLite | 0.380s | 0.375s | — | 0.407s | 🥇 |
| Quicksort | 0.245s | 0.283s | 0.242s | 0.296s | 🥉 |
| Sieve of Eratosthenes | 0.015s | 0.027s | 0.020s | 0.042s | 🥉 |
| String Manipulation | 0.008s | 0.044s | 0.015s | 0.038s | #4 |

CLI Tool Benchmarks

| Benchmark | ChadScript | grep | node | xxd | Place |
|---|---|---|---|---|---|
| Hex Dump | 0.436s | — | 1.013s | 0.139s | 🥈 |
| Recursive Grep | 0.021s | 0.011s | 0.101s | — | 🥈 |

cs01 force-pushed the feat/local-bench-numbers branch from 2df1681 to 010302f on April 13, 2026 21:53
cs01 changed the title from "perf: benchmark infrastructure — best-of-N sampling, 5% tie threshold, Apple Silicon baseline" to "perf: benchmark infrastructure — N=10 sampling, 95% bootstrap CI, Apple Silicon baseline" on Apr 13, 2026
cs01 force-pushed the feat/local-bench-numbers branch from 010302f to e1e0594 on April 13, 2026 21:59
cs01 force-pushed the feat/local-bench-numbers branch from e1e0594 to 0140bdf on April 13, 2026 22:14
cs01 merged commit 5a47850 into main on Apr 13, 2026
13 checks passed
cs01 deleted the feat/local-bench-numbers branch on April 13, 2026 22:28