perf: direct i1 branch propagation for integer comparisons and compound loop guards by cs01 · Pull Request #508 · cs01/ChadScript

cs01 · 2026-04-13T23:10:59Z

Two related lowering fixes for the branch-condition path. Both are correctness-preserving cleanups that kill dead conversions in the emitted LLVM IR — not performance transforms in the usual sense, just removing wasted code the optimizer would have liked to eliminate but couldn't.

Fix 1: wantsI1 fast-path in Uint8Array + charAt comparison optimizations

`BinaryExpressionGenerator` has two fast-path helpers that recognize specific comparison shapes:

`tryOptimizeUint8ArrayComparison` — `flags[i] === 1`, `byte[j] < 128`, etc. on `Uint8Array` operands. Emits a direct `icmp eq i8` against the byte value instead of loading through the generic numeric path.
`tryOptimizeCharAtComparison` — `str.charAt(i) === 'a'` style. Emits a direct `icmp eq i8` against the char byte after an inline bounds-checked load.

Both helpers correctly produce an `i1` from the `icmp`, but neither honored the global `wantsI1` flag that tells a comparison "your result is going directly into a branch condition, skip the i32/double round-trip." So even though the comparison was already a boolean ready to feed `br i1`, they would emit:

```llvm
%81 = icmp eq i8 %80, 1 ; the useful part
%82 = zext i1 %81 to i32 ; ← dead
%83 = sitofp i32 %82 to double ; ← dead
%84 = fcmp one double %83, 0.0 ; ← dead
br i1 %84, label %then, ... ; ← the branch we wanted three instructions ago
```

Every other comparison path in `BinaryExpressionGenerator` checks `getWantsI1()` at the end and returns the raw `i1` when a branch wants it (see `generateNumericComparison` at line ~573). These two fast paths were missed.

Fix is 4 lines in each helper: check `getWantsI1()` before the zext/sitofp chain, and if set, return the raw i1 directly. The `tryOptimizeCharAtComparison` case also required restructuring its three-way `phi` to merge `i1` values instead of `i32`, since the phi operands need to match the new return type.

Post-fix IR for `if (flags[p] === 1) { ... }` in the sieve benchmark:

```llvm
%81 = icmp eq i8 %80, 1 ; single icmp
br i1 %81, label %then12, ... ; direct branch
```

Two instructions, not five.

Fix 2: `isSimpleComparisonForBranch` was too conservative on compound operands

`generateBranchCondition` in `control-flow.ts` has a fast path that recognizes simple comparison expressions and sets `wantsI1=true` before recursing into `generateExpression`, so the final comparison returns a raw `i1` for the branch. The predicate `isSimpleComparisonForBranch` decided which expressions qualified.

It bailed on any nested binary operand — i.e. `while (p * p <= LIMIT)` fell through to the slow path because `p * p` is a `binary` node. The slow path generates the comparison as an expression (returning a `double` from `generateNumericComparison`'s zext+sitofp branch) and then `convertToBool`'s that double back to `i1` via another `fcmp one 0.0`. Six extra instructions per loop iteration for an entirely dead conversion.

Replaced the flat "reject binary" check with a recursive `isPureSubexprForBranch` walker. A sub-expression is pure for the branch path iff it does not consume the `wantsI1` flag itself — which means no nested comparisons, no logical ops (`&&`/`||`/`??`), no conditionals, no calls, no arrow functions, no awaits. Pure arithmetic (mul/add/sub/shifts), unary minus/bitnot, variable reads, member/index accesses, and literals all pass.

The walker is deliberately conservative about `unary !` (which wraps a comparison that would consume wantsI1) and `binary` comparison/logical ops (which would also consume wantsI1 before the outer comparison sees it). For everything else, wantsI1 propagates cleanly.

Post-fix IR for sieve's outer loop `while (p * p <= LIMIT)`:

```llvm
while_cond9:
%65 = load i64, i64* %p.addr.3
%66 = load i64, i64* %p.addr.3
%67 = mul i64 %65, %66
%68 = load double, double* @limit
%69 = sitofp i64 %67 to double
%70 = fcmp ole double %69, %68
br i1 %70, label %while_body10, label %while_end11
```

Previously this had three extra instructions (`zext i1 → i32`, `sitofp i32 → double`, `fcmp one double, 0.0`) between the `fcmp ole` and the `br`. Gone now.

Honest note on measurable impact

Both fixes are structurally correct but the measurable perf impact on the current benchmark suite is small, and I want to be upfront about that rather than oversell:

OoO execution on Apple Silicon's M-series (~600-instruction reorder buffer) was already hiding most of these dead conversions because they sit on well-predicted branches behind an FMA/cmp that takes longer to retire. The instructions were real but they were being scheduled in the shadow of slower ops.
The sieve benchmark runs each pattern 10M times, which sounds like a lot, but each iteration saves ~2ns in the best case — and GHA-style per-run variance swamps that at N=10.
The real sieve bottleneck elsewhere in the benchmark is `flags[m] = 0` in the marking loop, which goes through `i64 → double → i32 → i8` for a literal zero store (separate frontend lowering bug), plus bounds checks on the marking loop. Neither is addressed here.

So: expect sieve to move somewhere between 0% and 3% on the dashboard, probably landing inside noise. The real value of this PR is cleaner IR that makes downstream LLVM optimizations easier and removes a class of "future we'll fix this" footguns where a new comparison fast path gets added and forgets to honor `wantsI1`.

What this PR does NOT fix

The `LIMIT` module constant being stored as `double` when it's obviously an integer literal, forcing a `sitofp` in every comparison against it. That's a frontend type inference issue.
The `flags[m] = 0` store going through four conversions for a literal zero. Separate lowering bug, worth its own PR.
Bounds checks on the sieve marking loop (`while (m <= LIMIT) flags[m] = 0`) — requires cross-statement value-range reasoning to prove `m` is within `flags.length`.
The `while (j <= LIMIT)` counting loop (10M iters) still does `sitofp i64 → double` + `fcmp` because `LIMIT` is `double`. Same root cause as item 1.

These are all on the perf follow-up roadmap; this PR is the small correct cleanup that generalizes most broadly.

Tests

```
✔ 777/777 pass
suites 37
duration 65.7s
```

All three of the existing `tryOptimizeUint8ArrayComparison` / `tryOptimizeCharAtComparison` / `isSimpleComparisonForBranch` code paths are exercised in the test corpus, and none needed updates. The fixes are strictly subtractive — they replace emitted IR with a shorter equivalent that produces the same branch outcome.

Measurement

Sieve: no statsig change (11-12ms either way, within N=10 CI noise)
All other benchmarks unchanged
Self-compile time: unchanged

Scope

Two files:

`src/codegen/expressions/operators/binary.ts` — `tryOptimizeUint8ArrayComparison` and `tryOptimizeCharAtComparison` now honor `wantsI1`
`src/codegen/statements/control-flow.ts` — `isSimpleComparisonForBranch` now accepts pure-arithmetic sub-expressions via new `isPureSubexprForBranch` walker

Diff: +83 / −12 across two files. No new dependencies, no new runtime code.

…nd loop guards

github-actions · 2026-04-13T23:14:01Z

Benchmark Results (Linux x86-64)

Benchmark	C	ChadScript	Go	Node	Place
Binary Trees	1.593s	1.289s	2.816s	1.212s	🥈
Cold Start	0.9ms	0.9ms	1.3ms	29.5ms	🥇
Fibonacci	0.815s	0.763s	1.561s	3.149s	🥇
File I/O	0.118s	0.094s	0.089s	0.208s	🥈
JSON Parse/Stringify	0.004s	0.005s	0.017s	0.016s	🥈
Matrix Multiply	0.453s	0.748s	0.583s	0.368s	#4
Monte Carlo Pi	0.403s	0.410s	0.405s	2.248s	🥉
N-Body Simulation	1.672s	2.130s	2.206s	2.399s	🥈
Quicksort	0.219s	0.247s	0.213s	0.262s	🥉
SQLite	0.350s	0.369s	—	0.479s	🥈
Sieve of Eratosthenes	0.014s	0.028s	0.020s	0.041s	🥉
String Manipulation	0.008s	0.018s	0.017s	0.035s	🥉

CLI Tool Benchmarks

Benchmark	ChadScript	grep	node	xxd	Place
Hex Dump	0.429s	—	1.017s	0.131s	🥈
Recursive Grep	0.019s	0.010s	0.099s	—	🥈

perf: direct i1 branch propagation for integer comparisons and compou…

6942e48

…nd loop guards

cs01 mentioned this pull request Apr 14, 2026

fix: popSafeIndexScope leaking entries across loop bodies in self-hosted build #510

Closed

cs01 closed this Apr 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: direct i1 branch propagation for integer comparisons and compound loop guards#508

perf: direct i1 branch propagation for integer comparisons and compound loop guards#508
cs01 wants to merge 1 commit intomainfrom
perf/int-compare-direct

cs01 commented Apr 13, 2026

Uh oh!

github-actions bot commented Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cs01 commented Apr 13, 2026

Fix 1: wantsI1 fast-path in Uint8Array + charAt comparison optimizations

Fix 2: `isSimpleComparisonForBranch` was too conservative on compound operands

Honest note on measurable impact

What this PR does NOT fix

Tests

Measurement

Scope

Uh oh!

github-actions bot commented Apr 13, 2026

Benchmark Results (Linux x86-64)

CLI Tool Benchmarks

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant