Skip to content

perf: direct i1 branch propagation for integer comparisons and compound loop guards#508

Closed
cs01 wants to merge 1 commit intomainfrom
perf/int-compare-direct
Closed

perf: direct i1 branch propagation for integer comparisons and compound loop guards#508
cs01 wants to merge 1 commit intomainfrom
perf/int-compare-direct

Conversation

@cs01
Copy link
Copy Markdown
Owner

@cs01 cs01 commented Apr 13, 2026

Two related lowering fixes for the branch-condition path. Both are correctness-preserving cleanups that kill dead conversions in the emitted LLVM IR — not performance transforms in the usual sense, just removing wasted code the optimizer would have liked to eliminate but couldn't.

Fix 1: wantsI1 fast-path in Uint8Array + charAt comparison optimizations

`BinaryExpressionGenerator` has two fast-path helpers that recognize specific comparison shapes:

  • `tryOptimizeUint8ArrayComparison` — `flags[i] === 1`, `byte[j] < 128`, etc. on `Uint8Array` operands. Emits a direct `icmp eq i8` against the byte value instead of loading through the generic numeric path.
  • `tryOptimizeCharAtComparison` — `str.charAt(i) === 'a'` style. Emits a direct `icmp eq i8` against the char byte after an inline bounds-checked load.

Both helpers correctly produce an `i1` from the `icmp`, but neither honored the global `wantsI1` flag that tells a comparison "your result is going directly into a branch condition, skip the i32/double round-trip." So even though the comparison was already a boolean ready to feed `br i1`, they would emit:

```llvm
%81 = icmp eq i8 %80, 1 ; the useful part
%82 = zext i1 %81 to i32 ; ← dead
%83 = sitofp i32 %82 to double ; ← dead
%84 = fcmp one double %83, 0.0 ; ← dead
br i1 %84, label %then, ... ; ← the branch we wanted three instructions ago
```

Every other comparison path in `BinaryExpressionGenerator` checks `getWantsI1()` at the end and returns the raw `i1` when a branch wants it (see `generateNumericComparison` at line ~573). These two fast paths were missed.

Fix is 4 lines in each helper: check `getWantsI1()` before the zext/sitofp chain, and if set, return the raw i1 directly. The `tryOptimizeCharAtComparison` case also required restructuring its three-way `phi` to merge `i1` values instead of `i32`, since the phi operands need to match the new return type.

Post-fix IR for `if (flags[p] === 1) { ... }` in the sieve benchmark:

```llvm
%81 = icmp eq i8 %80, 1 ; single icmp
br i1 %81, label %then12, ... ; direct branch
```

Two instructions, not five.

Fix 2: `isSimpleComparisonForBranch` was too conservative on compound operands

`generateBranchCondition` in `control-flow.ts` has a fast path that recognizes simple comparison expressions and sets `wantsI1=true` before recursing into `generateExpression`, so the final comparison returns a raw `i1` for the branch. The predicate `isSimpleComparisonForBranch` decided which expressions qualified.

It bailed on any nested binary operand — i.e. `while (p * p <= LIMIT)` fell through to the slow path because `p * p` is a `binary` node. The slow path generates the comparison as an expression (returning a `double` from `generateNumericComparison`'s zext+sitofp branch) and then `convertToBool`'s that double back to `i1` via another `fcmp one 0.0`. Six extra instructions per loop iteration for an entirely dead conversion.

Replaced the flat "reject binary" check with a recursive `isPureSubexprForBranch` walker. A sub-expression is pure for the branch path iff it does not consume the `wantsI1` flag itself — which means no nested comparisons, no logical ops (`&&`/`||`/`??`), no conditionals, no calls, no arrow functions, no awaits. Pure arithmetic (mul/add/sub/shifts), unary minus/bitnot, variable reads, member/index accesses, and literals all pass.

The walker is deliberately conservative about `unary !` (which wraps a comparison that would consume wantsI1) and `binary` comparison/logical ops (which would also consume wantsI1 before the outer comparison sees it). For everything else, wantsI1 propagates cleanly.

Post-fix IR for sieve's outer loop `while (p * p <= LIMIT)`:

```llvm
while_cond9:
%65 = load i64, i64* %p.addr.3
%66 = load i64, i64* %p.addr.3
%67 = mul i64 %65, %66
%68 = load double, double* @limit
%69 = sitofp i64 %67 to double
%70 = fcmp ole double %69, %68
br i1 %70, label %while_body10, label %while_end11
```

Previously this had three extra instructions (`zext i1 → i32`, `sitofp i32 → double`, `fcmp one double, 0.0`) between the `fcmp ole` and the `br`. Gone now.

Honest note on measurable impact

Both fixes are structurally correct but the measurable perf impact on the current benchmark suite is small, and I want to be upfront about that rather than oversell:

  • OoO execution on Apple Silicon's M-series (~600-instruction reorder buffer) was already hiding most of these dead conversions because they sit on well-predicted branches behind an FMA/cmp that takes longer to retire. The instructions were real but they were being scheduled in the shadow of slower ops.
  • The sieve benchmark runs each pattern 10M times, which sounds like a lot, but each iteration saves ~2ns in the best case — and GHA-style per-run variance swamps that at N=10.
  • The real sieve bottleneck elsewhere in the benchmark is `flags[m] = 0` in the marking loop, which goes through `i64 → double → i32 → i8` for a literal zero store (separate frontend lowering bug), plus bounds checks on the marking loop. Neither is addressed here.

So: expect sieve to move somewhere between 0% and 3% on the dashboard, probably landing inside noise. The real value of this PR is cleaner IR that makes downstream LLVM optimizations easier and removes a class of "future we'll fix this" footguns where a new comparison fast path gets added and forgets to honor `wantsI1`.

What this PR does NOT fix

  • The `LIMIT` module constant being stored as `double` when it's obviously an integer literal, forcing a `sitofp` in every comparison against it. That's a frontend type inference issue.
  • The `flags[m] = 0` store going through four conversions for a literal zero. Separate lowering bug, worth its own PR.
  • Bounds checks on the sieve marking loop (`while (m <= LIMIT) flags[m] = 0`) — requires cross-statement value-range reasoning to prove `m` is within `flags.length`.
  • The `while (j <= LIMIT)` counting loop (10M iters) still does `sitofp i64 → double` + `fcmp` because `LIMIT` is `double`. Same root cause as item 1.

These are all on the perf follow-up roadmap; this PR is the small correct cleanup that generalizes most broadly.

Tests

```
✔ 777/777 pass
suites 37
duration 65.7s
```

All three of the existing `tryOptimizeUint8ArrayComparison` / `tryOptimizeCharAtComparison` / `isSimpleComparisonForBranch` code paths are exercised in the test corpus, and none needed updates. The fixes are strictly subtractive — they replace emitted IR with a shorter equivalent that produces the same branch outcome.

Measurement

  • Sieve: no statsig change (11-12ms either way, within N=10 CI noise)
  • All other benchmarks unchanged
  • Self-compile time: unchanged

Scope

Two files:

  • `src/codegen/expressions/operators/binary.ts` — `tryOptimizeUint8ArrayComparison` and `tryOptimizeCharAtComparison` now honor `wantsI1`
  • `src/codegen/statements/control-flow.ts` — `isSimpleComparisonForBranch` now accepts pure-arithmetic sub-expressions via new `isPureSubexprForBranch` walker

Diff: +83 / −12 across two files. No new dependencies, no new runtime code.

@github-actions
Copy link
Copy Markdown
Contributor

Benchmark Results (Linux x86-64)

Benchmark C ChadScript Go Node Place
Binary Trees 1.593s 1.289s 2.816s 1.212s 🥈
Cold Start 0.9ms 0.9ms 1.3ms 29.5ms 🥇
Fibonacci 0.815s 0.763s 1.561s 3.149s 🥇
File I/O 0.118s 0.094s 0.089s 0.208s 🥈
JSON Parse/Stringify 0.004s 0.005s 0.017s 0.016s 🥈
Matrix Multiply 0.453s 0.748s 0.583s 0.368s #4
Monte Carlo Pi 0.403s 0.410s 0.405s 2.248s 🥉
N-Body Simulation 1.672s 2.130s 2.206s 2.399s 🥈
Quicksort 0.219s 0.247s 0.213s 0.262s 🥉
SQLite 0.350s 0.369s 0.479s 🥈
Sieve of Eratosthenes 0.014s 0.028s 0.020s 0.041s 🥉
String Manipulation 0.008s 0.018s 0.017s 0.035s 🥉

CLI Tool Benchmarks

Benchmark ChadScript grep node xxd Place
Hex Dump 0.429s 1.017s 0.131s 🥈
Recursive Grep 0.019s 0.010s 0.099s 🥈

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant