perf(codegen): O(n²)→O(n) ISA emit (fixes mlen=16 vision DNF), faster sim_env re-parse, roll RoPE+norm, rename ATEN_UNROLL→ATEN_OPS_UNROLL by booth-algo · Pull Request #56 · AICrossSim/PLENA_Compiler

booth-algo · 2026-06-02T21:52:24Z

Three host-time codegen/assembler perf fixes, each verified byte-identical or allclose-100%. None changes the modeled hardware latency, except for the small, documented loop overhead noted in (3).

1. ISA emit O(n²)→O(n) (aten/plena/isa_emit.py). IsaEmitMixin accumulated the generated ISA via self.generated_code += rendered, once per instruction: O(n²) total string copying that runs away at high instruction counts. mlen=16 vision emits 2.43M lines for a single layer, so the quadratic copy cost blew past the 30-minute timeout (the long-standing mlen=16 vision DNF). Backing the buffer with a list of rendered chunks (append + join, exposed through a generated_code property) makes each emit amortised O(1) and the whole emission O(n); the public type stays str and the output is byte-identical (verified: vision 32/32/4 regenerated to the byte at 599,202 lines). Result: mlen=16 vision compiles in ~52s and PASSES (allclose 100%), and isa_gen for mlen=32 vision, which already completed before this change, dropped from 59.2s to 2.3s.

2. Faster sim_env ASM re-parse (assembler/parser.py, assembler/assembly_to_binary.py). The sim_env phase re-parses the emitted ASM text back into binary; the hot loop redefined parse_reg_or_int plus the masked-op sets on every line, and _convert_to_binary rebuilt ~8 opcode lists per instruction. Hoisting those to module scope (the sets as frozensets) and preprocessing each line in a single pass makes it ~17% faster on large programs, with byte-identical .mem output (verified on a 1.99M-instruction vision program and a decoder 256 5L program across the full opcode mix).

3. Roll RoPE + normalization; rename ATEN_UNROLL → ATEN_OPS_UNROLL (asm_templates/rope_asm.py, asm_templates/normalization_asm.py, aten/plena/isa_compiler.py, aten/plena/compiler.py). RoPE collapses to one C_LOOP (its per-iteration address is a single k·vlen progression); RMS/LayerNorm roll their inner hidden-dim loops. Both are gated on the existing unroll flag and rolled by default — ATEN_OPS_UNROLL=1 reproduces the prior unrolled output byte-identically (verified via a fully-unrolled old-vs-new anchor and a unit-level template diff). This cuts emitted lines (decoder 256 1L −54%, vision −5%, decoder 16/16/4 −2%) with allclose 100% across vision/decoder/vlm-e2e. NB: rolling is not exactly sim_lat-neutral — the C_LOOP + S_ADDI_INT pointer advances replace baked address loads, so modeled cycles shift slightly (measured +0.08% on decoder 16/16/4). The env var is renamed to the clearer ATEN_OPS_UNROLL; the codegen-compare harnesses in PLENA_Simulator are renamed in the paired PR.
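For readers unfamiliar with the pattern in (1), here is a minimal sketch of a list-backed buffer exposed through a str property. The class name IsaBuffer, the emit method, the setter, and the INSTR_k placeholder lines are illustrative assumptions; only the generated_code name and the append + join idea come from the description above, and the real IsaEmitMixin may be structured differently.

```python
class IsaBuffer:
    """Hypothetical stand-in for the emitter's accumulation state."""

    def __init__(self) -> None:
        # Rendered chunks are kept separately; nothing is concatenated here.
        self._chunks: list[str] = []

    def emit(self, rendered: str) -> None:
        # Amortised O(1): appending never copies previously emitted text.
        self._chunks.append(rendered)

    @property
    def generated_code(self) -> str:
        # One join over all chunks, paid only when the text is requested.
        return "".join(self._chunks)

    @generated_code.setter
    def generated_code(self, value: str) -> None:
        # Assumed convenience: keeps `obj.generated_code = ...` and
        # `obj.generated_code += ...` callers working unchanged.
        self._chunks = [value]


buf = IsaBuffer()
for k in range(3):
    buf.emit(f"INSTR_{k}\n")  # placeholder lines, not real PLENA ISA
```

Callers that only read generated_code still see a plain str, which is what lets the output stay byte-identical while the accumulation cost drops. Reading the property inside the emit loop would reintroduce the quadratic cost, so this sketch assumes it is read once at the end.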

Pairs with the PLENA_Simulator docs + harness-rename PR. The compiler tests pass (68/70; the 2 failures are pre-existing and unrelated — confirmed by stashing these changes) and ruff is clean.
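The hoisting in (2) follows a common pattern, sketched below under stated assumptions. The opcode names, the ";" comment character, the "r7"-style register syntax, and the body of parse_reg_or_int are all hypothetical; only the function name and the idea of module-scope frozensets plus single-pass line preprocessing come from the description.

```python
# Built once at import time instead of once per parsed line.
# Placeholder mnemonics, not the real PLENA masked-op set.
MASKED_OPS = frozenset({"OP_MASKED_A", "OP_MASKED_B"})


def parse_reg_or_int(token: str) -> int:
    """Accept 'r7'-style register names or integer literals (assumed syntax)."""
    token = token.strip()
    if token[:1] in ("r", "R"):
        return int(token[1:])
    return int(token, 0)  # base 0 accepts both decimal and 0x-prefixed hex


def parse_program(text: str) -> list[tuple[str, tuple[int, ...], bool]]:
    parsed = []
    for raw in text.splitlines():
        # Single pass per line: drop the comment, strip, skip blanks.
        line = raw.split(";", 1)[0].strip()
        if not line:
            continue
        opcode, _, rest = line.partition(" ")
        operands = tuple(
            parse_reg_or_int(tok) for tok in rest.split(",") if tok.strip()
        )
        # Membership test against the hoisted frozenset is O(1).
        parsed.append((opcode, operands, opcode in MASKED_OPS))
    return parsed


example = "OP_PLAIN r1, r2, 16 ; comment\n\nOP_MASKED_A r3, 0x10\n"
result = parse_program(example)
```

Because the helper and the set are created once, the per-line work is limited to splitting and lookups, which is where a saving of this kind would come from on multi-million-line programs.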

…nv ATEN_UNROLL -> ATEN_OPS_UNROLL RoPE collapses to one C_LOOP (addr is a single k*vlen progression); RMS/LayerNorm roll their inner hidden-dim loops. Rolled by default (ATEN_OPS_UNROLL=1 forces the prior unrolled output, verified byte-identical). Cuts emitted lines (decoder 256 1L -54%, vision -5%, decoder 16/16/4 -2%); allclose 100%. Note: rolling is not sim_lat-neutral - the C_LOOP + S_ADDI_INT pointer advances replace baked address loads, shifting modeled cycles slightly (+0.08% on decoder 16/16/4).
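A toy model of the rolled versus unrolled trade-off described here, as a sketch only. The mnemonics PTR_INIT, LOOP_BEGIN, BODY_OP, PTR_ADD, and LOOP_END are invented placeholders rather than PLENA ISA (the real rolled form uses C_LOOP and S_ADDI_INT, whose operand syntax is not shown in this PR), and the way the flag is parsed from the environment is an assumption.

```python
import os

# Assumed parsing: only the literal "1" forces the unrolled form.
UNROLL = os.environ.get("ATEN_OPS_UNROLL", "0") == "1"


def emit_vector_pass(n_iters: int, vlen: int, unroll: bool) -> list[str]:
    """Emit one op per chunk at addresses 0, vlen, 2*vlen, ... (placeholder syntax)."""
    if unroll:
        # One line per iteration, with the address baked in.
        return [f"BODY_OP addr={k * vlen}" for k in range(n_iters)]
    # Rolled: a fixed number of lines regardless of n_iters (n_iters >= 1).
    return [
        "PTR_INIT ptr, 0",
        f"LOOP_BEGIN {n_iters}",
        "BODY_OP addr=[ptr]",
        f"PTR_ADD ptr, ptr, {vlen}",
        "LOOP_END",
    ]


def addresses_touched(lines: list[str]) -> list[int]:
    """Tiny interpreter for the placeholder syntax, to check both forms agree."""
    addrs, ptr, i = [], 0, 0
    loop_start, remaining = 0, 0
    while i < len(lines):
        op, _, arg = lines[i].partition(" ")
        if op == "PTR_INIT":
            ptr = int(arg.split(",")[1])
        elif op == "LOOP_BEGIN":
            loop_start, remaining = i + 1, int(arg)
        elif op == "BODY_OP":
            target = arg[len("addr="):]
            addrs.append(ptr if target == "[ptr]" else int(target))
        elif op == "PTR_ADD":
            ptr += int(arg.split(",")[2])
        elif op == "LOOP_END":
            remaining -= 1
            if remaining > 0:
                i = loop_start  # jump back to the first body line
                continue
        i += 1
    return addrs


rolled = emit_vector_pass(8, 4, unroll=False)
unrolled = emit_vector_pass(8, 4, unroll=True)
```

Both forms touch the same addresses, but the rolled form executes an extra pointer-advance line on every iteration. That is the kind of overhead that can shift modeled cycles slightly even though the emitted line count shrinks.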

booth-algo added 3 commits June 2, 2026 16:56

perf(isa): accumulate generated ISA in a list buffer (O(n^2)->O(n)); …

a82ce9c

…fixes mlen=16 vision codegen runaway (>30m DNF -> 52s)

perf(assembler): speed up ASM re-parse in sim_env (hoist parse helper…

ef2eb2b

…s/opcode sets out of per-line/per-instr loops, single-pass line preprocessing); byte-identical output, ~17% faster on large programs

booth-algo merged commit 74ff7a5 into main Jun 2, 2026
3 checks passed

booth-algo deleted the perf/isa-emit-list-buffer branch June 2, 2026 22:24

This was referenced Jun 2, 2026

fix(im2col): enable the V_SHIFT_V im2col path for non-64-aligned columns (native-dim vision) #57

Merged

fix(attention): correct causal mask for multi-tile decoder attention (seq_len > mlen) #58

Merged

booth-algo merged 3 commits into main from perf/isa-emit-list-buffer
