perf: lazy memory architecture — reduce per-pattern overhead 5-7x (#158)#160

Merged
kolkov merged 3 commits into main from feature/lazy-memory-architecture on Jun 15, 2026

kolkov (Contributor) commented Jun 15, 2026

Summary

Adopts the Rust regex crate's Cache separation model to reduce memory per compiled pattern. Addresses #158, where Coraza WAF (900 OWASP CRS patterns) saw 16x memory overhead vs stdlib; the target is roughly 3x.
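For anyone who wants to reproduce a per-pattern overhead figure like the one above, the following is a minimal sketch of one way to measure retained heap per compiled pattern. It uses only the standard library `regexp` and `runtime` packages; the helper name `heapPerPattern` and the synthetic patterns are assumptions for illustration, not part of this PR or of the OWASP CRS rule set.

```go
package main

import (
	"fmt"
	"regexp"
	"runtime"
)

// heapPerPattern (hypothetical helper) compiles n distinct patterns with the
// standard library and returns the approximate retained heap bytes per pattern.
func heapPerPattern(n int) uint64 {
	if n <= 0 {
		return 0
	}
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	compiled := make([]*regexp.Regexp, 0, n)
	for i := 0; i < n; i++ {
		// Synthetic patterns; real WAF rules are larger and more varied.
		pat := fmt.Sprintf(`(?i)rule%d[a-z]+\d{2,4}`, i)
		compiled = append(compiled, regexp.MustCompile(pat))
	}

	runtime.GC()
	runtime.ReadMemStats(&after)
	runtime.KeepAlive(compiled) // keep the patterns live until after the reading

	if after.HeapAlloc < before.HeapAlloc {
		return 0
	}
	return (after.HeapAlloc - before.HeapAlloc) / uint64(n)
}

func main() {
	fmt.Printf("stdlib: ~%d bytes per compiled pattern\n", heapPerPattern(900))
}
```

Running the same loop against another engine's compile function and dividing the two results gives an overhead ratio comparable in spirit to the 16x figure, though absolute numbers depend on the Go version and the pattern set.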

Changes:

  • PikeVM lazy init: NewPikeVMLazy() defers thread-queue and sparse-set allocation to the first search (~10 KB saved per unused PikeVM)
  • Shared DFA PikeVM: SetPikeVM() eliminates the duplicate PikeVM per DFA (~15-20 KB saved per DFA pattern)
  • Deferred SearchState: removed eager allocation at compile time (~15-50 KB saved per pattern)
  • Strategy-aware caches: newSearchState() allocates only the caches needed by the active strategy (30-70% fewer allocations)
  • DFA initCap 64→16: smaller initial maps/slices (~3 MB saved across 900 patterns)
  • CI: skip benchmark workflow for docs-only PRs
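The lazy-init idea in the first bullet can be sketched as follows. This is an illustration under assumptions, not the code in nfa/pikevm.go: the type `lazyVM` and its methods are hypothetical names, and the placeholder `search` does no real matching.

```go
package main

import "fmt"

// lazyVM is a hypothetical stand-in for a PikeVM-like engine. Its search
// buffers stay nil until the first search needs them.
type lazyVM struct {
	numStates int   // known at compile time and cheap to store
	queue     []int // thread queue, allocated on first use
	sparse    []int // sparse-set backing array, allocated on first use
}

// newLazyVM records only the sizes needed later and allocates no buffers.
func newLazyVM(numStates int) *lazyVM {
	return &lazyVM{numStates: numStates}
}

// ensureInit allocates the search buffers once, on the first search.
// It is not safe for concurrent use; this sketch assumes one goroutine per VM.
func (vm *lazyVM) ensureInit() {
	if vm.queue != nil {
		return
	}
	vm.queue = make([]int, 0, vm.numStates)
	vm.sparse = make([]int, vm.numStates)
}

// search is a placeholder for a real match loop; only the allocation
// behaviour matters here.
func (vm *lazyVM) search(input string) bool {
	vm.ensureInit()
	return len(input) > 0
}

// allocated reports whether the search buffers exist yet.
func (vm *lazyVM) allocated() bool { return vm.queue != nil }

func main() {
	vm := newLazyVM(128)
	fmt.Println(vm.allocated()) // false: nothing allocated at construction
	vm.search("abc")
	fmt.Println(vm.allocated()) // true: buffers created by the first search
}
```

A pattern whose strategy never reaches this engine never calls `search`, so it never pays for the buffers, which is where the per-unused-PikeVM saving would come from.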

Architecture (Rust model):

  • Compiled regex = immutable, shareable
  • Search state = mutable, per-thread, lazy
  • Strategy drives what gets allocated
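A minimal sketch of this split in Go, assuming a `sync.Pool` as the per-search state cache. The names `program`, `searchState`, and `matcher` are hypothetical, and the literal-counting search only stands in for a real engine.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// program is a hypothetical immutable "compiled pattern": nothing in it is
// written after construction, so goroutines can share it without locking.
type program struct {
	literal string
}

// searchState is hypothetical mutable scratch space used by one search at a time.
type searchState struct {
	positions []int
}

// matcher pairs the shared program with a pool of lazily created states.
type matcher struct {
	prog  *program
	cache sync.Pool
}

func newMatcher(literal string) *matcher {
	m := &matcher{prog: &program{literal: literal}}
	// New runs only when a search finds the pool empty, so a pattern that is
	// compiled but never searched allocates no searchState.
	m.cache.New = func() any { return &searchState{} }
	return m
}

// count returns the number of non-overlapping occurrences of the literal.
func (m *matcher) count(haystack string) int {
	if m.prog.literal == "" {
		return 0
	}
	st := m.cache.Get().(*searchState)
	defer m.cache.Put(st)
	st.positions = st.positions[:0] // reuse the buffer from an earlier search
	for off := 0; ; {
		i := strings.Index(haystack[off:], m.prog.literal)
		if i < 0 {
			break
		}
		st.positions = append(st.positions, off+i)
		off += i + len(m.prog.literal)
	}
	return len(st.positions)
}

func main() {
	m := newMatcher("ab")
	var wg sync.WaitGroup
	results := make([]int, 4)
	for g := range results {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			results[g] = m.count("abcabcab")
		}(g)
	}
	wg.Wait()
	fmt.Println(results) // [3 3 3 3]
}
```

The design choice is that only `searchState` is ever mutated, and each goroutine holds its own instance for the duration of a search, so the compiled side needs no synchronization.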

Test plan

  • go test ./... — all packages pass
  • golangci-lint run — no new issues
  • gofmt — all modified files clean
  • Quick benchmarks — no regression on BenchmarkFind
  • CI: tests (Linux/macOS/Windows) + benchmark comparison
  • CI: race detector (Linux)
  • regex-bench: Go + Rust comparison on AMD EPYC

Fixes #158

…158)

Adopt Rust regex Cache separation model to reduce memory overhead
from 16x to ~3x vs stdlib when compiling many patterns (WAF workloads).

Changes:
- PikeVM: add NewPikeVMLazy() with deferred internal state allocation
- DFA: share Engine PikeVM via SetPikeVM(), eliminate per-DFA clones
- SearchState: defer allocation to first search (not compile time)
- SearchState: strategy-aware cache allocation (skip unused engines)
- DFA cache: reduce initCap from 64 to 16 entries
- CI: skip benchmark workflow for docs-only PRs

For 900 OWASP CRS patterns, estimated savings:
- PikeVM lazy init: ~10 KB per unused PikeVM (CharClass, Teddy, AC)
- Shared DFA PikeVM: ~15-20 KB per DFA pattern
- Deferred SearchState: ~15-50 KB per pattern at compile time
- Strategy-aware caches: 30-70% fewer allocations per SearchState
- DFA initCap 64->16: ~3 MB across 900 patterns
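
The strategy-aware allocation item can be illustrated with a short sketch. The strategy names, cache types, and `newState` function below are hypothetical and do not mirror the library's real newSearchState(); only the initial capacity of 16 is taken from the change list.

```go
package main

import "fmt"

// strategy is a hypothetical enum of search strategies chosen at compile time.
type strategy int

const (
	strategyPrefilter strategy = iota // literal or prefilter only, no engine cache
	strategyNFA                       // PikeVM-style simulation
	strategyDFA                       // lazy DFA
)

// nfaCache and dfaCache are hypothetical per-search scratch structures.
type nfaCache struct{ threads []int }
type dfaCache struct{ states map[string]int }

// searchState holds only the caches the active strategy will touch;
// the others stay nil and cost nothing.
type searchState struct {
	nfa *nfaCache
	dfa *dfaCache
}

// dfaInitCap mirrors the reduced initial capacity (64 -> 16) from the change list.
const dfaInitCap = 16

// newState allocates caches according to the strategy instead of eagerly
// creating one of each kind.
func newState(s strategy, numStates int) *searchState {
	st := &searchState{}
	switch s {
	case strategyNFA:
		st.nfa = &nfaCache{threads: make([]int, 0, numStates)}
	case strategyDFA:
		st.dfa = &dfaCache{states: make(map[string]int, dfaInitCap)}
	}
	return st
}

func main() {
	for _, s := range []strategy{strategyPrefilter, strategyNFA, strategyDFA} {
		st := newState(s, 64)
		fmt.Println(st.nfa != nil, st.dfa != nil)
	}
}
```

In a real engine a DFA strategy may also need an NFA fallback cache; this sketch leaves that out to keep the allocation rule visible.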

Fixes #158

codecov Bot commented Jun 15, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 8 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| nfa/pikevm.go | 66.66% | 5 Missing and 1 partial ⚠️ |
| dfa/lazy/builder.go | 0.00% | 2 Missing ⚠️ |


Fixes maintidx lint: lowers cyclomatic complexity (previously 25) by extracting
the PikeVM sharing logic into a helper function.

github-actions Bot commented Jun 15, 2026

Benchmark Comparison

Comparing main → PR #160

Summary: geomean 81.49n → 76.71n (-5.87%)
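
As a sanity check on how the percentage columns in this report relate to the two timing columns, the relative change can be recomputed by hand. The helper name `percentChange` is an assumption for illustration; this is not the comparison tool's own code.

```go
package main

import "fmt"

// percentChange returns the relative change from before to after, in percent.
// Negative values mean the PR is faster than main.
func percentChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Geomean row: 81.49n on main, 76.71n on the PR.
	fmt.Printf("%+.2f%%\n", percentChange(81.49, 76.71)) // -5.87%
	// ASCIIOptimization_Issue79/short_WithoutASCII row: 302.4n vs 326.6n.
	fmt.Printf("%+.2f%%\n", percentChange(302.4, 326.6)) // +8.00%
}
```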

⚠️ Potential regressions detected:

Benchmark                                               main             PR #160          delta
MatchAnchoredLiteral/no_match_prefix-4                  2.581n ± ∞ ¹     2.604n ± ∞ ¹     +0.89% (p=0.008 n=5)
ASCIIOptimization_Issue79/short_WithoutASCII-4          302.4n ± ∞ ¹     326.6n ± ∞ ¹     +8.00% (p=0.008 n=5)
DNA_VsStdlib/stdlib/dna_4-4                             60.14m ± ∞ ¹     63.03m ± ∞ ¹     +4.80% (p=0.016 n=5)
LangArenaLogParser/ips-4                                48.13µ ± ∞ ¹     48.36µ ± ∞ ¹     +0.48% (p=0.032 n=5)
BranchDispatch_Stdlib/Digits-4                          132.1n ± ∞ ¹     132.5n ± ∞ ¹     +0.30% (p=0.048 n=5)
BranchDispatch_Stdlib/NoMatch-4                         78.14n ± ∞ ¹     78.97n ± ∞ ¹     +1.06% (p=0.008 n=5)

Full results available in workflow artifacts. CI runners have ~10-20% variance.
For accurate benchmarks, run locally: ./scripts/bench.sh --compare

kolkov merged commit 2812db7 into main on Jun 15, 2026
9 checks passed
kolkov deleted the feature/lazy-memory-architecture branch on June 15, 2026 15:20
Development

Successfully merging this pull request may close these issues.

Memory usage is tens of times higher than stdlib when used as Coraza WAF regex engine replacement