6 changes: 6 additions & 0 deletions .github/workflows/benchmarks.yml
@@ -14,6 +14,12 @@ on:
branches:
- main
- develop
paths-ignore:
- '**.md'
- 'docs/**'
- 'LICENSE'
- '.gitignore'
- '.github/FUNDING.yml'

permissions:
contents: read
23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -12,6 +12,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- ARM NEON SIMD support (Go 1.26 `simd/archsimd` intrinsics — [#120](https://github.com/coregx/coregex/issues/120))
- SIMD prefilter for CompositeSequenceDFA (#83)

## [0.12.22] - 2026-06-15

### Performance
- **Lazy memory architecture** — adopts Rust regex Cache separation model to reduce
per-pattern memory overhead from 16x to ~3x vs stdlib ([#158](https://github.com/coregx/coregex/issues/158)).
Unblocks WAF adoption (Coraza with 900 OWASP CRS patterns).

- **PikeVM lazy init** — `NewPikeVMLazy()` defers thread queues/sparse set allocation
to first search (~10 KB saved per unused PikeVM)
- **Shared DFA PikeVM** — `SetPikeVM()` eliminates duplicate PikeVM per DFA
(~15-20 KB per DFA pattern)
- **Deferred SearchState** — removed eager allocation at compile time
(~15-50 KB per pattern)
- **Strategy-aware caches** — `newSearchState()` only allocates caches needed by
active strategy (30-70% fewer allocations per SearchState)
- **DFA cache initCap 64→16** — smaller initial maps/slices (~3 MB across 900 patterns)
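The `initCap` bullet can be sanity-checked with a little arithmetic. The sketch below is illustrative only: `slotsSaved` is a hypothetical helper, and the stride and slot width are assumptions, since neither value appears in this changelog.

```go
package main

import "fmt"

// slotsSaved returns how many pre-sized transition slots are avoided when the
// initial capacity drops from oldCap to newCap states, each stride slots wide.
// Hypothetical helper, not part of coregex.
func slotsSaved(oldCap, newCap, stride int) int {
	return (oldCap - newCap) * stride
}

func main() {
	const (
		stride       = 32 // assumed alphabet length; the real value is pattern-dependent
		bytesPerSlot = 4  // assumed slot width; the real element type is not shown here
	)
	saved := slotsSaved(64, 16, stride)
	fmt.Printf("slots saved per cache: %d (about %d bytes)\n", saved, saved*bytesPerSlot)
}
```

Under these assumptions, each cache avoids 48 states' worth of pre-sized transition slots; the real saving depends on the pattern's alphabet size and on how many caches each pattern creates.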

### Added
- Open Collective sponsorship badges and Sponsors section in README

### Fixed
- CI: benchmark workflow now skips docs-only PRs (`paths-ignore` for `.md`, `docs/`, etc.)

## [0.12.21] - 2026-03-27

### Performance
9 changes: 6 additions & 3 deletions ROADMAP.md
@@ -2,7 +2,7 @@

> **Strategic Focus**: Production-grade regex engine with RE2/rust-regex level optimizations

**Last Updated**: 2026-03-25 | **Current Version**: v0.12.19 | **Target**: v1.0.0 stable
**Last Updated**: 2026-06-15 | **Current Version**: v0.12.22 | **Target**: v1.0.0 stable

---

@@ -100,8 +100,11 @@ v0.12.19 ✅ → Zero-alloc FindSubmatch, byte-based DFA cache, Rust-aligned vis
v0.12.20 ✅ → Premultiplied/tagged StateIDs, break-at-match DFA determinize,
Phase 3 elimination (2-pass bidirectional DFA)
v0.12.21 (Current) → Tagged start states, zero-alloc API (AllIndex iter.Seq),
1100x fewer mallocs, UseDFA for tiny NFA, -32% LangArena
v0.12.21 ✅ → Tagged start states, zero-alloc API (AllIndex iter.Seq),
1100x fewer mallocs, UseDFA for tiny NFA, -32% LangArena
v0.12.22 (Current) → Lazy memory architecture (Rust Cache model), 5-7x memory
reduction per pattern, WAF adoption unblocked (#158)
v1.0.0-rc → Feature freeze, API locked
18 changes: 18 additions & 0 deletions dfa/lazy/builder.go
@@ -494,6 +494,24 @@ func CompileWithPrefilter(n *nfa.NFA, config Config, pf prefilter.Prefilter) (*D
return dfa, nil
}

// SetPikeVM replaces the DFA's internal PikeVM with an externally-provided one.
// This enables sharing a single PikeVM between the Engine and its DFA(s),
// eliminating duplicate PikeVM allocations (~15-20 KB each for 100-state NFA).
//
// Issue #158: Each DFA (forward, reverse, strategy-specific) previously created
// its own PikeVM. With ~900 OWASP CRS patterns, many of which compile multiple
// DFAs, this was a major contributor to the 16x memory overhead vs stdlib.
//
// The provided PikeVM must be built from the same NFA (or a compatible variant)
// used to compile this DFA. Thread safety: PikeVM's Search methods use internal
// state, so the DFA's NFA fallback path is not thread-safe. However, in practice
// the meta layer always uses per-goroutine SearchState with its own PikeVM for
// actual searches, and the DFA's embedded PikeVM is only used during DFA-internal
// fallback within a single goroutine's search path.
func (d *DFA) SetPikeVM(pvm *nfa.PikeVM) {
	d.pikevm = pvm
}
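
The injection pattern behind `SetPikeVM` can be sketched in isolation. This is a minimal illustration with hypothetical stand-in types (`vm`, `dfa`, `engine`, `SetFallback`), not the coregex API; it assumes the engine keeps a reference to the same fallback VM that it hands to its forward DFA.

```go
package main

import "fmt"

// vm is a hypothetical stand-in for a fallback matcher such as a PikeVM.
type vm struct{ scratch []int }

// dfa owns a fallback VM by default, mirroring the pre-change behaviour.
type dfa struct{ fallback *vm }

func newDFA() *dfa { return &dfa{fallback: &vm{scratch: make([]int, 100)}} }

// SetFallback replaces the DFA's own VM with an externally provided one,
// so the previously allocated VM becomes garbage.
func (d *dfa) SetFallback(shared *vm) { d.fallback = shared }

// engine holds one VM and shares it with its forward DFA.
type engine struct {
	fallback *vm
	forward  *dfa
}

func newEngine() *engine {
	shared := &vm{scratch: make([]int, 100)}
	fwd := newDFA()
	fwd.SetFallback(shared)
	return &engine{fallback: shared, forward: fwd}
}

func main() {
	e := newEngine()
	fmt.Println("engine and DFA share one VM:", e.fallback == e.forward.fallback)
}
```

Because the shared VM carries mutable scratch state, a setter like this is only safe when, as the comment above describes, the fallback runs within a single goroutine's search path.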

// CompilePattern is a convenience function to compile a regex pattern directly to DFA.
// This combines NFA compilation and DFA construction.
//
12 changes: 7 additions & 5 deletions dfa/lazy/lazy.go
@@ -52,7 +52,7 @@ import (
// - An NFA (the source automaton) — immutable
// - A configuration — immutable
// - An optional prefilter for fast candidate finding — immutable
// - A PikeVM for NFA fallback — immutable (Search methods are safe)
// - A PikeVM for NFA fallback — shared with Engine (Issue #158)
// - ByteClasses for alphabet reduction — immutable
//
// Thread safety: The DFA struct is immutable after compilation and safe
@@ -64,7 +64,7 @@ type DFA struct {
nfa *nfa.NFA
config Config
prefilter prefilter.Prefilter
pikevm *nfa.PikeVM
pikevm *nfa.PikeVM // NFA fallback — may be shared with Engine (Issue #158)

// byteClasses maps bytes to equivalence classes for alphabet reduction.
// Bytes in the same class have identical transitions in all DFA states.
@@ -102,9 +102,11 @@ type DFA struct {
// - A stateList for O(1) state-by-ID lookup
// - A StartTable with the DFA's immutable byteMap
func (d *DFA) NewCache() *DFACache {
// Start small — grow on demand. Pre-allocating MaxStates (10,000) wastes
// ~400KB per cache and dominates cold-start cost for pooled caches.
const initCap = 64
// Start small — grow on demand during search. Most WAF patterns never
// exercise more than a handful of DFA states. Reducing from 64 to 16
// saves ~48 map entries + 48*stride flatTrans slots per cache.
// Issue #158: with ~900 OWASP CRS patterns, this alone saves ~3 MB.
const initCap = 16
stride := d.AlphabetLen()
return &DFACache{
states: make(map[StateKey]*State, initCap),
27 changes: 22 additions & 5 deletions meta/compile.go
@@ -603,7 +603,8 @@ func CompileRegexp(re *syntax.Regexp, config Config) (*Engine, error) {
// Initialize state pool for thread-safe concurrent searches
numCaptures := nfaEngine.CaptureCount()

ssCfg := buildSearchStateConfig(pikevmNFA, numCaptures, engines, strategy)
ssCfg := buildSearchStateConfig(pikevmNFA, numCaptures, engines, strategy, onePassRes != nil)
sharePikeVMWithDFAs(nfaEngine, engines)

eng := &Engine{
nfa: nfaEngine,
@@ -642,9 +643,12 @@ func CompileRegexp(re *syntax.Regexp, config Config) (*Engine, error) {
stats: Stats{},
}

// Eagerly create one SearchState and store it in the local GC-proof cache.
// This ensures the first search call doesn't allocate via sync.Pool.
eng.localState.Store(newSearchState(ssCfg))
// Issue #158: Defer SearchState allocation to first search.
// The localState cache is NOT populated at compile time. Instead, it will be
// lazily created on the first search call via getSearchState(). This saves
// ~15-50 KB per compiled pattern for WAF workloads where patterns may be
// compiled but never searched (e.g., pattern sets loaded at startup).
// The sync.Pool in statePool handles subsequent allocations efficiently.

return eng, nil
}
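
Deferring the SearchState to the first search can be sketched as follows. The diff does not show `getSearchState()`, so everything here is an assumption: `engine`, `compile`, `getState`, `putState` and the `created` counter are hypothetical names, and the single-slot cache in front of a `sync.Pool` is only one plausible way to arrange it.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// searchState is a hypothetical stand-in for per-search scratch memory.
type searchState struct{ buf []byte }

type engine struct {
	local   atomic.Pointer[searchState] // single-slot cache, left empty at compile time
	pool    sync.Pool                   // fallback for concurrent searches
	created atomic.Int64                // counts allocations, for illustration only
}

// compile builds the engine without allocating any searchState.
func compile() *engine {
	e := &engine{}
	e.pool.New = func() any {
		e.created.Add(1)
		return &searchState{buf: make([]byte, 0, 1024)}
	}
	return e
}

// getState takes the cached state if present, otherwise asks the pool,
// which allocates on demand.
func (e *engine) getState() *searchState {
	if s := e.local.Swap(nil); s != nil {
		return s
	}
	return e.pool.Get().(*searchState)
}

// putState refills the single-slot cache first and spills to the pool.
func (e *engine) putState(s *searchState) {
	if !e.local.CompareAndSwap(nil, s) {
		e.pool.Put(s)
	}
}

// IsMatch is a placeholder search that only exercises the state lifecycle.
func (e *engine) IsMatch(haystack []byte) bool {
	s := e.getState()
	defer e.putState(s)
	s.buf = append(s.buf[:0], haystack...)
	return len(s.buf) > 0
}

func main() {
	e := compile()
	fmt.Println("states after compile:", e.created.Load())
	e.IsMatch([]byte("abc"))
	e.IsMatch([]byte("def"))
	fmt.Println("states after two searches:", e.created.Load())
}
```

The trade-off is that the first search pays the allocation cost that compilation used to pay, which suits workloads where many compiled patterns are never searched.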
@@ -707,14 +711,27 @@ func configurePikeVMSkipAhead(pikevm *nfa.PikeVM, pf prefilter.Prefilter, isStar
}
}

// sharePikeVMWithDFAs creates a single shared PikeVM and injects it into the
// forward DFA, eliminating a duplicate allocation (~15-20 KB per DFA for a
// 100-state NFA). Reverse DFAs are built from different (reversed) NFAs, so
// they keep their own PikeVMs.
func sharePikeVMWithDFAs(nfaEngine *nfa.NFA, engines strategyEngines) {
	shared := nfa.NewPikeVM(nfaEngine)
	if engines.dfa != nil {
		engines.dfa.SetPikeVM(shared)
	}
}

// buildSearchStateConfig extracts all DFA references needed for per-search caches.
// Strategy-specific DFAs come from reverse searchers (which have their own DFAs).
func buildSearchStateConfig(nfaEngine *nfa.NFA, numCaptures int, engines strategyEngines, strategy Strategy) searchStateConfig {
// The strategy is stored so newSearchState can conditionally allocate only what's needed.
func buildSearchStateConfig(nfaEngine *nfa.NFA, numCaptures int, engines strategyEngines, strategy Strategy, hasOnePass bool) searchStateConfig {
cfg := searchStateConfig{
nfaEngine: nfaEngine,
numCaptures: numCaptures,
forwardDFA: engines.dfa,
reverseDFA: engines.reverseDFA,
strategy: strategy,
hasOnePass: hasOnePass,
}

// Extract strategy-specific DFAs from reverse searchers
60 changes: 49 additions & 11 deletions meta/search_state.go
@@ -69,31 +69,69 @@ type searchStateConfig struct {
reverseDFA *lazy.DFA // e.reverseDFA (main engine reverse DFA)
stratFwdDFA *lazy.DFA // strategy-specific forward DFA (reverse searchers)
stratRevDFA *lazy.DFA // strategy-specific reverse DFA (reverse searchers)
strategy Strategy // active strategy — drives conditional allocation
hasOnePass bool // true if OnePass DFA was compiled
}

// newSearchState creates a new SearchState with pre-allocated buffers.
// newSearchState creates a new SearchState with strategy-aware allocation.
// Only components needed by the active strategy are allocated, reducing
// memory per compiled pattern by 30-70% for WAF-style workloads where
// most patterns use simple strategies (NFA, CharClass, Teddy, etc.).
//
// Issue #158: OWASP CRS patterns averaged 152 KB/pattern vs stdlib's 9.5 KB.
// Strategy-aware allocation cuts unnecessary DFA caches and backtracker state.
func newSearchState(cfg searchStateConfig) *SearchState {
state := &SearchState{
backtracker: nfa.NewBacktrackerState(),
pikevm: nfa.NewPikeVM(cfg.nfaEngine),
state := &SearchState{}

// PikeVM is always needed — it's the universal fallback engine and used
// by FindSubmatch two-phase search for capture extraction.
// Issue #158: Use lazy initialization — the PikeVM's internal state (thread
// queues, sparse set, ~10 KB for 100-state NFA) is allocated on first search,
// not at SearchState creation. For strategies like UseCharClassSearcher or
// UseTeddy, the PikeVM is rarely invoked, so this saves significant memory
// in WAF workloads with ~900 patterns.
state.pikevm = nfa.NewPikeVMLazy(cfg.nfaEngine)

// Backtracker state: allocate only for strategies that use it, namely
// UseBoundedBacktracker and UseNFA (backtracker fallback for small NFAs),
// plus DFA-based strategies that may overflow to the backtracker.
switch cfg.strategy {
case UseBoundedBacktracker, UseNFA, UseDFA, UseBoth, UseDigitPrefilter:
state.backtracker = nfa.NewBacktrackerState()
}

// Create per-search DFA caches for thread-safe concurrent access.
// Forward DFA cache: only if a forward DFA was compiled AND strategy uses it.
if cfg.forwardDFA != nil {
state.dfaCache = cfg.forwardDFA.NewCache()
switch cfg.strategy {
case UseDFA, UseBoth, UseDigitPrefilter, UseBoundedBacktracker:
state.dfaCache = cfg.forwardDFA.NewCache()
}
}

// Reverse DFA cache: only for bidirectional search strategies.
if cfg.reverseDFA != nil {
state.revDFACache = cfg.reverseDFA.NewCache()
switch cfg.strategy {
case UseDFA, UseBoundedBacktracker:
state.revDFACache = cfg.reverseDFA.NewCache()
}
}

// Strategy-specific DFA caches: only for reverse-search strategies.
if cfg.stratFwdDFA != nil {
state.stratFwdCache = cfg.stratFwdDFA.NewCache()
switch cfg.strategy {
case UseReverseSuffix, UseReverseInner, UseReverseSuffixSet, UseMultilineReverseSuffix:
state.stratFwdCache = cfg.stratFwdDFA.NewCache()
}
}
if cfg.stratRevDFA != nil {
state.stratRevCache = cfg.stratRevDFA.NewCache()
switch cfg.strategy {
case UseReverseSuffix, UseReverseInner, UseReverseSuffixSet, UseReverseAnchored:
state.stratRevCache = cfg.stratRevDFA.NewCache()
}
}

// Pre-allocate onepass slots if captures are present
if cfg.numCaptures > 0 {
// OnePass slots: only if OnePass DFA was compiled and captures exist.
if cfg.hasOnePass && cfg.numCaptures > 0 {
state.onepassSlots = make([]int, cfg.numCaptures*2)
state.onepassCache = onepass.NewCache(cfg.numCaptures)
}
40 changes: 39 additions & 1 deletion nfa/pikevm.go
@@ -259,7 +259,8 @@ type MatchWithCaptures struct {
Captures [][]int // Captures[i] = [start, end] for group i, or nil if not captured
}

// NewPikeVM creates a new PikeVM for executing the given NFA
// NewPikeVM creates a new PikeVM for executing the given NFA.
// Internal state (thread queues, sparse set) is pre-allocated immediately.
func NewPikeVM(nfa *NFA) *PikeVM {
p := &PikeVM{
nfa: nfa,
@@ -269,6 +270,31 @@ func NewPikeVM(nfa *NFA) *PikeVM {
return p
}

// NewPikeVMLazy creates a new PikeVM with deferred internal state allocation.
// The NFA reference is stored immediately, but the expensive mutable state
// (thread queues, sparse set, epsilon stack) is NOT allocated until the first
// search method is called. This saves ~10 KB per 100-state NFA.
//
// Issue #158: In WAF workloads with ~900 compiled patterns, each SearchState's
// PikeVM is pre-allocated but may never be used (e.g., CharClass, Teddy, AC
// strategies rarely need PikeVM). Lazy initialization avoids this waste.
//
// The first search call will be slightly slower due to allocation, but subsequent
// calls reuse the allocated state (no per-search allocation overhead).
func NewPikeVMLazy(nfa *NFA) *PikeVM {
	return &PikeVM{
		nfa: nfa,
	}
}

// ensureInternalState lazily initializes the internal PikeVMState if needed.
// Called at the entry point of every search method that uses internalState.
func (p *PikeVM) ensureInternalState() {
	if p.internalState.Visited == nil {
		p.initState(&p.internalState)
	}
}
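
The lazy-initialization guard above can be exercised on its own. The sketch below uses hypothetical stand-ins (`lazyVM`, `vmState`, `ensureState`) and assumes, as `ensureInternalState` does, that a nil slice is a reliable "not yet initialized" marker.

```go
package main

import "fmt"

// vmState stands in for the mutable scratch space a search needs.
type vmState struct {
	visited []bool
}

// lazyVM is a hypothetical stand-in for a PikeVM with deferred state.
type lazyVM struct {
	numStates int
	state     vmState
}

// newLazyVM stores only the cheap, immutable parts.
func newLazyVM(numStates int) *lazyVM { return &lazyVM{numStates: numStates} }

// ensureState allocates the scratch space on first use and is a no-op afterwards.
func (v *lazyVM) ensureState() {
	if v.state.visited == nil {
		v.state.visited = make([]bool, v.numStates)
	}
}

// allocated reports whether the scratch space exists yet.
func (v *lazyVM) allocated() bool { return v.state.visited != nil }

// Search is a placeholder; it only shows where the guard is called.
func (v *lazyVM) Search(haystack []byte) bool {
	v.ensureState()
	return len(haystack) > 0
}

func main() {
	vm := newLazyVM(100)
	fmt.Println("allocated after construction:", vm.allocated())
	vm.Search([]byte("abc"))
	fmt.Println("allocated after first search:", vm.allocated())
}
```

The guard is a plain nil check with no synchronization, which matches the existing contract that methods using internal state are not thread-safe.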

// initState initializes a PikeVMState for use with this PikeVM.
// Call this to prepare a state before using it with *WithState methods.
func (p *PikeVM) initState(state *PikeVMState) {
@@ -387,6 +413,7 @@ func updateCapture(caps cowCaptures, groupIndex uint32, isStart bool, pos int) c
// This method uses internal state and is NOT thread-safe.
// For concurrent usage, use SearchWithState.
func (p *PikeVM) Search(haystack []byte) (int, int, bool) {
p.ensureInternalState()
return p.SearchAt(haystack, 0)
}

@@ -397,6 +424,7 @@ func (p *PikeVM) Search(haystack []byte) (int, int, bool) {
// This is significantly faster than Search() when you only need to know
// if a match exists, not where it is.
func (p *PikeVM) IsMatch(haystack []byte) bool {
p.ensureInternalState()
if len(haystack) == 0 {
return p.matchesEmpty()
}
@@ -717,6 +745,7 @@ func (p *PikeVM) addThreadToNextForMatch(id StateID, haystack []byte, pos int) {
// Unlike Search, it takes the FULL haystack and a starting position, so assertions
// like ^ correctly check against the original input start, not a sliced position.
func (p *PikeVM) SearchAt(haystack []byte, at int) (int, int, bool) {
p.ensureInternalState()
if at > len(haystack) {
return -1, -1, false
}
@@ -865,6 +894,7 @@ func (p *PikeVM) searchUnanchoredAt(haystack []byte, startAt int) (int, int, boo
//
// Performance: O(maxEnd - startAt) instead of O(len(haystack) - startAt).
func (p *PikeVM) SearchBetween(haystack []byte, startAt, maxEnd int) (int, int, bool) {
p.ensureInternalState()
if startAt > len(haystack) || startAt >= maxEnd {
return -1, -1, false
}
@@ -963,6 +993,7 @@ func (p *PikeVM) searchUnanchoredBetween(haystack []byte, startAt, maxEnd int) (
// SearchWithCaptures finds the first match with capture group positions.
// Returns nil if no match is found.
func (p *PikeVM) SearchWithCaptures(haystack []byte) *MatchWithCaptures {
p.ensureInternalState()
return p.SearchWithCapturesAt(haystack, 0)
}

@@ -973,6 +1004,7 @@ func (p *PikeVM) SearchWithCaptures(haystack []byte) *MatchWithCaptures {
// This method is used by FindAll* operations to correctly handle anchors like ^.
// Unlike SearchWithCaptures, it takes the FULL haystack and a starting position.
func (p *PikeVM) SearchWithCapturesAt(haystack []byte, at int) *MatchWithCaptures {
p.ensureInternalState()
if at > len(haystack) {
return nil
}
@@ -1176,6 +1208,7 @@ func (p *PikeVM) searchAtWithCaptures(haystack []byte, startPos int) *MatchWithC
//
//nolint:gocognit // Merged match-check + step loop (Rust's nexts pattern) is inherently complex
func (p *PikeVM) SearchWithCapturesInSpan(haystack []byte, spanStart, spanEnd int) *MatchWithCaptures {
p.ensureInternalState()
if spanStart > spanEnd || spanEnd > len(haystack) {
return nil
}
@@ -1279,6 +1312,7 @@ func (p *PikeVM) buildCapturesResult(caps []int, matchStart, matchEnd int) [][]i
// SearchAll finds all non-overlapping matches in the haystack.
// Returns a slice of matches in order of occurrence.
func (p *PikeVM) SearchAll(haystack []byte) []Match {
p.ensureInternalState()
var matches []Match
pos := 0

@@ -1661,6 +1695,7 @@ func checkLookAssertion(look Look, haystack []byte, pos int) bool {
//
// This method uses internal state and is NOT thread-safe.
func (p *PikeVM) SearchWithSlotTable(haystack []byte, mode SearchMode) (int, int, bool) {
p.ensureInternalState()
return p.SearchWithSlotTableAt(haystack, 0, mode)
}

@@ -1674,6 +1709,7 @@ func (p *PikeVM) SearchWithSlotTable(haystack []byte, mode SearchMode) (int, int
//
// Returns (start, end, found) for the first match.
func (p *PikeVM) SearchWithSlotTableAt(haystack []byte, at int, mode SearchMode) (int, int, bool) {
p.ensureInternalState()
if at > len(haystack) {
return -1, -1, false
}
@@ -2140,13 +2176,15 @@ func (p *PikeVM) addSearchThreadToNext(t searchThread, srcState StateID, haystac
// SearchWithSlotTableCaptures finds the first match and returns captures.
// Uses zero-allocation SlotTable architecture (Rust approach).
func (p *PikeVM) SearchWithSlotTableCaptures(haystack []byte) *MatchWithCaptures {
p.ensureInternalState()
return p.SearchWithSlotTableCapturesAt(haystack, 0)
}

// SearchWithSlotTableCapturesAt finds the first match with captures starting from 'at'.
// Uses dual SlotTable (curr/next) for zero-allocation capture tracking.
// Matches Rust's PikeVM Cache with curr/next ActiveStates (pikevm.rs:1878).
func (p *PikeVM) SearchWithSlotTableCapturesAt(haystack []byte, at int) *MatchWithCaptures {
p.ensureInternalState()
if at > len(haystack) {
return nil
}