earley to lalr parsing for cypher by mj3cheun · Pull Request #1653 · graphistry/pygraphistry

mj3cheun · 2026-06-26T22:41:08Z

Switch GFQL Cypher WHERE parser from Earley to LALR(1) (#1031)

Summary

Unifies the Cypher WHERE grammar and replaces the Earley parser with a single LALR(1) parser. The dual where_clause: where_predicates | expr rule had a FIRST-set ambiguity that forced Earley; collapsing it to one generic boolean expr makes the grammar LALR(1)-parseable. The structured-vs-row routing the old grammar did in-grammar is now recovered by a post-parse lift, so behavior is preserved (and slightly extended) while parsing gets dramatically cheaper.

What changed

Grammar

where_clause unified to "WHERE"i expr -> generic_where_clause (removed the where_predicates alternative). where_predicate is retained only as a start symbol for the lift parser.

Parsers (parser.py)

_parser() (Earley) → _parser_lalr() (parser="lalr"), now the sole whole-query parser.
_pattern_parser() switched Earley → LALR(1).
New _where_predicate_parser() (start="where_predicate", LALR) backing the lift.

Post-parse routing lift (new, parser.py)

_lift_atom_as_where_predicate() — re-parses one WHERE atom into a structured WherePredicate, or None.
_lift_and_spine_predicates() — flattens the top-level AND spine and lifts every conjunct; all-or-nothing. Wired into generic_where_clause: a fully-liftable clause becomes structured predicates (→ filter_dict), otherwise the whole clause stays on expr_tree (→ where_rows).

Removed

The now-unreachable where_predicates and where_clause transformer methods.

Docs / tests

New parser tests: single-LALR corpus parse, Earley-removal assertions, and flat/nested/OR lift-routing parity.
CHANGELOG entry; ai/gfql_where_routing_optimization.md documenting deferred optimizations (OR→is_in, partial pushdown) for backend-aware reimplementation.

Behavior impact

No query-syntax changes. Every previously-supported query still parses, including OR/XOR/NOT-in-WHERE and post-WITH WHERE.
Query results unchanged.
One internal routing change (result-identical): nested/parenthesized ANDs of simple predicates — e.g. WHERE a = 1 AND (b = 2 AND c = 3) — now route to the structured filter_dict fast path. The old flat where_predicate ("AND" where_predicate)* rule couldn't see through parentheses and sent these to where_rows. The reroute is safe by AND-associativity (a AND (b AND c) ≡ a AND b AND c) and one-directional (only where_rows → filter_dict, never the reverse); flat-AND routing is byte-for-byte unchanged.

Performance impact

Parse: LALR(1) replaces Earley on every supported query (~80× faster parse path per the former Earley cost).

Execution (nested-AND reroute, cuDF, 1M rows, measured):

nested-AND workload	before (`where_rows`)	after (`filter_dict`)	speedup
equality `a=1 AND (b=2 AND c=3)`	364 ms	80 ms	4.5×
string `CONTAINS` (3 terms)	102 ms	69 ms	1.5×
low-selectivity `<>` (each keeps ~90%)	350 ms	77 ms	4.5×

The win scales with the number of AND terms (each term is a full-frame eager-3VL pass under where_rows), so multi-term nested ANDs are exactly the favorable case. No new pessimization class is introduced — flat ANDs already used filter_dict; this just makes the parenthesized form consistent.

Tests

graphistry/tests/compute/gfql/cypher/test_parser.py: 165 passing
Full graphistry/tests/compute/gfql/cypher/: 1655 passing, 15 xfailed (pre-existing)

…ows routing via post-parse lift

lmeyerov · 2026-06-27T00:10:14Z

I found a behavior regression with a focused base-vs-PR pressure test on dgx-spark, using the RAPIDS/cuDF image graphistry/test-rapids-official:26.02-gfql with cuDF 26.02.01. This was not just parser-shape testing: I ran the same Cypher queries against base f08c4588 and current PR head 036278ce, in both pandas and cuDF.

The failing case is a missing-property null check inside a parenthesized AND:

MATCH (n)
WHERE n.missing IS NULL AND (n.x = 1)
RETURN n.id AS id
ORDER BY id

Base returns rows where x = 1, treating the absent property as null:

[{"id": "a"}, {"id": "c"}, {"id": "d"}]

PR head raises in both pandas and cuDF:

GFQLSchemaError: Column "missing" does not exist in node dataframe

The same problem happens for:

MATCH (n)
WHERE n.missing IS NOT NULL AND (n.x = 1)
RETURN n.id AS id
ORDER BY id

Base returns no rows. PR head raises the same missing-column error in both pandas and cuDF.

A fixed-length nested-AND control case still passes on both base and PR, so the issue is not every nested AND. The unsafe change is lifting IS NULL / IS NOT NULL checks on possibly absent properties into the early property-filter path. The old row-filter path preserves missing-property-as-null behavior; the new early property filter requires the dataframe column to exist.

Suggested fix: do not lift missing-property-sensitive null checks into early property filters, or make the early property filter match row-filter behavior for absent properties.

Remote repro artifacts from my run:

Script: /tmp/lmeyerov-codex-pr1653-20260626-233411/focused_where_pressure.py
Report: /tmp/lmeyerov-codex-pr1653-20260626-233411/focused-where-pressure-report-r2.md
Raw outputs: /tmp/lmeyerov-codex-pr1653-20260626-233411/focused-base-r2.json and /tmp/lmeyerov-codex-pr1653-20260626-233411/focused-head-r2.json

mj3cheun force-pushed the dev/cypher-lalr-parser-switch branch from 48fe494 to d34f351 Compare June 26, 2026 22:44

switch WHERE parser from Earley to LALR; preserve filter_dict/where_r…

0c836a1

…ows routing via post-parse lift

mj3cheun force-pushed the dev/cypher-lalr-parser-switch branch from d34f351 to 0c836a1 Compare June 26, 2026 22:47

fix predicate span source character offsets

036278c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

earley to lalr parsing for cypher#1653

earley to lalr parsing for cypher#1653
mj3cheun wants to merge 2 commits into
masterfrom
dev/cypher-lalr-parser-switch

mj3cheun commented Jun 26, 2026 •

edited

Loading

Uh oh!

lmeyerov commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

mj3cheun commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Switch GFQL Cypher WHERE parser from Earley to LALR(1) (#1031)

Summary

What changed

Behavior impact

Performance impact

Tests

Uh oh!

lmeyerov commented Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mj3cheun commented Jun 26, 2026 •

edited

Loading