Skip to content

earley to lalr parsing for cypher#1653

Open
mj3cheun wants to merge 2 commits into
masterfrom
dev/cypher-lalr-parser-switch
Open

earley to lalr parsing for cypher#1653
mj3cheun wants to merge 2 commits into
masterfrom
dev/cypher-lalr-parser-switch

Conversation

@mj3cheun

@mj3cheun mj3cheun commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Switch GFQL Cypher WHERE parser from Earley to LALR(1) (#1031)

Summary

Unifies the Cypher WHERE grammar and replaces the Earley parser with a single LALR(1) parser. The dual where_clause: where_predicates | expr rule had a FIRST-set ambiguity that forced Earley; collapsing it to one generic boolean expr makes the grammar LALR(1)-parseable. The structured-vs-row routing the old grammar did in-grammar is now recovered by a post-parse lift, so behavior is preserved (and slightly extended) while parsing gets dramatically cheaper.

What changed

Grammar

  • where_clause unified to "WHERE"i expr -> generic_where_clause (removed the where_predicates alternative). where_predicate is retained only as a start symbol for the lift parser.

Parsers (parser.py)

  • _parser() (Earley) → _parser_lalr() (parser="lalr"), now the sole whole-query parser.
  • _pattern_parser() switched Earley → LALR(1).
  • New _where_predicate_parser() (start="where_predicate", LALR) backing the lift.

Post-parse routing lift (new, parser.py)

  • _lift_atom_as_where_predicate() — re-parses one WHERE atom into a structured WherePredicate, or None.
  • _lift_and_spine_predicates() — flattens the top-level AND spine and lifts every conjunct; all-or-nothing. Wired into generic_where_clause: a fully-liftable clause becomes structured predicates (→ filter_dict), otherwise the whole clause stays on expr_tree (→ where_rows).

Removed

  • The now-unreachable where_predicates and where_clause transformer methods.

Docs / tests

  • New parser tests: single-LALR corpus parse, Earley-removal assertions, and flat/nested/OR lift-routing parity.
  • CHANGELOG entry; ai/gfql_where_routing_optimization.md documenting deferred optimizations (OR→is_in, partial pushdown) for backend-aware reimplementation.

Behavior impact

  • No query-syntax changes. Every previously-supported query still parses, including OR/XOR/NOT-in-WHERE and post-WITH WHERE.
  • Query results unchanged.
  • One internal routing change (result-identical): nested/parenthesized ANDs of simple predicates — e.g. WHERE a = 1 AND (b = 2 AND c = 3) — now route to the structured filter_dict fast path. The old flat where_predicate ("AND" where_predicate)* rule couldn't see through parentheses and sent these to where_rows. The reroute is safe by AND-associativity (a AND (b AND c) ≡ a AND b AND c) and one-directional (only where_rows → filter_dict, never the reverse); flat-AND routing is byte-for-byte unchanged.

Performance impact

  • Parse: LALR(1) replaces Earley on every supported query (~80× faster parse path per the former Earley cost).

  • Execution (nested-AND reroute, cuDF, 1M rows, measured):

    nested-AND workload before (where_rows) after (filter_dict) speedup
    equality a=1 AND (b=2 AND c=3) 364 ms 80 ms 4.5×
    string CONTAINS (3 terms) 102 ms 69 ms 1.5×
    low-selectivity <> (each keeps ~90%) 350 ms 77 ms 4.5×

    The win scales with the number of AND terms (each term is a full-frame eager-3VL pass under where_rows), so multi-term nested ANDs are exactly the favorable case. No new pessimization class is introduced — flat ANDs already used filter_dict; this just makes the parenthesized form consistent.

Tests

  • graphistry/tests/compute/gfql/cypher/test_parser.py: 165 passing
  • Full graphistry/tests/compute/gfql/cypher/: 1655 passing, 15 xfailed (pre-existing)

@mj3cheun mj3cheun force-pushed the dev/cypher-lalr-parser-switch branch from 48fe494 to d34f351 Compare June 26, 2026 22:44
@mj3cheun mj3cheun force-pushed the dev/cypher-lalr-parser-switch branch from d34f351 to 0c836a1 Compare June 26, 2026 22:47
@lmeyerov

Copy link
Copy Markdown
Contributor

I found a behavior regression with a focused base-vs-PR pressure test on dgx-spark, using the RAPIDS/cuDF image graphistry/test-rapids-official:26.02-gfql with cuDF 26.02.01. This was not just parser-shape testing: I ran the same Cypher queries against base f08c4588 and current PR head 036278ce, in both pandas and cuDF.

The failing case is a missing-property null check inside a parenthesized AND:

MATCH (n)
WHERE n.missing IS NULL AND (n.x = 1)
RETURN n.id AS id
ORDER BY id

Base returns rows where x = 1, treating the absent property as null:

[{"id": "a"}, {"id": "c"}, {"id": "d"}]

PR head raises in both pandas and cuDF:

GFQLSchemaError: Column "missing" does not exist in node dataframe

The same problem happens for:

MATCH (n)
WHERE n.missing IS NOT NULL AND (n.x = 1)
RETURN n.id AS id
ORDER BY id

Base returns no rows. PR head raises the same missing-column error in both pandas and cuDF.

A fixed-length nested-AND control case still passes on both base and PR, so the issue is not every nested AND. The unsafe change is lifting IS NULL / IS NOT NULL checks on possibly absent properties into the early property-filter path. The old row-filter path preserves missing-property-as-null behavior; the new early property filter requires the dataframe column to exist.

Suggested fix: do not lift missing-property-sensitive null checks into early property filters, or make the early property filter match row-filter behavior for absent properties.

Remote repro artifacts from my run:

  • Script: /tmp/lmeyerov-codex-pr1653-20260626-233411/focused_where_pressure.py
  • Report: /tmp/lmeyerov-codex-pr1653-20260626-233411/focused-where-pressure-report-r2.md
  • Raw outputs: /tmp/lmeyerov-codex-pr1653-20260626-233411/focused-base-r2.json and /tmp/lmeyerov-codex-pr1653-20260626-233411/focused-head-r2.json

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants