earley to lalr parsing for cypher #1653
I found a behavior regression with a focused base-vs-PR pressure test. The failing case is a missing-property null check in a WHERE clause that also has a parenthesized conjunct:

    MATCH (n)
    WHERE n.missing IS NULL AND (n.x = 1)
    RETURN n.id AS id
    ORDER BY id

Base returns rows:

    [{"id": "a"}, {"id": "c"}, {"id": "d"}]

PR head raises a missing-column error in both pandas and cuDF.

The same problem happens for:

    MATCH (n)
    WHERE n.missing IS NOT NULL AND (n.x = 1)
    RETURN n.id AS id
    ORDER BY id

Base returns no rows. PR head raises the same missing-column error in both pandas and cuDF.

A fixed-length nested-AND control case still passes on both base and PR, so the issue does not affect every nested AND.

Suggested fix: do not lift missing-property-sensitive null checks into early property filters, or make the early property filter match row-filter behavior for absent properties.

Remote repro artifacts from my run:
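To make the two behaviors concrete, here is a minimal pure-Python sketch; it is not the project's code. Rows are plain dicts, the sample data and the helper names (`row_filter`, `naive_early_filter`, `safe_early_filter`, `to_columns`, `ids`) are hypothetical, and the pandas/cuDF column lookup is stood in for by a dict-of-columns lookup that raises `KeyError` when the property is absent. It only illustrates the second option in the suggested fix: making the early filter treat an absent property as null, the way row-wise evaluation does.

```python
# Hypothetical node table: no node has a "missing" property.
NODES = [
    {"id": "a", "x": 1},
    {"id": "b", "x": 2},
    {"id": "c", "x": 1},
    {"id": "d", "x": 1},
]


def row_filter(rows):
    """Row-wise evaluation: an absent property reads as null (None)."""
    return [r for r in rows if r.get("missing") is None and r.get("x") == 1]


def to_columns(rows):
    """Columnar view holding only the properties that some row actually has."""
    names = sorted({k for r in rows for k in r})
    return {name: [r.get(name) for r in rows] for name in names}


def naive_early_filter(rows):
    """Column-wise filter that assumes every referenced column exists."""
    cols = to_columns(rows)
    missing = cols["missing"]  # raises KeyError: the column is absent
    keep = [m is None and x == 1 for m, x in zip(missing, cols["x"])]
    return [r for r, k in zip(rows, keep) if k]


def safe_early_filter(rows):
    """Column-wise filter that treats an absent column as all-null."""
    cols = to_columns(rows)
    missing = cols.get("missing", [None] * len(rows))
    keep = [m is None and x == 1 for m, x in zip(missing, cols["x"])]
    return [r for r, k in zip(rows, keep) if k]


def ids(rows):
    """Sorted ids, mirroring RETURN n.id AS id ORDER BY id."""
    return sorted(r["id"] for r in rows)
```

On this toy data, `ids(row_filter(NODES))` and `ids(safe_early_filter(NODES))` agree, while `naive_early_filter(NODES)` fails on the absent column, which is the shape of the regression reported above.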
Switch GFQL Cypher WHERE parser from Earley to LALR(1) (#1031)
Summary
Unifies the Cypher WHERE grammar and replaces the Earley parser with a single LALR(1) parser. The dual where_clause: where_predicates | expr rule had a FIRST-set ambiguity that forced Earley; collapsing it to one generic boolean expr makes the grammar LALR(1)-parseable. The structured-vs-row routing that the old grammar did in-grammar is now recovered by a post-parse lift, so behavior is preserved (and slightly extended) while parsing gets dramatically cheaper.
What changed
Grammar
- where_clause unified to "WHERE"i expr -> generic_where_clause (removed the where_predicates alternative).
- where_predicate is retained only as a start symbol for the lift parser.
Parsers (parser.py)
- _parser() (Earley) → _parser_lalr() (parser="lalr"), now the sole whole-query parser.
- _pattern_parser() switched Earley → LALR(1).
- _where_predicate_parser() (start="where_predicate", LALR) backing the lift.
Post-parse routing lift (new, parser.py)
- _lift_atom_as_where_predicate(): re-parses one WHERE atom into a structured WherePredicate, or None.
- _lift_and_spine_predicates(): flattens the top-level AND spine and lifts every conjunct; all-or-nothing. Wired into generic_where_clause: a fully-liftable clause becomes structured predicates (→ filter_dict), otherwise the whole clause stays on expr_tree (→ where_rows).
Removed
- where_predicates and where_clause transformer methods.
Docs / tests
- ai/gfql_where_routing_optimization.md documenting deferred optimizations (OR→is_in, partial pushdown) for backend-aware reimplementation.
Behavior impact
- Parenthesized nested ANDs such as WHERE a = 1 AND (b = 2 AND c = 3) now route to the structured filter_dict fast path. The old flat where_predicate ("AND" where_predicate)* rule couldn't see through parentheses and sent these to where_rows. The reroute is safe by AND-associativity (a AND (b AND c) ≡ a AND b AND c) and one-directional (only where_rows → filter_dict, never the reverse); flat-AND routing is byte-for-byte unchanged.
Performance impact
Parse: LALR(1) replaces Earley on every supported query (~80× faster parse path per the former Earley cost).
Execution (nested-AND reroute, cuDF, 1M rows, measured):
Cases compared (where_rows vs. filter_dict):
- a=1 AND (b=2 AND c=3)
- CONTAINS (3 terms)
- <> (each keeps ~90%)
The win scales with the number of AND terms (each term is a full-frame eager-3VL pass under where_rows), so multi-term nested ANDs are exactly the favorable case. No new pessimization class is introduced: flat ANDs already used filter_dict; this just makes the parenthesized form consistent.
Tests
- graphistry/tests/compute/gfql/cypher/test_parser.py: 165 passing
- graphistry/tests/compute/gfql/cypher/: 1655 passing, 15 xfailed (pre-existing)
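
The all-or-nothing lift described under "Post-parse routing lift" can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in parser.py: the tuple-based AST, the set of liftable operators, and the names `flatten_and_spine`, `lift_atom`, and `route` are all hypothetical, and the strings "filter_dict" and "where_rows" only label the destination path.

```python
# Hypothetical AST: ("and", left, right), ("or", left, right),
# or a leaf comparison ("cmp", property, operator, literal).
# Parentheses are assumed to leave no node of their own in the tree.

LIFTABLE_OPS = {"=", "<>", "<", ">", "<=", ">="}  # assumed set, for illustration


def flatten_and_spine(expr):
    """Collect the conjuncts of a (possibly nested) AND tree, left to right."""
    if expr[0] == "and":
        return flatten_and_spine(expr[1]) + flatten_and_spine(expr[2])
    return [expr]


def lift_atom(expr):
    """Return a structured (property, operator, literal) triple, or None."""
    if expr[0] == "cmp" and expr[2] in LIFTABLE_OPS:
        return (expr[1], expr[2], expr[3])
    return None


def route(expr):
    """All-or-nothing: structured predicates only if every conjunct lifts."""
    lifted = [lift_atom(c) for c in flatten_and_spine(expr)]
    if all(p is not None for p in lifted):
        return ("filter_dict", lifted)
    return ("where_rows", expr)


A = ("cmp", "a", "=", 1)
B = ("cmp", "b", "=", 2)
C = ("cmp", "c", "=", 3)

nested = ("and", A, ("and", B, C))   # a = 1 AND (b = 2 AND c = 3)
flat = ("and", ("and", A, B), C)     # a = 1 AND b = 2 AND c = 3
with_or = ("and", A, ("or", B, C))   # a = 1 AND (b = 2 OR c = 3)
```

In this sketch `route(nested)` and `route(flat)` produce the same structured predicates, which is the AND-associativity argument in miniature, while `route(with_or)` keeps the whole clause on the row path because one conjunct does not lift.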