feat(gfql): native lazy Polars engine — collect-once traversals + cypher row pipeline#1648
Open
lmeyerov wants to merge 24 commits into
Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is cached, profiling the residual fixed per-call compile cost showed it is dominated by parse_expr(), the GFQL row-expression parser. It is invoked ~4x per query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilds a Lark transformer that defines a frozen dataclass via runtime exec (dataclasses._process_class / _create_fn / exec churn).

parse_expr() is a pure function of the expression string (no params/schema) and returns a tree of frozen dataclasses (17 frozen node types, tuple-valued fields, immutable). So identical expressions, which are re-parsed on every compile and recur across queries (e.g. `a.val > 50`), are memoized via lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only successful parses are cached. No source consumer outside graphistry/compute/gfql calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__).

Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed per-call cost of a repeated string Cypher query drops a further ~4.5 ms -> ~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms), near parity with the equivalent native chain.

Touches only expr_parser.py (+ its test), disjoint from PR #1652's parser.py, so the two apply independently in either order.

Tests: 3 focused expr-cache tests (memoization identity, distinct + hit registration, invalid non-caching). Full graphistry/tests/compute/gfql/: 2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path test in an image that has cugraph). ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Terminal Cypher `RETURN a` (whole node/edge) previously emitted one column of
Cypher display strings (`({id: 51, val: 51, kind: 'a'})`) built row-wise. The
string is a *presentation* format (it matches the cypher-shell / TCK oracle),
not data — callers had to re-parse it to use it, and constructing it is O(rows).
This flattens whole-entity returns into structured `{alias}.{field}` columns
(`a.id, a.val, a.kind`, ...) by default. The per-field columns already exist on
the working frame before projection, so this is "stop collapsing", not
"rebuild": near-free, lossless, directly usable, and it survives JSON / CSV /
Parquet / Arrow serialization and `plot()`.
Measured (dgx-spark, median-of-7, RETURN a vs old text form):
pandas @100k 32 vs 204 ms (6.4x); cuDF @100k 27 vs 114 ms (4.3x). Win grows
with row count (text render is O(rows); flat is ~free).
Design:
- `apply_result_projection(..., structured=True)` emits flat columns for
whole-entity returns; `structured=False` keeps the legacy single
Cypher-display-string column. The OPTIONAL-MATCH null-fill / projection
row-guard paths (which still consume a single-column entity value for row
alignment) opt out via this flag and are unchanged.
- A synthesized null/absent-entity row (top-level OPTIONAL-MATCH miss or
OPTIONAL WITH-reentry no-match, built by `_apply_empty_result_row` as a
single `{alias: None}` column) has no field columns to flatten, so it falls
back to the single-column text form — rendering to None and preserving the
shape the OPTIONAL / reentry machinery consumes for identity recovery and
no-match detection. Real rows always carry flat fields and flatten.
- Text is now presentation-only: `render_entity_text(result, alias)`
reconstructs the Cypher display string on demand (used by the conformance /
TCK driver and any caller wanting the human-readable form). The structured
data path never pays the render cost.
- The entity-projection meta `ids` snapshot (`.copy()`) is retained — bounded
reentry recovers carried node identities from it and must not alias the live
frame (#1356).
Tests: whole-entity text assertions migrated to an `entity_text_records` shim
that renders flat -> text for comparison against the pre-#1650 Cypher-text
oracle; grouping / connected-optional / null_fill paths (still single-column
text) keep direct text assertions; flat-shape + render-helper + meta tests
added. gfql/cypher + row suites: 1646 passed, 15 xfailed (only the unrelated
in-container networkx setup.py packaging artifact fails).
Cross-repo follow-ups (separate, after this lands): tck-gfql conformance
adapter (structured -> text at the comparison hook) and pyg-bench probes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
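To make the flat-versus-text distinction concrete, here is a small sketch of rendering a display string on demand from flat `{alias}.{field}` values. It assumes a plain dict per row and handles only int and string values; `render_entity_text_sketch` is a hypothetical name, and the real `render_entity_text(result, alias)` operates on a result object whose internals are not shown in this PR text.

```python
def render_entity_text_sketch(row: dict, alias: str):
    """Rebuild a Cypher-style display string from flat `{alias}.{field}` keys."""
    prefix = alias + "."
    # Collect the flattened fields for this alias, preserving column order.
    fields = {k[len(prefix):]: v for k, v in row.items() if k.startswith(prefix)}
    if not fields:
        # No flat fields to render, e.g. a synthesized absent-entity row.
        return None

    def fmt(value):
        # Strings are single-quoted; other values use their plain repr.
        return "'" + value + "'" if isinstance(value, str) else str(value)

    body = ", ".join(f"{name}: {fmt(value)}" for name, value in fields.items())
    return "({" + body + "})"
```

The structured path only carries the flat values; the string is built when a caller asks for it, which is why the data path does not pay the per-row render cost.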
…tructured returns Squashed reconciliation of the native lazy Polars GFQL engine (was #1648's 28 commits; full history preserved at tag bak/1648) restacked onto the colleague's #1656 structured whole-entity returns + #1657 parse_expr memo. Engine: native polars hop/chain (semi/anti joins), native cypher row pipeline (select/where/order_by/group_by/unwind/projection), lazy single-hop collect-once with CPU/GPU execution targets (gfql/lazy/). NO pandas bridge — native or honest NotImplementedError (plan.md NO-CHEATING). Reconciliation with #1650 structured returns: apply_result_projection now threads `structured` to the polars path (apply_result_projection_polars). Whole-entity RETURN a flattens to {alias}.{field} columns natively (mirrors the pandas _flat_entity_field_names selection exactly), which — unlike the legacy entity-text expr — works for ANY dtype (float/temporal/nested just become columns), so polars structured == pandas structured across the board. structured=False still renders the native Cypher display string for int/string/bool single-entity nodes. _include_numeric_id_as_property is now polars-aware so id flattens identically. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…alized hops Build the whole forward/backward combine (combine_nodes/edges + endpoint + alias names) as ONE deferred pl.LazyFrame plan over the already-materialized hop frames and collect once, instead of ~a dozen eager ops that each internally lazy().op().collect(). Stable order columns (NORD/EORD) restore the eager g._nodes/g._edges order since lazy joins don't preserve it -> trailing LIMIT/SKIP unaffected, byte-identical (full polars conformance + row-pipeline parity, 2858 gfql tests). NO recompute (inputs materialized; unlike the disproven whole-chain fusion). ~5% faster polars 1-hop chain @1M/@10m; GPU-target neutral. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A single MATCH (n) with no edge hop — the dominant tabular/crossfilter shape (MATCH (n) WHERE/RETURN ..., histograms, filters, table search) — now returns the filtered node table directly and skips the whole forward/backward/combine + collect_all (~2.5 ms fixed cost that dominated small/interactive queries). Byte-identical (full polars conformance + row-pipeline parity, 389 polars tests). Moves the polars>pandas crossover BELOW 100K for real product workloads: categorical histogram 0.68->1.70x @100k / 1.38->7.62x @1m; node filter 2.44->13.85x @1m; timeline 2.55->8.12x @1m. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A single MATCH (a)-[e]->(b) with both nodes unconstrained (no filter/name/query) and a plain edge (no match/name/query), the basic graph query and the viz edge-crossfilter MATCH, returns ALL edges + their endpoint nodes directly (direction-independent; isolated nodes excluded), skipping forward/backward/combine. For unconstrained nodes the backward pass prunes nothing, so this is byte-identical (full polars conformance + row-pipeline parity + adversarial graphs: dup/self-loop/cycle/isolated). ~9x faster polars [n,e,n]: 95.6->10.3 ms @1m, 855->99 ms @10m.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend the unconstrained 1-hop fast path to filtered nodes: MATCH (a {f})-[e]->(b)
(src/dst/both filters; the dominant "filter then expand" viz crossfilter pattern)
returns the edges whose endpoints pass the node filters + those endpoint nodes,
skipping forward/backward/combine. For one hop the backward pass prunes nothing
beyond the endpoint filters, so byte-identical (verified vs pandas: src/dst/both
filters, reverse, dup/self-loop/cycle/isolated; full polars conformance +
row-pipeline parity, 2858 gfql tests). Unconstrained: all edges any direction;
filtered: forward/reverse (filtered-undirected falls through to the full path).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
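The single-hop fast path described above amounts to "filter edges by their endpoints, then keep the endpoint nodes". A minimal sketch over plain Python lists follows; the function name, the `id`/`s`/`d` keys, and the equality-only filter dicts are assumptions for illustration, and the engine itself works on Polars frames with semi joins rather than Python sets.

```python
def one_hop_fast_path(nodes, edges, src_filter=None, dst_filter=None):
    """nodes: dicts with an 'id' key; edges: dicts with 's' and 'd' keys."""

    def passing_ids(flt):
        # A filter of None means "unconstrained": every node passes.
        return {
            n["id"]
            for n in nodes
            if flt is None or all(n.get(k) == v for k, v in flt.items())
        }

    src_ok = passing_ids(src_filter)
    dst_ok = passing_ids(dst_filter)
    # Keep edges whose source and destination both pass their node filter.
    kept_edges = [e for e in edges if e["s"] in src_ok and e["d"] in dst_ok]
    # Only endpoints of surviving edges are returned; isolated nodes drop out.
    endpoint_ids = {e["s"] for e in kept_edges} | {e["d"] for e in kept_edges}
    kept_nodes = [n for n in nodes if n["id"] in endpoint_ids]
    return kept_nodes, kept_edges
```

For a single hop there is no later step that could invalidate an edge, which is the reason the commit gives for skipping the backward pass.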
…aster) The GPU target collects with pl.GPUEngine(executor="in-memory") instead of the default streaming engine="gpu" (DefaultSingletonEngine). GFQL results fit in device memory, the in-memory engine's regime: faster on the hop primitives (semijoin 1.33x, antijoin 2.58x, unique 1.49x @10m) and far more STABLE -- the streaming executor spiked bimodally to ~1s on the same semijoin (median ~360ms), in-memory holds ~30ms. Fixes the GPU instability seen in the pr11 measurements. Parity preserved (polars-gpu == polars, 39 tests). gfql chains aren't GPU-compute-bound (orchestration + eager fast paths dominate) so this is a stability/correctness fix for GPU-collect paths, not a chain speedup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per the #1656 author's handoff: the elif-structured single-column text fallback in _apply_result_projection_pandas looks redundant but fixes two regressions (top-level OPTIONAL-MATCH miss; OPTIONAL-WITH-reentry no-match). Mark DO NOT REMOVE so a later 'tidy' doesn't reintroduce them. Our polars structured-returns reconciliation touched this file; verified the fallback is preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov added a commit that referenced this pull request on Jun 28, 2026:

… OTel span placement

Adversarial review of the lower stack layers (#1648 polars engine, #1652 generic fast paths) found 2 bugs:
- BLOCKER (#1648): a chain crashed under engine=polars with SchemaError when an edge endpoint dtype differed from the node-id dtype across int<->float (e.g. a null in a source/dest column -> float64 vs int64 ids) where pandas joins fine. The hop aligns join keys; the chain fast paths + combine did not. Added _align_edge_endpoints (cast endpoints to node-id dtype for the traversal, restore output dtype to match pandas; no-op when dtypes match) wired into the single-hop fast path + multi-hop.
- (#1652): the gfql.chain OTel @otel_traced decorator had landed on the internal _try_chain_fast_path probe (inserted between the decorator and def chain) instead of the public chain(), so chain() lost its span and the span recorded the wrong fn/attrs. Moved it.

Both verified + regression-tested. 457 polars tests + 334 generic chain tests pass (4 fails are the pre-existing local libnvrtc CUDA-env issue, not these changes).

Row-order divergences the review also found (fast path returns table order vs the full machinery BFS-discovery order, for reverse/undirected without ORDER BY: Cypher-undefined, sets/values identical, no repo test depends on it) are claim-precision, not bugs; CHANGELOG wording to be tightened in the per-layer #1648/#1652 review pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pandas drop_duplicates) RETURN ... UNION RETURN ... (distinct) crashed under engine='polars'/'polars-gpu' with AttributeError: 'DataFrame' object has no attribute 'drop_duplicates' — the union de-dup in gfql_unified._execute_compiled_query called pandas-only drop_duplicates on a polars frame. Added engine-aware Engine.df_unique (polars unique(maintain_order=True); pandas/cuDF drop_duplicates(keep='first')), matching the row/frame_ops.distinct convention, and routed the UNION DISTINCT through it. Surfaced by the cross-repo TCK conformance run (tck-gfql TEST_POLARS=1, union1). Regression-tested in test_engine_polars_cypher_conformance.py (4 UNION cases). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t to Boolean)
null AND null / null OR null / NOT null crashed under engine='polars' with
InvalidOperationError ('bitand'/'not' not supported for dtype null): a bare null
literal lowers to a Null-dtype polars expr where &/|/~ are undefined. Cast AND/OR/
NOT operands to pl.Boolean in the expr lowering so Cypher Kleene 3-valued logic
evaluates (true AND null=null, false OR null=null, NOT null=null); casting a real
Boolean column is a no-op, and polars Boolean &/|/~ already match Cypher Kleene.
Surfaced by the TCK run (expr-boolean1/2/4). Regression-tested in
test_engine_polars_cypher_conformance.py (bare-RETURN null boolean cases).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
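The Kleene three-valued truth tables this commit relies on can be written out in plain Python, with `None` standing in for Cypher `null`. This is a reference sketch of the logic only; the function names are illustrative and it does not show how the Polars lowering is implemented.

```python
def kleene_and(a, b):
    # False dominates: a known False decides the result even next to null.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def kleene_or(a, b):
    # True dominates: a known True decides the result even next to null.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def kleene_not(a):
    # Negating an unknown value is still unknown.
    return None if a is None else not a
```

The three cases named in the commit message correspond to `kleene_and(True, None)`, `kleene_or(False, None)`, and `kleene_not(None)`, each of which evaluates to `None`.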
…ot ArrowInvalid A pandas object column holding mixed Python types (e.g. int 0 + str 'xx' — legal for dynamically-typed Cypher properties) is unrepresentable in polars/Arrow: pl.from_pandas raised a cryptic 'pyarrow.lib.ArrowInvalid: Could not convert xx with type str: tried to convert to int64' from deep inside construction. Wrap the pandas->polars conversion in Engine.df_to_engine (_pl_from_pandas) to raise a clear NotImplementedError naming the offending column(s) and pointing at engine='pandas' (NO-CHEATING: no silent string-coercion, which would change comparison semantics). Surfaced by the TCK run (expr-comparison2, match-where5, with-where5). The harness tolerates honest NIE as a coverage decline; before this they crashed as failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…projection
A property column holding Cypher temporal-constructor text (date({year:1910,...}),
how Cypher/TCK store temporal values) leaked the raw constructor string under
engine='polars' instead of the ISO form ('1910-05-06') the pandas projection
produces via _normalize_temporal_constructor_series. That normalizer is not yet
native, so both projection paths (engine_polars.projection final result projection
+ row_pipeline.select_polars WITH/RETURN) now detect temporal-constructor String
columns (reusing TEMPORAL_CALL_EXPR_RE, native .str.contains scan over String cols
only) and raise NotImplementedError rather than emit a wrong rendering.
Surfaced by the TCK run (with-orderby1-33+, the largest wrong-answer cluster, ~33).
Whole-entity RETURN a over a temporal property is unaffected (flattens + renders via
render_entity_text). Regression-tested in test_engine_polars_cypher_conformance.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… +/-)
a.time + duration({minutes: 6}) silently became STRING CONCATENATION under
engine='polars': cypher duration({...}) translates to an ISO duration string
literal ('PT6M'), and the expr lowering applied + to two strings, so an ORDER BY
sorted lexicographically on the concatenated text (wrong order). The lowering now
raises NotImplementedError when +/- has an ISO-duration string-literal operand
(^-?P(?=[0-9T]), which doesn't misfire on ordinary strings like 'Prefix'); the
pandas engine handles temporal arithmetic.
Surfaced by the TCK run (with-orderby2 cluster, silent wrong-order). Regression-
tested in test_engine_polars_cypher_conformance.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
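The duration guard hinges on the regex quoted in the commit message, `^-?P(?=[0-9T])`. The sketch below shows how that pattern separates ISO-duration literals from ordinary strings; `looks_like_iso_duration` and `lower_add_sketch` are hypothetical helpers written for this illustration, not functions from the engine.

```python
import re

# Pattern quoted in the commit message: optional sign, 'P', then a digit or 'T'.
ISO_DURATION_PREFIX = re.compile(r"^-?P(?=[0-9T])")


def looks_like_iso_duration(literal: str) -> bool:
    return ISO_DURATION_PREFIX.search(literal) is not None


def lower_add_sketch(left, right):
    """Decline `+` when either operand is an ISO-duration string literal."""
    for operand in (left, right):
        if isinstance(operand, str) and looks_like_iso_duration(operand):
            raise NotImplementedError("temporal arithmetic: use engine='pandas'")
    # Ordinary operands fall through to normal addition / concatenation.
    return left + right
```

The lookahead is what keeps a word such as 'Prefix' from matching: the character after `P` must be a digit or `T`.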
…n native lowering

filter_by_dict on engine='polars' evaluated any non-natively-lowerable predicate by converting the column to pandas (.to_pandas()), running the pandas callable, and carrying the mask back: a silent polars->pandas bridge presenting pandas semantics as polars. Removed it: unsupported predicates now raise NotImplementedError (use engine='pandas'). To keep common queries native, widened predicate_to_expr:
- AllOf (conjunction, e.g. n.val > 20 AND n.val < 90 -> AllOf[GT,LT]) lowered recursively
- IsNull/IsNA -> is_null(), NotNull/NotNA -> is_not_null()
- case-insensitive STARTS WITH / ENDS WITH via anchored (?i) regex on re.escape'd literal

Surfaced from the source-mined optimization review (pygraphistry4 opportunity #6, a flagged NO-CHEATING violation in the shipping polars lane). The old fallback test (which asserted the bridge worked) now asserts the honest NIE. TCK: no wrong-answer regression; ~39 scenarios that silently passed via the bridge now honestly decline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
….iloc crash)

A bounded MATCH ... WITH <scalar> ... MATCH query crashed under engine='polars' with AttributeError: 'DataFrame' object has no attribute 'iloc'. The engine-agnostic re-entry broadcast (cypher/reentry/execution.py) used pandas .iloc / .assign / .drop(columns=) on a polars frame. Added engine-aware helpers (polars row(i, named=True) + with_columns(pl.lit(...)) / drop / head(0)) for the scalar-row extraction + constant-column broadcast. Re-entry now completes; a downstream RETURN the polars engine can't yet render raises honest NotImplementedError, not a crash.

Surfaced by the TCK run (with2-1, with4-2, expr-typeconversion2/3/4-*).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ot validation error)
A multi-clause OPTIONAL MATCH needing null-row fill (some seed rows unmatched)
raised GFQLValidationError ('unsupported-cypher-query ... null-row alignment could
not recover matched seed identities') under engine='polars' — the null-fill
alignment (matched-id meta, .iloc row slicing, per-segment concat) is pandas-centric
and the polars OPTIONAL MATCH doesn't populate the _cypher_entity_projection_meta
['ids'] it needs. Guarded the polars path to raise NotImplementedError instead (the
honest 'not native yet' signal the TCK harness tolerates) — pandas runs these fine.
Surfaced by the TCK run (match7-7, expr-graph4-4).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
OPTIONAL MATCH (n) RETURN n with no match rendered the absent whole entity as '()' under engine='polars' instead of null — the native entity-text expr didn't nullify absent rows (whose alias marker column is null). Now wraps the rendered text with pl.when(col(alias).is_null()).then(None) (mirrors pandas _nullify_missing_alias_rows); a real property-less node still renders '()'. Surfaced by the TCK run (match7-1). Regression-tested. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (not == cast crash) A label match MATCH (n:Label) targeting the reserved 'labels' List column (a label with no one-hot label__X column: typed-schema unknown labels, OPTIONAL MATCH to a non-existent label) crashed under engine='polars' with InvalidOperationError: cannot cast List type to String — filter_by_dict_polars lowered it to a scalar == that tried to cast the List to String. Now uses pl.col(c).list.contains(val) for List-dtype columns: correct Cypher label-membership (Label in n.labels), empty for a non-existent label (matching pandas). Surfaced by the TCK run (match7-28, firstparty-typed-schema1-3). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NaN: comparisons over a NaN computed inside polars (0.0/0.0 > 1) used polars' semantics (NaN = largest value, NaN>1 True), but IEEE/Python/pandas/Neo4j-Cypher compare any NaN false (!= true). The expr lowering now masks float comparisons to the IEEE answer (& ~is_nan for < > <= >= =, | is_nan for <> !=), gated by conservative float-operand inference (via a free schema contextvar) so int/string/bool comparisons are untouched and is_nan() never hits a non-float expr. Input NaN is already nan_to_null'd by pl.from_pandas, so this only affects in-query float math.

Numeric-vs-string: comparing a number to a string (n.val > 'a', 0.0/0.0 > 'a') crashed with ComputeError: cannot compare string with numeric type. Detect the mismatch in both the expression path (lower_expr) and the folded filter-predicate path (filter_by_dict_polars) and raise honest NotImplementedError, not a crash.

Surfaced by the TCK run (expr-comparison2-5-*, the 4-scenario NaN cluster). Regression-tested in test_engine_polars_cypher_conformance.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
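The NaN masking rule can be illustrated without Polars. In this sketch, `nan_largest_gt` and `nan_largest_ne` are hypothetical stand-ins for a comparison under which NaN sorts above every other value and equals itself (the behavior the commit attributes to Polars); the `ieee_*` wrappers then apply the `& ~is_nan` and `| is_nan` masks from the commit message.

```python
import math


def nan_largest_gt(a: float, b: float) -> bool:
    """`>` under a total order where NaN is the largest value."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        return a_nan and not b_nan
    return a > b


def nan_largest_ne(a: float, b: float) -> bool:
    """`<>` under a total order where NaN equals NaN."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan and b_nan:
        return False
    return a_nan or b_nan or a != b


def ieee_gt(a: float, b: float) -> bool:
    # Ordering mask: False whenever either operand is NaN.
    return nan_largest_gt(a, b) and not (math.isnan(a) or math.isnan(b))


def ieee_ne(a: float, b: float) -> bool:
    # Inequality mask: True whenever either operand is NaN.
    return nan_largest_ne(a, b) or math.isnan(a) or math.isnan(b)
```

After masking, the results agree with Python's own float comparisons, which follow IEEE 754: any ordering comparison involving NaN is false and `!=` is true.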
…aphic wrong-answer)
Comparing cypher temporal values (time({...}) > time({...}), date < date) gave a
WRONG answer under engine='polars': the cypher->gfql lowering renders the
constructors to ISO strings ('10:00+01:00'), and the polars engine compared them
LEXICOGRAPHICALLY — wrong across timezones/precision (pandas parses them temporally).
The lowering now detects an ISO date/datetime/time string-literal operand in a
comparison (specific regex; requires seconds-or-tz on bare times so ordinary '10:00'
strings don't match) and raises honest NotImplementedError. Native temporal-typed
comparison is the tracked proper fix.
With this, the native polars engine has ZERO wrong-answers across the full Cypher
TCK (3834 passed / 0 failed / 388 honest declines) — every scenario matches pandas
or honestly declines. Surfaced by the TCK run (expr-temporal7).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
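A standard-library example shows why comparing the rendered ISO strings lexicographically gives a wrong answer across timezones. The two literals are chosen for illustration; `datetime.time.fromisoformat` plays the role of a temporal parse here and is not what either engine uses internally.

```python
from datetime import time


def lexicographic_gt(x: str, y: str) -> bool:
    # Plain string comparison, character by character.
    return x > y


def temporal_gt(x: str, y: str) -> bool:
    # Offset-aware comparison: both times are compared after
    # subtracting their UTC offsets.
    return time.fromisoformat(x) > time.fromisoformat(y)


earlier = "10:00+01:00"  # 09:00 UTC
later = "09:30+00:00"    # 09:30 UTC
```

Here `lexicographic_gt(earlier, later)` is true because `'1' > '0'`, while `temporal_gt(earlier, later)` is false because 09:00 UTC precedes 09:30 UTC. Declining the ordering comparison avoids returning the first answer.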
…oral guards

Multi-wave adversarial review of the session conformance fixes found 3 BLOCKERs (silent wrong-answers/panics, all NO-CHEATING violations) + IMPORTANTs:
- BLOCKER: NaN guard missed int/int->Float division and function results (abs/coalesce) -> polars NaN-as-largest leaked as wrong answers. Now drive the NaN + cross-type guards from the lowered expr's OUTPUT dtype (_expr_output_dtype, schema-only) instead of AST type inference, which robustly catches division/functions. Replaces the three _infer_is_* helpers (DRY).
- BLOCKER: list.contains was applied to ANY List column, so a user List property (n.tags = scalar) returned membership (wrong) vs pandas equality. Gated to the reserved labels column; other List columns decline honestly.
- BLOCKER: numeric-vs-string nested in AllOf (x>20 AND x<z) or Between bypassed the cross-type guard and PANICKED (uncatchable Rust). _is_cross_type_predicate now recurses AllOf/Between.
- IMPORTANT: Categorical/Enum columns now treated as string-like in both cross-type guards (categorical-vs-numeric was a raw ComputeError).
- IMPORTANT: all-null columns (typed String by from_pandas) crashed on arithmetic (n.val + 1); cross-type guard now covers arithmetic ops, not just comparison.
- IMPORTANT: ISO-temporal comparison guard narrowed to ORDERING of two temporal literals (was declining valid string-column-vs-date-literal compares; = and <> are lexicographically correct so not declined).
- Anchored the temporal-constructor scan regex (no false-positive on update(...)).
- Added the missing CHANGELOG entry (OPTIONAL MATCH null-fill decline) + streaming comment clarity.

Full TCK still 3834 passed / 0 wrong-answers / 387 honest declines. 457 polars tests pass. 6 new adversarial regression tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ype + DRY/docs
Multi-dimension adversarial review (correctness/robustness/quality/docs) of the
native polars engine found three reachable pandas-oracle divergences, now fixed
(NO-CHEATING — match pandas or decline honestly):
- BLOCKER: duplicate alias [n('a'), e(), n('a')] returned a malformed colliding-
join schema (a/a_right) instead of raising; now raises GFQLValidationError E201
like pandas (node/edge aliases scoped separately, mirroring combine_steps).
- BLOCKER: integer-literal division 5/2 lowered to polars true division (2.5) but
Cypher folds to int division (2) — silent wrong order when embedded non-
monotonically (ORDER BY n.val % (10/4)); now declines (NIE). Column / int (Float
on both) unaffected.
- IMPORTANT: internal start_nodes seed with a divergent id dtype (empty crossfilter
-> float64 vs int64 node ids) crashed the combine join (SchemaError); now aligns
the seed key (_align_seed_dtype), mirroring the hop + edge-endpoint alignment.
Quality/docs:
- Removed stale 'pandas bridge' docstrings/comments (row_pipeline, projection) —
the bridge was removed in the de-cheat commit; the code raises NIE.
- DRY: consolidated the cross-type/NaN dtype classifiers (numeric/int/float/
stringlike), duplicated 4x, into engine_polars/dtypes.py (the guard contract).
- Aligned the lazy hop allowed_source guard textually with the eager hop (no-op:
to_fixed_point is NIE'd upstream) to stop future eager/lazy drift.
- Removed two dead imports in chain.py (hop_polars, Engine).
- Documented a narrow filter_by_dict genuine-NaN residual (unreachable on the
from_pandas ingestion path that null's NaN).
+3 regression tests. 457 polars+chain tests pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
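The integer-literal division case from this commit can be reproduced with plain Python arithmetic. The sketch below orders the same values by `val % (10/4)` under integer division and under true division; the sample values are made up for illustration, and `//` stands in for Cypher's integer division (the two agree for positive operands).

```python
def order_by_mod(values, divisor):
    # sorted() is stable, so ties keep their input order.
    return sorted(values, key=lambda v: v % divisor)


vals = [1, 2, 3, 4, 5]

# Integer division: 10 / 4 folds to 2, so keys are 1, 0, 1, 0, 1.
int_order = order_by_mod(vals, 10 // 4)

# True division: 10 / 4 is 2.5, so keys are 1.0, 2.0, 0.5, 1.5, 0.0.
float_order = order_by_mod(vals, 10 / 4)
```

The two orders differ (`[2, 4, 1, 3, 5]` versus `[5, 3, 1, 4, 2]`), and neither run raises an error. That is the kind of silent divergence the decline is meant to prevent.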
Summary
Native CPU Polars execution engine for GFQL (`Engine.POLARS`, opt-in via `engine='polars'`). This combines the former two-PR split (hop/chain traversals + cypher row pipeline) into one cohesive CPU-engine PR. The production pandas/cuDF paths are untouched; `engine='auto'` with Polars input still coerces to pandas as before.

Traversals: `hop()` / `chain()`
Native vectorized BFS via semi/anti joins (no per-row Python). Forward/reverse/undirected single-hop, directed multi-hop chains, node/edge filter dicts and predicates (lowered to `pl.Expr`), `edge_match` / `source_node_match` / `destination_node_match`, `target_wave_front`, alias names. Deferred (honest `NotImplementedError`): variable-length/multi-hop edges, undirected edges in multi-edge chains, hop labels, node `query=`.

Cypher row pipeline: `MATCH … RETURN`
NO CHEATING: every query runs natively on Polars or raises an honest `NotImplementedError` pointing at `engine='pandas'`, never a silent pandas bridge. Native: frame ops (rows/limit/skip/distinct/drop_cols), select/with_/return_ projection (cypher-expr-AST → `pl.Expr`: property/arithmetic/comparison/boolean/literal + `coalesce`/`abs`), `where_rows` (OR/NOT WHERE, Kleene 3-valued), order_by, group_by (count/sum/avg/min/max), unwind (literal cross-join), property/expr result projection, int/string/bool entity-text (`pl.concat_str`). Honestly deferred → NIE: cross-entity same-path WHERE, multi-entity binding_ops, float/temporal/nested entity-text, exotic exprs.

Conformance hardening
Driven by the cross-repo Cypher TCK differential (pandas-vs-polars). Every fix either matches pandas natively or declines honestly (NO-CHEATING: no silent bridge, no wrong answer):
- NaN comparisons (`0.0/0.0 > 1` etc.): Polars treats NaN as the largest value; now masked to the IEEE/pandas answer (`& ~is_nan` for ordering/`=`, `| is_nan` for `<>`), driven by the lowered expression's output dtype (robustly covers int/int→float division + function results).
- Numeric-vs-string comparison (`n.val > 'a'`): would Rust-panic; now NIE (recurses into `AllOf`/`Between`, covers Categorical/Enum + arithmetic on all-null→String columns).
- Temporal comparison/arithmetic (`time({...}) > time({...})`, `a.time + duration({...})`): were lexicographic/string-concat wrong answers; now NIE (narrowed to ordering of two temporal literals; `=`/`<>` stay native).
- Mixed-type object columns: a clear NIE naming the offending column(s) instead of a cryptic `ArrowInvalid`.
- `UNION DISTINCT`: engine-aware `Engine.df_unique` (Polars `unique(maintain_order=True)`) instead of the pandas-only `drop_duplicates` crash.
- `OPTIONAL MATCH`: absent whole-entity renders `null` (not `'()'`); null-row-fill alignment shape declines honestly (NIE) instead of a misleading validation error.
- Bare-null boolean (`null AND null`, `NOT null`): AND/OR/NOT cast to `pl.Boolean` so Kleene logic evaluates instead of raising.
- `labels` List column: `list.contains` (membership) instead of a List→String cast crash; user List properties decline honestly.
- Unsupported `filter_by_dict` predicate now raises NIE (was silently `.to_pandas()`-bridging); native lowering widened (`AllOf`, `IsNull`/`NotNull`, case-insensitive `STARTS`/`ENDS WITH`).
- `WITH`-scalar `MATCH` re-entry: engine-aware (`pl.row`/`with_columns`) instead of a pandas `.iloc` crash.

`POLARS_ENGINES = (Engine.POLARS,)` is introduced here (the GPU target PR #1655 extends it) so the engine-aware helpers are self-contained at this layer.

Validation
Differential parity vs the pandas engine (hop + chain suites + seeded fuzzer + a TCK-style cypher conformance lane with NULL/3-valued-logic + a `DEFERRED` list asserting deferred queries raise rather than bridge). Full `graphistry/tests/compute/gfql/` suite green (incl. the 1610-test cypher dir).

Cross-repo Cypher TCK, polars arm: the differential pandas-vs-polars lane is clean, with 0 wrong-answers across the full TCK (every scenario either matches pandas or honestly declines; 387 honest declines). This is the headline correctness guarantee: the Polars engine never silently disagrees with pandas.

Perf (interleaved, 1M nodes, each engine on its native-frame graph, all native)
Polars wins 5.6–38× across the surface: `RETURN n` ~38×, `ORDER BY` ~17×, traversals 6–7.5×, projections/aggregations/`DISTINCT` 5.6–6.9×. Plus eager fast paths (node-only / single-hop / unconstrained 1-hop) that move the polars>pandas crossover below ~100K for real viz/crossfilter shapes.

(Supersedes the former PR2 #1649, folded in here.)
🤖 Generated with Claude Code