feat(gfql #1650): structured flattened whole-entity returns#1656
Open
lmeyerov wants to merge 25 commits into
Open
feat(gfql #1650): structured flattened whole-entity returns#1656lmeyerov wants to merge 25 commits into
lmeyerov wants to merge 25 commits into
Conversation
…eat queries) parse_cypher's lark parse+transform is the dominant per-call cost of small cypher queries — ~50% of a cypher call at 100k rows (PR7b layer profiling). It is a pure function of the query text returning an immutable frozen AST, so memoize it (LRU 512): repeated identical queries skip the parse, ~1.3–1.7x faster end-to-end at small/interactive sizes across pandas/polars/cudf. Safe: every cypher AST node is @DataClass(frozen=True) and compile_cypher_query does not mutate the parsed tree (verified); the non-str/empty validation guard stays outside the cache so error behavior is unchanged and errors are not cached. - split validating guard (parse_cypher) from cached body (_parse_cypher_cached) - full gfql suite green on dgx: 2875 passed, 0 failed; +2 cache regression tests NOTE: independent of the polars engine (base cypher code) — cherry-pickable to its own PR off master. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ngification (#1650) `order_detect_temporal_mode` ran `astype(str)` + up to 6 regex `fullmatch` passes on BOTH operands of every comparison, including numeric/bool columns that can never hold temporal *text*. This fired ~160k spurious `re.fullmatch` on a 10k-row `where_rows(val > 50)` whose output is a plain table — pure waste. Gate: early-return None when `dtype.kind in {i,u,f,b,c}`. Object/string/datetime columns are unaffected, so temporal-text detection is preserved. Byte-identical output. Measured `where_rows` speedups: - pandas @10k: 48.9 -> 15.6 ms (3.1x) - cuDF @10k: 37.4 -> 8.5 ms (4.4x) - cuDF @100k: 191.0 -> 14.4 ms (13.3x, 1M edges) Tests: pandas 91 + cuDF 83 temporal/ordering/row suites pass; adds unit coverage for the gate (numeric/bool skip, object temporal still detected) and an end-to-end numeric `where_rows` filter assertion. Scope: this is the spurious-path half of #1650. Whole-entity `RETURN a` text rendering (structured/Arrow returns) is tracked separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on scans Follows #1650/#1651 (which gated temporal-text detection). Profiling numeric RETURN/ORDER BY queries @100k showed ~40–48% of the call was STILL spurious `astype(str)` + regex on int/float/bool columns — not in temporal detection (already gated) but in the LIST-detection + projection scans: - order_detect_list_series / order_detect_stringified_list_series (ordering.py) - RowPipelineMixin._gfql_series_is_list_like (pipeline.py) - _project_property_column temporal scan (result_postprocess.py) All run `series.astype(str)` to test for list/temporal *text* on every column, including numeric/bool/complex columns that can never hold it. Gate them with a shared `_is_non_listable_dtype` (kind in i,u,f,b,c) → early return False/series. Byte-identical (the scans return False/None for these dtypes anyway); object/ string/datetime columns are untouched, preserving list/temporal detection. Result (where the projection/order path runs over numeric cols): ORDER BY and arithmetic-projection drop their ~50–100ms spurious string render to ~0. + tests: numeric/bool dtype pass-through; behavioral numeric ORDER BY (would fail if values were lexically stringified). Base cypher/row code -> PR0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a27ddf4 to
f5f325a
Compare
…chain isin() is set-membership, so isin(s) == isin(s.unique()) — the explicit .unique()/dedup pass is redundant. Removed at _filter_edges_by_endpoint, the undirected combine masks, and the per-hop wavefront filter. Each removed .unique() is one fewer kernel launch on GPU (where launch latency, not compute, dominates small/mid traversals). Byte-identical: verified across 345 adversarial graph×query cases (dup/parallel edges, self-loops, isolated/dead-end nodes, cycles, undirected, multi-hop, fixed-point, min/max hops, names, filters, seeds) + the hop/chain test suites. cuDF 1-hop chain ~126→103 ms @1m (~18%). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
f5f325a to
d6903ae
Compare
A single MATCH (n) (no edge hop) — the dominant tabular/crossfilter shape (node filter, histogram, table search) — returns the filtered node table directly + empty edges, skipping the generic forward/backward/combine BFS. Helps pandas (default engine) + cuDF. Byte-identical (345-case adversarial golden + hop/chain suites, 240 tests). ~100x faster pandas node filter at 10M (204->2ms), cuDF stays on the resident frame (~0.8ms @1m). The 1-hop shape is NOT generic-fast-pathed: the full machinery's node merge upcasts int->float (pandas artifact), so a join-free generic 1-hop would change dtypes; the polars engine has its own dtype-consistent 1-hop fast path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d6903ae to
e946275
Compare
Extend the generic node-only fast path to the simple 1-hop MATCH (a {f})-[e]->(b)
(the basic graph query + "filter then expand" crossfilter): return the edges
whose endpoints pass the node filters + those endpoint nodes, skipping the
forward/backward/combine BFS. Gated to pandas/cuDF (dask/spark keep the full
path; polars has its own). Same values + node/edge sets (345-case adversarial
golden = dtype-only diffs, values identical; no test regressions vs pristine).
Behavior change (intentional, more correct): the 1-hop now PRESERVES node-attr
dtypes (int stays int) instead of the full machinery's spurious int->float merge
upcast — making pandas/cuDF consistent with the polars engine (which already
preserved int). cuDF basic graph query now ~2 semi-joins on the resident frame
instead of the BFS + ~31 drop_duplicates, capturing the GPU semijoin win.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
e946275 to
12db9a3
Compare
The generic _try_chain_fast_path short-circuited chain() before the policy hook dispatch, so a query that hits the fast path (node-only MATCH, single-hop) never fired prechain/postchain (and skipped per-op policy inspection) — observable behavior change vs the pre-fast-path engine. Gate the fast path on no-policy; policy-bearing queries take the full path. 64 policy tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
12db9a3 to
eca4217
Compare
lmeyerov
added a commit
that referenced
this pull request
Jun 28, 2026
…tructured returns Squashed reconciliation of the native lazy Polars GFQL engine (was #1648's 28 commits; full history preserved at tag bak/1648) restacked onto the colleague's #1656 structured whole-entity returns + #1657 parse_expr memo. Engine: native polars hop/chain (semi/anti joins), native cypher row pipeline (select/where/order_by/group_by/unwind/projection), lazy single-hop collect-once with CPU/GPU execution targets (gfql/lazy/). NO pandas bridge — native or honest NotImplementedError (plan.md NO-CHEATING). Reconciliation with #1650 structured returns: apply_result_projection now threads `structured` to the polars path (apply_result_projection_polars). Whole-entity RETURN a flattens to {alias}.{field} columns natively (mirrors the pandas _flat_entity_field_names selection exactly), which — unlike the legacy entity-text expr — works for ANY dtype (float/temporal/nested just become columns), so polars structured == pandas structured across the board. structured=False still renders the native Cypher display string for int/string/bool single-entity nodes. _include_numeric_id_as_property is now polars-aware so id flattens identically. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov
added a commit
that referenced
this pull request
Jun 28, 2026
Per the #1656 author's handoff: the elif-structured single-column text fallback in _apply_result_projection_pandas looks redundant but fixes two regressions (top-level OPTIONAL-MATCH miss; OPTIONAL-WITH-reentry no-match). Mark DO NOT REMOVE so a later 'tidy' doesn't reintroduce them. Our polars structured-returns reconciliation touched this file; verified the fallback is preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c40b69c to
7d77ce7
Compare
lmeyerov
added a commit
that referenced
this pull request
Jun 28, 2026
…tructured returns Squashed reconciliation of the native lazy Polars GFQL engine (was #1648's 28 commits; full history preserved at tag bak/1648) restacked onto the colleague's #1656 structured whole-entity returns + #1657 parse_expr memo. Engine: native polars hop/chain (semi/anti joins), native cypher row pipeline (select/where/order_by/group_by/unwind/projection), lazy single-hop collect-once with CPU/GPU execution targets (gfql/lazy/). NO pandas bridge — native or honest NotImplementedError (plan.md NO-CHEATING). Reconciliation with #1650 structured returns: apply_result_projection now threads `structured` to the polars path (apply_result_projection_polars). Whole-entity RETURN a flattens to {alias}.{field} columns natively (mirrors the pandas _flat_entity_field_names selection exactly), which — unlike the legacy entity-text expr — works for ANY dtype (float/temporal/nested just become columns), so polars structured == pandas structured across the board. structured=False still renders the native Cypher display string for int/string/bool single-entity nodes. _include_numeric_id_as_property is now polars-aware so id flattens identically. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov
added a commit
that referenced
this pull request
Jun 28, 2026
Per the #1656 author's handoff: the elif-structured single-column text fallback in _apply_result_projection_pandas looks redundant but fixes two regressions (top-level OPTIONAL-MATCH miss; OPTIONAL-WITH-reentry no-match). Mark DO NOT REMOVE so a later 'tidy' doesn't reintroduce them. Our polars structured-returns reconciliation touched this file; verified the fallback is preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
7d77ce7 to
bd75bf9
Compare
lmeyerov
added a commit
that referenced
this pull request
Jun 28, 2026
…tructured returns Squashed reconciliation of the native lazy Polars GFQL engine (was #1648's 28 commits; full history preserved at tag bak/1648) restacked onto the colleague's #1656 structured whole-entity returns + #1657 parse_expr memo. Engine: native polars hop/chain (semi/anti joins), native cypher row pipeline (select/where/order_by/group_by/unwind/projection), lazy single-hop collect-once with CPU/GPU execution targets (gfql/lazy/). NO pandas bridge — native or honest NotImplementedError (plan.md NO-CHEATING). Reconciliation with #1650 structured returns: apply_result_projection now threads `structured` to the polars path (apply_result_projection_polars). Whole-entity RETURN a flattens to {alias}.{field} columns natively (mirrors the pandas _flat_entity_field_names selection exactly), which — unlike the legacy entity-text expr — works for ANY dtype (float/temporal/nested just become columns), so polars structured == pandas structured across the board. structured=False still renders the native Cypher display string for int/string/bool single-entity nodes. _include_numeric_id_as_property is now polars-aware so id flattens identically. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov
added a commit
that referenced
this pull request
Jun 28, 2026
Per the #1656 author's handoff: the elif-structured single-column text fallback in _apply_result_projection_pandas looks redundant but fixes two regressions (top-level OPTIONAL-MATCH miss; OPTIONAL-WITH-reentry no-match). Mark DO NOT REMOVE so a later 'tidy' doesn't reintroduce them. Our polars structured-returns reconciliation touched this file; verified the fallback is preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bd75bf9 to
a83a59b
Compare
lmeyerov
added a commit
that referenced
this pull request
Jun 28, 2026
…tructured returns Squashed reconciliation of the native lazy Polars GFQL engine (was #1648's 28 commits; full history preserved at tag bak/1648) restacked onto the colleague's #1656 structured whole-entity returns + #1657 parse_expr memo. Engine: native polars hop/chain (semi/anti joins), native cypher row pipeline (select/where/order_by/group_by/unwind/projection), lazy single-hop collect-once with CPU/GPU execution targets (gfql/lazy/). NO pandas bridge — native or honest NotImplementedError (plan.md NO-CHEATING). Reconciliation with #1650 structured returns: apply_result_projection now threads `structured` to the polars path (apply_result_projection_polars). Whole-entity RETURN a flattens to {alias}.{field} columns natively (mirrors the pandas _flat_entity_field_names selection exactly), which — unlike the legacy entity-text expr — works for ANY dtype (float/temporal/nested just become columns), so polars structured == pandas structured across the board. structured=False still renders the native Cypher display string for int/string/bool single-entity nodes. _include_numeric_id_as_property is now polars-aware so id flattens identically. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lmeyerov
added a commit
that referenced
this pull request
Jun 28, 2026
Per the #1656 author's handoff: the elif-structured single-column text fallback in _apply_result_projection_pandas looks redundant but fixes two regressions (top-level OPTIONAL-MATCH miss; OPTIONAL-WITH-reentry no-match). Mark DO NOT REMOVE so a later 'tidy' doesn't reintroduce them. Our polars structured-returns reconciliation touched this file; verified the fallback is preserved. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The generic single-hop fast path accepted edges with `prune_to_endpoints=True` (a public e()/e_forward()/e_reverse() kwarg) but returned BOTH endpoints, while the flag keeps only the arrival side (dest for forward, src for reverse). So `[n(), e_forward(prune_to_endpoints=True), n()]` silently returned the wrong node/edge set (verified: path 0→1→2→3 gave [0,1,2,3] vs correct [1,2,3]). Add the flag to the fast-path edge gate so it declines and falls through to the full path, which honors it. `is_simple_single_hop()` is intentionally left unchanged (its other callers depend on the current semantics) — the gate is the correct, surgical layer. Also document the deep-immutability invariant on the Cypher AST nodes: parse_cypher results are shared by reference via lru_cache, so a future mutable field would poison cache hits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a83a59b to
231526a
Compare
Code-quality pass on the opt-base fast path: - Type the contract boundaries: _filter_edges_by_endpoint(edges_df: DataFrameT, nodes_df: Optional[DataFrameT], ...) -> DataFrameT; _try_chain_fast_path(g_in: Plottable, ops: List[ASTObject], engine_concrete: Engine, start_nodes: Optional[DataFrameT]). - Replace getattr(n0, "query", None) duck-typing with n0.query (ASTNode declares it; the isinstance(n0, ASTNode) guard narrows the type). - Convert Engine -> EngineAbstract once (engine_abs) for the filter_by_dict calls (its param is EngineAbstract|str), matching the existing materialize conversion. - Trim verbose comments to terse one/two-liners (fast-path docstring, .unique removal, prune gate, policy gate, temporal dtype gate, AST immutability invariant). No behavior change; mypy + ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Complements the Cypher whole-query parse memo in PR #1652. Once parse_cypher is cached, profiling the residual fixed per-call compile cost showed it is dominated by parse_expr() — the GFQL row-expression parser, invoked ~4x per query compile (each RETURN/WHERE/WITH expression) and, per call, rebuilding a Lark transformer that defines a frozen dataclass via runtime exec (dataclasses._process_class / _create_fn / exec churn). parse_expr() is a pure function of the expression string (no params/schema) and returns a tree of frozen dataclasses (17 frozen node types, tuple-valued fields, immutable). So identical expressions — re-parsed on every compile, and recurring across queries (e.g. `a.val > 50`) — are memoized via lru_cache(maxsize=1024). Non-str/empty guard stays outside the cache; only successful parses are cached. No source consumer outside graphistry/compute/gfql calls parse_expr, and nothing bypasses frozen-ness (no object.__setattr__). Measured (dgx-spark, median-of-9): stacked on the query-parse memo, the fixed per-call cost of a repeated string Cypher query drops a further ~4.5 ms -> ~1.8 ms (RETURN a @100 rows: ~6.3 -> 3.6 ms) — near parity with the equivalent native chain. Touches only expr_parser.py (+ its test) — disjoint from PR #1652's parser.py, so the two apply independently in either order. Tests: 3 focused expr-cache tests (memoization identity, distinct + hit registration, invalid non-caching). Full graphistry/tests/compute/gfql/: 2512 passed, 16 skipped, 15 xfailed; only 2 unrelated in-container artifacts fail (networkx setup.py packaging; a cugraph "without cugraph" shortest-path test in an image that has cugraph). ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a regression test for the fix in 2755744 (fast path must not skip policy hooks): asserts that for the exact fast-path-eligible shapes (`[n()]` and `[n(),e(),n()]`) the prechain/postchain/postload hooks all fire when a policy is installed, and that the fast path is still taken when no policy is present. Without a guard this regression was silent (the shapes that hit the fast path are precisely the ones the policy suite exercises). Also documents the policy-gate in the fast-path CHANGELOG entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The chain fast path (_try_chain_fast_path) shipped with only incidental coverage (existing hop/chain/golden suites); its core behaviors were never pinned directly. Adds focused, engine-parametrized locks: - test_fast_path_differential_parity_vs_full_path: fast path output == full (policy-forced BFS) path output by node/edge SET, across every accelerated shape (node-only, node-filter, predicate, fwd/rev/undirected 1-hop, src/dst/both-end filters) AND every bypass shape (hops=2, filtered-undirected, edge_match, named node). Uses an installed no-op policy as the equivalence oracle (any policy forces the full path). - test_fast_path_preserves_int_node_dtypes: hard-locks the feature promise (1-hop fast path keeps int node attrs as int); tolerant on the full path. - test_fast_path_gating_returns_none_for_ineligible: unit-level accept/decline contract for the gate (seeds, non-eager engines, edge_match, named/queried nodes, hops>1, 2-op chains all decline). - _gfql_series_is_list_like dtype-gate: direct unit coverage (numeric/bool short-circuit; real list/tuple objects still detected; stringified lists intentionally deferred to order_detect_stringified_list_series). pandas lanes run everywhere; cuDF lanes gated on TEST_CUDF=1 (dgx-spark). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The new _try_chain_fast_path helper was inserted between the
@otel_traced("gfql.chain") decorator and def chain(), so the decorator
silently migrated onto the fast-path helper: chain() stopped emitting its
span entirely, and _chain_otel_attrs (which expects chain's signature) was
invoked with the fast path's positional args — binding `gfql.validate_schema`
to the start_nodes DataFrame. Only surfaces with OTEL enabled (attrs are
computed lazily), so CI stayed green. Move the decorator back onto chain().
Adds test_chain_otel_span_attrs_mapped_correctly: enables otel via monkeypatch,
captures the span, and asserts chain_len + that validate_schema is a bool
(fails under the misplaced decorator, passes with it on chain()).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The #1650/#1651 dtype gate (numeric/bool/complex columns skip the spurious astype(str)+regex list/temporal scan) was duplicated across 4 sites with 3 different sentinel conventions ("O", None, factored). Consolidate the kind set into is_non_textual_scalar_dtype() in series_str_compat.py (the engine-poly Series-string layer these gates short-circuit) and call it from ordering.py (x2), pipeline.py, and cypher/result_postprocess.py. Byte-identical; adds a direct parametrized unit test for the helper across dtype kinds + None. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 1-hop fast path took `edges = g._edges` wholesale and only intersected
filtered endpoints with the node table for the NODE output — it never
restricted the EDGES to those whose endpoints exist in the node table. The
full BFS path enforces that implicitly via its edge<->node joins, so the two
diverged whenever a supplied node table omits an edge endpoint (filtered/subset
node frames, or nodes+edges loaded separately):
nodes v=[0,1], edges (s,d)=[(0,1),(1,99)] # 99 absent
[n(), e_forward(hops=1), n()] fast kept (1,99); full dropped it
[n({'attr':2}), e_forward, n()] fast -> nodes[1] edges[(1,99)]; full -> EMPTY
Restrict edges to both-endpoints-present before the endpoint filters, and dedup
node ids on the result (the full path collapses dup rows via its merge). Now
fast == full across dangling-endpoint and duplicate-id graphs; no change on
well-formed graphs. Found by wave-2 differential review; tests added for both.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The wave-2 both-endpoints-present restriction used `.isin(node_ids)`, but pandas/cuDF `.isin` treat NaN as a matchable value (`NaN.isin([NaN])` is True). So a NaN node id "validated" a NaN edge endpoint and the fast path kept a dangling NaN edge that the full BFS path's joins (which never match NaN<->NaN) correctly drop. dropna() on node_ids before the isin restores convergence for forward/reverse on NaN-id graphs; well-formed and non-null-dangling graphs are unchanged. (NaN ids are malformed input; the full path's undirected branch is itself inconsistent there, so we match its directed/consistent behavior.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s, dup ids, NaN) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d; expr AST immutability note Hardening tests for the fast-path + dtype-gate optimizations: - prune_to_endpoints: add to the gating-decline list and as bypass shapes (fwd+rev) in the fast-vs-full differential — regression guard for the fast-path prune gate. - NaN endpoints: new test_fast_path_drops_nan_endpoint_edges pins that a NaN node id does not validate a NaN edge endpoint (the .dropna() guard), matching the full BFS path. - Tighten the differential test: bypass shapes now also assert they DECLINE the fast path (previously full-vs-full, vacuous for bypass cases). - Drift guard: assert filter_by_dict._is_numeric_dtype_safe agrees with the series_str_compat._NON_TEXTUAL_SCALAR_KINDS SSOT across all numpy dtype kinds (the two keep separate copies to avoid an import cycle; they must not desync). - cuDF lane for the numeric where_rows dtype-gate path (paired-coverage convention). Also document the deep-immutability invariant on ExprNode (parse_expr lru_cache shares results by reference). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…te (#1657) Code-quality pass on the parse-cache layer: - Condense the verbose dropna/dangling-edge and dup-node fast-path comments to terse one/two-liners; same for the SSOT dtype-kind note and the ExprNode immutability invariant. - Note in the SSOT comment that filter_by_dict's mirror is kept in sync by a test (not imported) to avoid an import cycle. No behavior change; mypy + ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Terminal Cypher `RETURN a` (whole node/edge) previously emitted one column of
Cypher display strings (`({id: 51, val: 51, kind: 'a'})`) built row-wise. The
string is a *presentation* format (it matches the cypher-shell / TCK oracle),
not data — callers had to re-parse it to use it, and constructing it is O(rows).
This flattens whole-entity returns into structured `{alias}.{field}` columns
(`a.id, a.val, a.kind`, ...) by default. The per-field columns already exist on
the working frame before projection, so this is "stop collapsing", not
"rebuild": near-free, lossless, directly usable, and it survives JSON / CSV /
Parquet / Arrow serialization and `plot()`.
Measured (dgx-spark, median-of-7, RETURN a vs old text form):
pandas @100k 32 vs 204 ms (6.4x); cuDF @100k 27 vs 114 ms (4.3x). Win grows
with row count (text render is O(rows); flat is ~free).
Design:
- `apply_result_projection(..., structured=True)` emits flat columns for
whole-entity returns; `structured=False` keeps the legacy single
Cypher-display-string column. The OPTIONAL-MATCH null-fill / projection
row-guard paths (which still consume a single-column entity value for row
alignment) opt out via this flag and are unchanged.
- A synthesized null/absent-entity row (top-level OPTIONAL-MATCH miss or
OPTIONAL WITH-reentry no-match, built by `_apply_empty_result_row` as a
single `{alias: None}` column) has no field columns to flatten, so it falls
back to the single-column text form — rendering to None and preserving the
shape the OPTIONAL / reentry machinery consumes for identity recovery and
no-match detection. Real rows always carry flat fields and flatten.
- Text is now presentation-only: `render_entity_text(result, alias)`
reconstructs the Cypher display string on demand (used by the conformance /
TCK driver and any caller wanting the human-readable form). The structured
data path never pays the render cost.
- The entity-projection meta `ids` snapshot (`.copy()`) is retained — bounded
reentry recovers carried node identities from it and must not alias the live
frame (#1356).
Tests: whole-entity text assertions migrated to a `entity_text_records` shim
that renders flat -> text for comparison against the pre-#1650 Cypher-text
oracle; grouping / connected-optional / null_fill paths (still single-column
text) keep direct text assertions; flat-shape + render-helper + meta tests
added. gfql/cypher + row suites: 1646 passed, 15 xfailed (only the unrelated
in-container networkx setup.py packaging artifact fails).
Cross-repo follow-ups (separate, after this lands): tck-gfql conformance
adapter (structured -> text at the comparison hook) and pyg-bench probes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rm (#1650) Structured whole-entity returns (#1650) changed terminal RETURN a from the display string '(:person)' to flattened a.* columns. Two test_gfql.py tests still asserted the old display-string form (missed when the behavior landed); update them to the flattened columns. The behavior itself is correct + documented in CHANGELOG. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…policy-hook guard test) Root-cause fix for a real bug surfaced while hardening the fast-path/policy work: A chain/Cypher query over an edges-only graph (no node-id binding, `g._node is None`) takes the full traversal path — the node-only/single-hop fast path is skipped whenever an edge-id index must be synthesized (`added_edge_index`), or when a policy is attached. That path rebuilt `g_out` from `self` (the *unbound* input) to restore the original edge binding, but `self._node is None`, so the materialized node-id binding was dropped. The endpoint-reconciliation concat then synthesized a spurious `None`-named node column — a corrupt result on older pandas, and a hard `NotImplementedError` (void-block NA fill) on newer pandas (Python 3.14). Fix: carry the materialized `g._node` binding explicitly (`self.nodes(final_nodes_df, g._node)`) while still restoring the original edge binding. Full-path output now matches the fast path: correct single node column, valid `_node`, edge binding preserved. Pre-existing on master (not introduced by this PR's fast paths). Also adds a parametrized regression test pinning that BOTH chain fast-path shapes (node-only `[n()]` and single-hop `[n(),e(),n()]`) still fire `postload` under a policy AND return a valid, non-corrupt result on an edges-only graph — guarding both the policy-hook gate (2755744) and this node-binding fix. CHANGELOG: documents both. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on; doc + tests + cuDF lane
- I1 (bug): `RETURN a, a.val` emitted a duplicate `a.val` column (the whole-entity flatten
shares the `{alias}.{field}` namespace with the explicit property projection). Duplicate
column names break selection and silently drop data on `to_dict`/serialization. De-dup the
output columns (identical data — dotted aliases are rejected), keeping first occurrence.
- I2 (boundary): document that a whole entity with no flattenable field (no id binding, no
props, no type — in practice only an edge with no edge-id binding) falls back to the single
Cypher-display-text column (value correct, e.g. `[]`); nodes always carry an id and flatten.
Pinned by a test; nodes-immune noted.
- Docs: new "Whole-Entity RETURN Output Shape" section in cypher.rst (flat columns,
render_entity_text helper, dedup + no-field boundary).
- Tests: dup-column + no-field regression tests; cuDF lane for the edges-only/policy fast-path
shapes test (chain.py is on the cuDF-pairing list).
- CHANGELOG: I1 Fixed entry; structured-returns entry notes the no-field boundary; edges-only
Fixed wording corrected (the fast path isn't gated on edge-id synthesis).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code-quality pass on the structured-returns layer: - render_entity_text reads result._nodes directly (typed Plottable attr) instead of getattr duck-typing. - Condense verbose comments to terse one/two-liners: structured/absent emission branch, the dedup rationale, the temporal dtype gate, the OPTIONAL opt-out note, and the edges-only node-binding rebuild. No behavior change; mypy + ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
231526a to
b8c169e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the core of #1650. Terminal Cypher
RETURN a(whole node/edge) previously emitted one column of Cypher display strings built row-wise (({id: 51, val: 51, kind: 'a'})). That string is a presentation format (it matches the cypher-shell / TCK oracle), not data — callers had to re-parse it to use it, and constructing it is O(rows).This flattens whole-entity returns into structured
{alias}.{field}columns (a.id, a.val, a.kind, ...) by default. The per-field columns already exist on the working frame before projection, so this is "stop collapsing", not "rebuild": near-free, lossless, directly usable, and it survives JSON / CSV / Parquet / Arrow serialization andplot().Performance (dgx-spark, median-of-7,
RETURN avs old text form)Win grows with row count (text render is O(rows); flat is ~free).
Design
apply_result_projection(..., structured=True)emits flat columns for whole-entity returns;structured=Falsekeeps the legacy single Cypher-display-string column. The OPTIONAL-MATCH null-fill / projection row-guard paths (which still consume a single-column entity value for row alignment) opt out via this flag and are unchanged.WITH-reentry no-match, built by_apply_empty_result_rowas a single{alias: None}column) has no field columns to flatten, so it falls back to the single-column text form — rendering toNoneand preserving the shape the OPTIONAL / reentry machinery consumes for identity recovery and no-match detection. Real rows always carry flat fields and flatten.render_entity_text(result, alias)reconstructs the Cypher display string on demand (used by the conformance/TCK driver and any caller wanting the human-readable form). The structured data path never pays the render cost.idssnapshot (.copy()) is retained — bounded reentry recovers carried node identities from it and must not alias the live frame ([BUG] Cypher reentry path mis-handles OPTIONAL prefix MATCH on no-match fixtures #1356).Behavior change
Callers that previously read the rendered Cypher display string from a terminal
RETURN acolumn now receive flatteneda.*columns. Documented under[Development] › Changedin CHANGELOG. No programmatic consumer of the display string was found (graphistry server / louie serialize to flat JSON/CSV/Parquet).Tests
entity_text_recordsshim that renders flat → text for comparison against the pre-GFQL: avoid spurious entity-text stringification of returned entities (return structured/Arrow frames) #1650 Cypher-text oracle.graphistry/tests/compute/gfql/on dgx-spark (cuDF 25.12 container): 2509 passed, 16 skipped, 15 xfailed. The only 2 failures are pre-existing container-environment artifacts unrelated to this change (image's stale bakedsetup.py; a cugraph test that asserts "without cugraph" in an image that has cugraph).Follow-ups (separate PRs, after this lands)
a.*columns at the comparison hook (paired-contract).RETURN astructured-vs-text +where_rowsmicro-probes.🤖 Generated with Claude Code