1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm

### Changed
- **GFQL Cypher parse memoization (perf)**: `parse_cypher` now memoizes its result (LRU over the deterministic lark parse+transform → immutable frozen AST). Repeated identical Cypher queries skip the ~15 ms parse — the dominant per-call cost of small queries (~50% of a Cypher call at 100k rows) — making end-to-end query latency ~1.3–1.7× faster at small/interactive sizes across pandas/polars/cuDF. Safe to share the cached AST: every Cypher AST node is `@dataclass(frozen=True)` and `compile_cypher_query` does not mutate the parsed tree; validation errors still raise and are not cached.
- **GFQL structured whole-entity returns (#1650)**: Terminal Cypher `RETURN a` (whole node/edge) now emits **structured flattened columns** (`a.id`, `a.val`, `a.kind`, ...) instead of a single Cypher display string (`({id: 51, val: 51, kind: 'a'})`). The per-field columns already exist before projection, so this is "stop collapsing" rather than "rebuild": measured ~2–6.4× faster on pandas and ~2.7–4.3× on cuDF for whole-entity returns (the win grows with row count, since the old text render is O(rows) and the flat form is ~free), and the result is directly usable without re-parsing a string and survives JSON/CSV/Parquet/Arrow serialization and `plot()`. The human-readable Cypher display string remains available on demand via the `render_entity_text(result, alias)` presentation helper. OPTIONAL-MATCH / `WITH`-reentry / grouping paths that synthesize null/absent entities or still consume a single-column entity value are unchanged. Behavior change: callers that previously read the rendered display string from a terminal `RETURN a` column now receive flattened `a.*` columns. Edge case: a whole entity with NO fields to flatten — an entity with no id binding, no properties, and no type/label (in practice only an edge whose graph has no edge-id binding) — has no `{alias}.{field}` columns to emit, so it falls back to the single Cypher-display-text column under the bare alias (value is correct, e.g. `[]`); nodes always carry their id field and always flatten.
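
The parse memoization described in the first entry above can be illustrated with a minimal standalone sketch. It is not code from this PR: `ToyQuery`, `toy_parse`, and the `maxsize` value are hypothetical stand-ins, and the only claims carried over from the entry are the LRU cache, the frozen (immutable) result, and the fact that a raised error is not cached.

```python
from dataclasses import FrozenInstanceError, dataclass
from functools import lru_cache
from typing import Tuple


@dataclass(frozen=True)
class ToyQuery:
    """Hypothetical stand-in for a frozen Cypher AST node."""

    tokens: Tuple[str, ...]


@lru_cache(maxsize=256)  # maxsize is an arbitrary choice for this sketch
def toy_parse(query: str) -> ToyQuery:
    """Deterministic toy 'parse'; an exception raised here is never cached."""
    if not query.strip():
        raise ValueError("empty query")
    return ToyQuery(tokens=tuple(query.split()))


first = toy_parse("MATCH (a) RETURN a")
second = toy_parse("MATCH (a) RETURN a")  # identical text: served from the cache

# Sharing the cached object is safe only because it cannot be mutated.
try:
    first.tokens = ()  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because `lru_cache` stores only successful return values, an invalid query raises again on every call, which matches the "validation errors still raise and are not cached" behavior the entry describes.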

### Performance
- **GFQL temporal-detection dtype gate (#1650)**: `order_detect_temporal_mode` now short-circuits for numeric/bool/complex columns, which can never hold temporal *text*, instead of running an `astype(str)` + multi-regex `fullmatch` scan on every comparison. Eliminates spurious row-wise stringification in `where_rows`/comparison paths whose output never contains entity-text. Byte-identical results; measured `where_rows` speedups ~3.1× (pandas) and ~4.4–13.3× (cuDF, scaling with row count). Does not address whole-entity `RETURN a` text rendering, which is tracked separately.
27 changes: 27 additions & 0 deletions docs/source/gfql/cypher.rst
@@ -309,6 +309,33 @@ Row And Row-Pipeline Forms
including connected suffix projections in the current supported row-binding
subset.

Whole-Entity RETURN Output Shape
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A terminal ``RETURN`` of a whole node or relationship (``RETURN a`` rather than
``RETURN a.prop``) emits **structured flattened columns**, one per field, named
``<alias>.<field>``::

    g.gfql("MATCH (a:Person) RETURN a")
    # result._nodes columns: a.id, a.name, a.age, ... (one column per field)

The flattened result is directly usable (there is no string to re-parse) and
survives JSON / CSV / Parquet / Arrow serialization and ``plot()``. To recover the
human-readable Cypher display string (``(:Person {name: 'Alice'})``) on demand,
use the presentation helper::

    from graphistry.compute.gfql.cypher.result_postprocess import render_entity_text
    text_series = render_entity_text(result, "a")  # nodes
    text_series = render_entity_text(result, "r", table="edges")  # relationships

Notes:

- An explicit property projection of a field that the whole entity already
  covers (``RETURN a, a.val``) is de-duplicated: you get a single ``a.val``
  column, not two.
- A whole entity with no fields to flatten (no id binding, no properties, and no
  type/label; in practice only an edge whose graph has no edge-id binding) falls
  back to a single Cypher-display-text column under the bare alias. Nodes always
  carry an id field and always flatten.
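
The output shape and the fallback in the notes above can be sketched with plain dictionaries of lists. This is an illustrative model only: `flatten_entity`, the `Table` alias, and the literal `"[]"` placeholder are hypothetical and do not reproduce the library's implementation.

```python
from typing import Dict, List

Table = Dict[str, List[object]]  # column name -> column values


def flatten_entity(rows: Table, alias: str, fields: List[str]) -> Table:
    """Emit one '<alias>.<field>' column per field, else a bare-alias column."""
    present = [f for f in fields if f in rows]
    if present:
        # Structured form: the per-field columns are passed through as-is.
        return {f"{alias}.{f}": rows[f] for f in present}
    # Nothing to flatten: fall back to one display-text column under the alias.
    n_rows = len(next(iter(rows.values()), []))
    return {alias: ["[]"] * n_rows}


people: Table = {"id": [1, 2], "name": ["Alice", "Bob"]}
flat = flatten_entity(people, "a", ["id", "name"])

# An edge with no id binding, no properties, and no type has no fields.
bare = flatten_entity({"r": [True, True]}, "r", [])
```

The structured branch copies nothing and renders nothing, which is why the flat form is cheap compared with building a display string per row.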

Procedure And Multi-Branch Forms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

11 changes: 4 additions & 7 deletions graphistry/compute/chain.py
@@ -1097,13 +1097,10 @@ def _chain_impl(
)
if added_edge_index:
final_edges_df = final_edges_df.drop(columns=[g._edge])
# Rebuild from `self` to restore the ORIGINAL edge binding (`self._edge`,
# often None — `g` carries the internal edge-index binding instead), but
# explicitly carry the materialized node-id binding `g._node`: for an
# edges-only input `self._node is None`, so rebuilding from `self` alone
# drops it, leaving the endpoint-reconciliation concat below to synthesize
# a `None`-named column (corrupt result + a void-block concat crash on
# newer pandas).
# `self` restores the original edge binding, but carry the materialized
# `g._node` explicitly: an edges-only `self._node is None` would drop the
# node binding, making the reconciliation concat synthesize a corrupt
# `None`-named column (and a void-block concat crash on newer pandas).
g_out = self.nodes(final_nodes_df, g._node).edges(final_edges_df, edge=original_edge)
else:
g_out = g.nodes(final_nodes_df).edges(final_edges_df)
170 changes: 158 additions & 12 deletions graphistry/compute/gfql/cypher/result_postprocess.py
@@ -1,7 +1,7 @@
from __future__ import annotations

from dataclasses import replace
from typing import Any, Dict, Literal, Optional, TypedDict, cast
from typing import Any, Dict, List, Literal, Optional, Set, TypedDict, cast

import pandas as pd

@@ -116,6 +116,105 @@ def _format_edge_entities(df: DataFrameT, projection: ResultProjectionPlan) -> S
)


def _label_flag_columns(df: DataFrameT) -> list[str]:
    return [
        str(col)
        for col in df.columns
        if str(col).startswith("label__")
        and str(col).split("label__", 1)[1] not in {"<NA>", "None", "nan"}
    ]


def _flat_entity_field_names(
    source_rows_df: DataFrameT, projection: ResultProjectionPlan, id_column: Optional[str]
) -> list[str]:
    """Ordered field names for a flattened whole-entity projection (#1650).

    Mirrors the renderer's column selection (``node_property_columns`` /
    ``edge_property_columns`` honor ``exclude_columns`` so sibling aliases and
    engine-internal columns are not pulled in), then prepends the entity id and
    (nodes) appends ``label__*`` flags / (edges) the ``type`` column so
    :func:`render_entity_text` can losslessly reconstruct the Cypher form.
    """
    alias_col = projection.alias
    if projection.table == "nodes":
        prop_cols = node_property_columns(source_rows_df, alias_col, projection.exclude_columns)
        # label sources for faithful reconstruction: label__* flags and/or the
        # node ``type`` column (both consumed by _node_label_text).
        extra = _label_flag_columns(source_rows_df)
        if "type" in source_rows_df.columns:
            extra = [*extra, "type"]
    else:
        prop_cols = edge_property_columns(source_rows_df, alias_col, projection.exclude_columns)
        extra = ["type"] if "type" in source_rows_df.columns else []

    fields: list[str] = []
    for col in [id_column, *prop_cols, *extra]:
        if col is not None and col in source_rows_df.columns and col not in fields:
            fields.append(str(col))
    return fields


def _flat_entity_columns(
    source_rows_df: DataFrameT,
    projection: ResultProjectionPlan,
    output_name: str,
    id_column: Optional[str],
) -> Dict[str, SeriesT]:
    """Structured (flattened) whole-entity projection (issue #1650).

    Emit one ``{output_name}.{field}`` column per aliased field instead of
    collapsing the entity into a single Cypher display string. The per-field
    columns already exist on ``source_rows_df`` (gathered by
    ``_projection_alias_rows``), so this is "stop collapsing", not "rebuild":
    near-free, lossless, and directly usable without re-parsing a string.
    """
    return {
        f"{output_name}.{field}": cast(SeriesT, source_rows_df[field])
        for field in _flat_entity_field_names(source_rows_df, projection, id_column)
    }


def render_entity_text(
    result: Plottable, alias: str, *, table: Literal["nodes", "edges"] = "nodes"
) -> SeriesT:
    """Render a structured whole-entity projection back to Cypher display text.

    Presentation helper: given a result whose ``RETURN <alias>`` was emitted as
    flattened ``{alias}.{field}`` columns (the default since #1650), reconstruct
    the Cypher display string (``(:Label {..})`` / ``[:TYPE {..}]``). Used by the
    conformance/TCK driver and by callers who want the human-readable form. The
    structured data path itself never pays this cost.
    """
    rows_df = cast(DataFrameT, result._nodes)
    if rows_df is None:
        raise ValueError("result has no _nodes frame to render")
    prefix = f"{alias}."
    field_cols = [col for col in rows_df.columns if str(col).startswith(prefix)]
    if not field_cols:
        raise ValueError(f"no flattened columns found for alias {alias!r}")
    frame = cast(
        DataFrameT,
        rows_df[field_cols].rename(columns={col: str(col)[len(prefix):] for col in field_cols}),
    )
    # An OPTIONAL-MATCH miss flattens to a row whose fields are all null; such
    # rows must render as null, not "()". Track presence (any field non-null).
    present: Optional[SeriesT] = None
    for field in frame.columns:
        not_na = cast(SeriesT, frame[field].notna())
        present = not_na if present is None else cast(SeriesT, present | not_na)
    # _format_*_entities anchors length/null on a bare alias column; render every
    # row, then null absent rows below.
    frame = cast(DataFrameT, frame.assign(**{alias: True}))
    projection = ResultProjectionPlan(alias=alias, table=table, columns=(), exclude_columns=())
    rendered = _format_node_entities(frame, projection) if table == "nodes" else _format_edge_entities(frame, projection)
    if present is not None and hasattr(rendered, "where"):
        # Null absent rows. ``other=None`` fills NaN/None (valid pandas/cuDF);
        # the pandas-stubs ``where`` overload is stricter than runtime here.
        rendered = cast(SeriesT, rendered.where(present, None))  # type: ignore[call-overload]
    return rendered
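
The presence tracking in `render_entity_text` (a row counts as present when any field is non-null; absent rows render as null rather than `"()"`) can be sketched without a dataframe library. This is a simplified standalone model, not the PR's code: `render_rows` is hypothetical, plain lists replace `Series.notna()` / `Series.where()`, and the text format is only an approximation of the Cypher display form.

```python
from typing import Dict, List, Optional


def render_rows(fields: Dict[str, List[object]]) -> List[Optional[str]]:
    """Render each row as display text, or None when every field is null."""
    n_rows = len(next(iter(fields.values()), []))
    rendered: List[Optional[str]] = []
    for i in range(n_rows):
        row = {name: column[i] for name, column in fields.items()}
        if all(value is None for value in row.values()):
            # An OPTIONAL MATCH miss: all fields null, so the entity is absent.
            rendered.append(None)
            continue
        props = ", ".join(
            f"{name}: {value!r}" for name, value in row.items() if value is not None
        )
        rendered.append("({" + props + "})")
    return rendered


rendered = render_rows({"id": [1, None], "name": ["Alice", None]})
```

The helper in the diff computes the same mask column by column with a running `present | not_na`, which keeps the check vectorized.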


def _project_property_column(
rows_df: DataFrameT,
*,
@@ -124,10 +223,8 @@ def _project_property_column(
if column.source_name is None or column.source_name not in rows_df.columns:
raise ValueError(f"projection source column not found: {column.source_name!r}")
series = cast(SeriesT, rows_df[column.source_name])
# Temporal-constructor normalization only applies to STRING values; numeric/bool/
# complex columns can never hold temporal text, so skip the (otherwise spurious)
# ``astype(str)`` + detection scan and return the column as-is — byte-identical,
# since the scan returns None for these dtypes. Mirrors the #1650/#1651 gate.
# Temporal-constructor normalization only applies to strings; numeric/bool/complex
# can't hold temporal text, so skip the astype(str)+scan (byte-identical). #1650 gate.
if is_non_textual_scalar_dtype(getattr(series, "dtype", None)):
return series
if hasattr(series, "astype") and hasattr(cast(SeriesT, series.astype(str)), "str"):
@@ -185,7 +282,17 @@ def _projection_alias_rows(
return None


def apply_result_projection(result: Plottable, projection: ResultProjectionPlan) -> Plottable:
def apply_result_projection(
result: Plottable, projection: ResultProjectionPlan, *, structured: bool = True
) -> Plottable:
"""Project Cypher RETURN columns onto ``result._nodes``.

``structured=True`` (#1650 default) emits whole-entity returns as flattened
``{alias}.{field}`` columns. ``structured=False`` keeps the legacy single
Cypher-display-string column; the reentry / OPTIONAL-MATCH null-fill machinery
(which still assumes a single-column entity value) opts out via this flag until
it is unified onto the structured path.
"""
rows_df = cast(DataFrameT, getattr(result, "_nodes", None))
if rows_df is None:
return result
@@ -194,27 +301,54 @@ def apply_result_projection(result: Plottable, projection: ResultProjectionPlan)
return result
projected_data: Dict[str, SeriesT] = {}
projected_entity_meta: Dict[str, WholeRowProjectionMeta] = {}
output_columns: list[str] = []
for column in projection.columns:
if column.kind == "whole_row":
source_alias = column.source_name or projection.alias
source_rows_df = _projection_alias_rows(rows_df, alias=source_alias)
if source_rows_df is None or source_alias not in source_rows_df.columns:
raise ValueError(f"whole-row projection source alias not found: {source_alias!r}")
source_projection = projection if source_alias == projection.alias else replace(projection, alias=source_alias)
projected_data[column.output_name] = (
_format_node_entities(source_rows_df, source_projection)
if projection.table == "nodes"
else _format_edge_entities(source_rows_df, source_projection)
)
id_column = getattr(result, "_node" if source_projection.table == "nodes" else "_edge", None)
flat_columns = (
_flat_entity_columns(source_rows_df, source_projection, column.output_name, id_column)
if structured
else {}
)
if structured and flat_columns:
# Structured (flattened) emission (#1650): one column per field; text
# stays available via render_entity_text().
projected_data.update(flat_columns)
output_columns.extend(flat_columns.keys())
elif structured:
# No fields to flatten: the synthesized absent-entity row (OPTIONAL miss
# / reentry no-match, a single ``{alias: None}`` column) or a field-less
# real entity. Emit the single-column text form (renders to None / []).
projected_data[column.output_name] = (
_format_node_entities(source_rows_df, source_projection)
if source_projection.table == "nodes"
else _format_edge_entities(source_rows_df, source_projection)
)
output_columns.append(column.output_name)
else:
projected_data[column.output_name] = (
_format_node_entities(source_rows_df, source_projection)
if source_projection.table == "nodes"
else _format_edge_entities(source_rows_df, source_projection)
)
output_columns.append(column.output_name)
if id_column is not None and id_column in source_rows_df.columns:
projected_entity_meta[column.output_name] = {
"table": source_projection.table,
"alias": source_projection.alias,
"id_column": id_column,
# Snapshot the id Series: the bounded-reentry path recovers
# carried node identities from this meta and must not alias the
# live working frame (see #1356).
"ids": cast(SeriesT, source_rows_df[id_column]).copy(),
}
else:
output_columns.append(column.output_name)
if column.kind == "property":
property_rows_df = alias_rows_df
if (
@@ -226,14 +360,26 @@ def apply_result_projection(result: Plottable, projection: ResultProjectionPlan)
projected_data[column.output_name] = _project_property_column(property_rows_df, column=column)
else:
projected_data[column.output_name] = _project_expr_column(result, rows_df, column=column)
# De-dup output columns (#1650): a flattened whole entity `a` (-> a.id, a.val, ...)
# collides by name with an explicit property projection (`RETURN a, a.val`). Both
# read the same source field (dotted aliases are rejected), so values are identical
# — keep first occurrence; a duplicate name would drop data on to_dict/serialization.
if len(set(output_columns)) != len(output_columns):
seen: Set[str] = set()
deduped: List[str] = []
for c in output_columns:
if c not in seen:
seen.add(c)
deduped.append(c)
output_columns = deduped
projected_rows = alias_rows_df
if rows_df.__class__.__module__.startswith("cudf") and any(isinstance(value, pd.Series) for value in projected_data.values()):
projected_rows = cast(DataFrameT, cast(Any, alias_rows_df).to_pandas())
projected_data = {
key: cast(SeriesT, value.to_pandas() if hasattr(value, "to_pandas") else value)
for key, value in projected_data.items()
}
projected_nodes = cast(DataFrameT, projected_rows.assign(**projected_data)[[column.output_name for column in projection.columns]])
projected_nodes = cast(DataFrameT, projected_rows.assign(**projected_data)[output_columns])

out = result.bind()
out._nodes = projected_nodes
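
The output-column de-duplication in this hunk keeps the first occurrence of each name and preserves order. A minimal standalone sketch of that technique follows; `dedupe_keep_first` is a hypothetical name, and it uses `dict.fromkeys` rather than the explicit `seen`-set loop in the diff. Both approaches give the same result on Python 3.7+, where dictionaries preserve insertion order.

```python
from typing import List


def dedupe_keep_first(names: List[str]) -> List[str]:
    """Drop repeated names, keeping the first occurrence and original order."""
    return list(dict.fromkeys(names))


# `RETURN a, a.val`: the flattened entity already contributes `a.val`.
columns = dedupe_keep_first(["a.id", "a.val", "a.kind", "a.val"])
```

Keeping the first occurrence is safe here only because, as the diff comment notes, both columns read the same source field and therefore hold identical values.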
11 changes: 10 additions & 1 deletion graphistry/compute/gfql_unified.py
@@ -856,7 +856,15 @@ def _execute_compiled_query_chain_non_union(
empty_result_row=compiled_query.empty_result_row,
)
if compiled_query.result_projection is not None:
result = apply_result_projection(result, compiled_query.result_projection)
# OPTIONAL null-fill / row-guard still consumes a single-column entity value,
# so those keep the legacy text form; plain terminal RETURN flattens (#1650).
structured_projection = (
compiled_query.optional_projection_row_guard is None
and compiled_query.optional_null_fill is None
)
result = apply_result_projection(
result, compiled_query.result_projection, structured=structured_projection
)
if compiled_query.optional_projection_row_guard is not None:
expected_rows = 1
for base_chain in compiled_query.optional_projection_row_guard.base_chains:
@@ -892,6 +900,7 @@ def _execute_compiled_query_chain_non_union(
context,
),
compiled_query.optional_null_fill.alignment_projection,
structured=False,
)
result = _apply_optional_null_fill(
result,