feat(gfql): Polars-GPU engine (engine='polars-gpu', cudf_polars) [PR3, stacked on #1648] #1654

Closed
lmeyerov wants to merge 2 commits into dev/gfql-polars-engine from dev/gfql-polars-gpu

@lmeyerov

Summary

GPU execution mode of the native Polars engine (Engine.POLARS_GPU, opt-in via engine='polars-gpu') — turns the validated PR4 spike/loop-probe into a formal engine, stacked on the CPU Polars engine (#1648). The same vectorized ops run, but the hot traversal joins materialize on GPU via the RAPIDS cudf_polars backend (LazyFrame.collect(engine=pl.GPUEngine(raise_on_fail=False))).

Design (contained execution mode — not a frame type)

  • Engine.POLARS_GPU = 'polars-gpu', explicit opt-in only; engine='auto' never selects it.
  • Frames stay pl.DataFrame (handled exactly like POLARS in all frame ops via POLARS_ENGINES); only the collect boundary changes.
  • GPU intent is carried by a context var (engine_polars/gpu.py) set at the chain/hop dispatch boundary, so the engine internals don't thread a gpu flag. When GPU is inactive the join helper is the ordinary eager join, so engine='polars' (CPU) is byte-for-byte unchanged.
  • raise_on_fail=False keeps any GPU-incapable node on CPU in Polars — NOT a pandas bridge (still honest/native; NO-CHEATING).
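
For reviewers, a minimal stdlib-only sketch of the context-var pattern described in the bullets above. It is not the code in engine_polars/gpu.py: the names `_gpu_active`, `gpu_mode`, `resolve_engine`, `join_backend`, and `run_hop` are hypothetical, and the `auto` fallback to pandas is a placeholder assumption. It only illustrates how a dispatch boundary can set GPU intent once so that inner helpers read it without a threaded flag.

```python
import contextvars
from contextlib import contextmanager
from enum import Enum


class Engine(Enum):
    PANDAS = "pandas"
    POLARS = "polars"
    POLARS_GPU = "polars-gpu"
    AUTO = "auto"


# Engines whose frames are pl.DataFrame and share the same frame ops.
POLARS_ENGINES = frozenset({Engine.POLARS, Engine.POLARS_GPU})

# Hypothetical context var: True only while a polars-gpu call is in flight.
_gpu_active = contextvars.ContextVar("gfql_polars_gpu_active", default=False)


def resolve_engine(requested: str) -> Engine:
    """Map a user string to an Engine; 'auto' never resolves to POLARS_GPU."""
    engine = Engine(requested)
    if engine is Engine.AUTO:
        return Engine.PANDAS  # placeholder default, assumed for this sketch
    return engine


@contextmanager
def gpu_mode(engine: Engine):
    """Set GPU intent at the dispatch boundary and always restore it."""
    token = _gpu_active.set(engine is Engine.POLARS_GPU)
    try:
        yield
    finally:
        _gpu_active.reset(token)


def join_backend() -> str:
    """Inner helper: reads the context var instead of taking a gpu flag."""
    return "gpu-collect" if _gpu_active.get() else "eager-cpu"


def run_hop(engine_name: str) -> str:
    """Dispatch boundary: the only place that knows about the engine choice."""
    engine = resolve_engine(engine_name)
    if engine in POLARS_ENGINES:
        with gpu_mode(engine):
            return join_backend()
    return "non-polars"
```

With this shape, `run_hop("polars")` takes the eager CPU branch and `run_hop("polars-gpu")` takes the GPU collect branch, while `join_backend` itself has no engine parameter.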

Scope (first slice)

GPU-routes the hop semi-joins + the final edge/node materialization (the loop-probe's proven win). The cypher row-pipeline ops run on CPU-Polars for now (still correct, still native) — extending GPU coverage there is the next increment.

Validation (dgx, RAPIDS container --gpus all)

  • Differential parity engine='polars-gpu' == engine='polars' across the cypher conformance corpus + core traversals — test_engine_polars_gpu.py, 36 passed (skips when no cudf_polars/GPU). GPU semi-join confirmed running on GPU (raise_on_fail=True probe).
  • Full graphistry/tests/compute/gfql/ suite unchanged: 2885 passed, 0 failed (engine='polars' untouched).
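
A rough sketch of what a differential parity check between the two engines could look like, for readers who have not opened test_engine_polars_gpu.py. This is an assumption-laden illustration, not the test file itself: `canonical`, `assert_parity`, and the `run(query, engine)` callable are hypothetical, rows are modelled as plain dicts with hashable values, and the comparison is order-insensitive, which may be looser or stricter than the real tests.

```python
from collections import Counter
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def canonical(rows: List[Row]) -> Counter:
    """Order-insensitive multiset of rows (values must be hashable)."""
    return Counter(tuple(sorted(row.items())) for row in rows)


def assert_parity(
    query: str,
    run: Callable[[str, str], List[Row]],
    baseline: str = "polars",
    candidate: str = "polars-gpu",
) -> None:
    """Run the same query on both engines and fail on any row difference."""
    expected = canonical(run(query, baseline))
    actual = canonical(run(query, candidate))
    missing = expected - actual  # rows only the baseline produced
    extra = actual - expected    # rows only the candidate produced
    if missing or extra:
        raise AssertionError(
            f"{candidate} != {baseline} for {query!r}: "
            f"missing={dict(missing)} extra={dict(extra)}"
        )
```

In a real suite the `run` callable would execute the query through the library with the given `engine=` value, and the check would be skipped when no GPU backend is available, as noted above.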

Stacks on #1648 (CPU Polars engine) → #1652 (general opts) → master.

🤖 Generated with Claude Code

lmeyerov and others added 2 commits June 26, 2026 16:02
GPU execution mode of the native Polars engine: same vectorized ops, but the hot
traversal joins materialize on GPU via the RAPIDS cudf_polars backend
(LazyFrame.collect(engine=pl.GPUEngine(raise_on_fail=False))). Turns the validated
PR4 spike/loop-probe into a formal engine stacked on the CPU Polars engine.

- Engine.POLARS_GPU = 'polars-gpu', explicit opt-in only (AUTO never selects it);
  frames stay pl.DataFrame (POLARS_ENGINES handled like POLARS in all frame ops).
- GPU intent carried by a context var (engine_polars/gpu.py) set at the chain/hop
  dispatch boundary, so engine='polars' (CPU) is byte-for-byte unchanged: when GPU
  is inactive, the join helper is the ordinary eager join.
- raise_on_fail=False keeps GPU-incapable nodes on CPU IN POLARS — not a pandas
  bridge (still honest/native; NO-CHEATING).
- First slice GPU-routes the hop semi-joins + final edge/node materialization; the
  row pipeline runs CPU-Polars for now (still correct/native), to extend next.

Validated on dgx (RAPIDS --gpus all): differential parity engine='polars-gpu' ==
engine='polars' across the cypher conformance corpus + traversals
(test_engine_polars_gpu.py, 36 passed; skips with no cudf_polars). Full gfql suite
unchanged: 2885 passed, 0 failed (engine='polars' untouched).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
select / where_rows / order_by / group_by.agg / unwind cross-join now run on GPU
when POLARS_GPU is active (via gpu.{select,where,sort,group_agg,join} helpers that
lazy+collect on pl.GPUEngine), else the ordinary eager op (CPU unchanged).

dgx: parity engine='polars-gpu' == engine='polars' across the corpus exercised
THROUGH GPU (test_engine_polars_gpu.py 36 passed); full gfql suite 2921 passed, 0
failed. Entity-text rendering still CPU-Polars (next).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lmeyerov

Superseded by the lazy-GPU PR (GPU as a target of the lazy engine on #1648). The per-op approach here was a perf regression (repeated H2D, i.e. host-to-device, transfers); the lazy collect-once version is a 2.84× single-hop GPU win @1m with CPU parity. Branch kept for reference.
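
To make the regression concrete, here is a toy transfer-count model. It is purely illustrative: it assumes each per-op collect uploads its inputs and downloads its result, whereas a single deferred plan pays that round trip once. `TransferLog`, `run_per_op`, and `run_collect_once` are hypothetical names, and the counts are not measurements; the 2.84× figure above comes from the benchmark, not from this model.

```python
from dataclasses import dataclass


@dataclass
class TransferLog:
    h2d: int = 0  # host-to-device copies
    d2h: int = 0  # device-to-host copies


def run_per_op(n_ops: int, log: TransferLog) -> None:
    """Per-op style: every op is its own lazy+collect on the GPU."""
    for _ in range(n_ops):
        log.h2d += 1  # upload this op's input frame
        log.d2h += 1  # download the result back to a host frame


def run_collect_once(n_ops: int, log: TransferLog) -> None:
    """Lazy style: all ops are deferred into one plan, collected once."""
    log.h2d += 1  # upload the plan's inputs once
    log.d2h += 1  # download only the final result


per_op, once = TransferLog(), TransferLog()
run_per_op(5, per_op)
run_collect_once(5, once)
print(per_op, once)
```

Under these assumptions the per-op path pays one round trip per operation, so transfer cost grows with plan length, while the collect-once path stays constant.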

@lmeyerov lmeyerov closed this Jun 27, 2026
lmeyerov added a commit that referenced this pull request Jun 27, 2026
Redo of the per-op GPU engine (#1654, a perf regression) as a TARGET of the lazy
engine: engine='polars-gpu' runs the same single-deferred-plan + collect-once on
the cudf_polars GPU backend. Tiny wiring on top of the lazy engine — the lazy/
framework already does target-aware collect.

- Engine.POLARS_GPU = 'polars-gpu' + POLARS_ENGINES; explicit opt-in (AUTO never
  picks it); frames stay pl.DataFrame (treated like POLARS in frame ops).
- compute/{hop,chain}.py dispatch: engine in (POLARS, POLARS_GPU) -> wrap the lazy
  call in target_mode(GPU if POLARS_GPU else CPU). ComputeMixin + gfql_unified
  same-path WHERE accept POLARS_GPU. engine='polars' (CPU) byte-for-byte unchanged.
- raise_on_fail=False (GPU-incapable nodes stay on CPU in polars; no pandas bridge).

dgx: parity engine='polars-gpu' == engine='polars' (test_engine_polars_gpu.py 36
passed); full gfql suite 2921 passed, 0 failed. Single-hop GPU 2.84x @1m (vs the
per-op regression); chain-level GPU win currently dilutes (fwd+bwd 2 collects +
eager combine) -> next opt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>