feat(gfql): Polars-GPU engine (engine='polars-gpu', cudf_polars) [PR3, stacked on #1648] by lmeyerov · Pull Request #1654 · graphistry/pygraphistry

lmeyerov · 2026-06-26T23:02:48Z

Summary

GPU execution mode of the native Polars engine (Engine.POLARS_GPU, opt-in via engine='polars-gpu') — turns the validated PR4 spike/loop-probe into a formal engine, stacked on the CPU Polars engine (#1648). The same vectorized ops run, but the hot traversal joins materialize on GPU via the RAPIDS cudf_polars backend (LazyFrame.collect(engine=pl.GPUEngine(raise_on_fail=False))).

Design (contained execution mode — not a frame type)

Engine.POLARS_GPU = 'polars-gpu', explicit opt-in only — engine='auto' never selects it.
Frames stay pl.DataFrame (handled exactly like POLARS in all frame ops via POLARS_ENGINES); only the collect boundary changes.
GPU intent is carried by a context var (engine_polars/gpu.py) set at the chain/hop dispatch boundary, so the engine internals don't thread a gpu flag. When GPU is inactive the join helper is the ordinary eager join → engine='polars' (CPU) is byte-for-byte unchanged.
raise_on_fail=False keeps any GPU-incapable node on CPU in Polars — NOT a pandas bridge (still honest/native; NO-CHEATING).
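
The context-var design above can be sketched roughly as follows. This is a minimal illustration, not the code in engine_polars/gpu.py: the names `_GPU_ACTIVE`, `gpu_mode`, `gpu_active`, and `join` are hypothetical, and the GPU branch assumes a Polars build with `pl.GPUEngine` plus an installed cudf_polars backend.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical flag name; the PR only states that a context var carries GPU intent.
_GPU_ACTIVE = contextvars.ContextVar("gfql_polars_gpu_active", default=False)


@contextmanager
def gpu_mode(active: bool = True):
    """Mark GPU intent for the duration of one chain/hop dispatch."""
    token = _GPU_ACTIVE.set(active)
    try:
        yield
    finally:
        _GPU_ACTIVE.reset(token)  # restore the previous value, even on error


def gpu_active() -> bool:
    """Read the current GPU intent without threading a flag through callers."""
    return _GPU_ACTIVE.get()


def join(left, right, on, how="semi"):
    """Join two pl.DataFrame objects; only the collect boundary depends on the flag."""
    if not gpu_active():
        # GPU inactive: the ordinary eager join, so engine='polars' is untouched.
        return left.join(right, on=on, how=how)
    import polars as pl  # lazy import; the GPU path also needs cudf_polars
    plan = left.lazy().join(right.lazy(), on=on, how=how)
    # raise_on_fail=False: a plan the GPU backend cannot run falls back to
    # CPU Polars instead of raising.
    return plan.collect(engine=pl.GPUEngine(raise_on_fail=False))
```

A dispatcher would wrap the traversal in `with gpu_mode(): ...` only when the caller passed engine='polars-gpu'; everything inside keeps calling `join(...)` unchanged.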

Scope (first slice)

GPU-routes the hop semi-joins + the final edge/node materialization (the loop-probe's proven win). The cypher row-pipeline ops run on CPU-Polars for now (still correct, still native) — extending GPU coverage there is the next increment.

Validation (dgx, RAPIDS container `--gpus all`)

Differential parity engine='polars-gpu' == engine='polars' across the cypher conformance corpus + core traversals — test_engine_polars_gpu.py, 36 passed (skips when no cudf_polars/GPU). GPU semi-join confirmed running on GPU (raise_on_fail=True probe).
Full graphistry/tests/compute/gfql/ suite unchanged: 2885 passed, 0 failed (engine='polars' untouched).
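
A differential parity check of this kind might look like the sketch below. It is not taken from test_engine_polars_gpu.py: the helper names are hypothetical, and ignoring row order is an assumption made here for illustration (the PR does not say whether ordering is part of the parity contract).

```python
import importlib.util


def has_cudf_polars() -> bool:
    """True when cudf_polars is importable (a usable GPU may still be absent)."""
    return importlib.util.find_spec("cudf_polars") is not None


def same_rows(cpu_df, gpu_df) -> bool:
    """Compare two pl.DataFrame results; row order is ignored (an assumption)."""
    if list(cpu_df.columns) != list(gpu_df.columns):
        return False
    # key=repr gives a total order even when rows contain None.
    return sorted(cpu_df.rows(), key=repr) == sorted(gpu_df.rows(), key=repr)


def collect_on_gpu_or_raise(lazy_frame):
    """Probe: raise_on_fail=True raises instead of falling back to CPU."""
    import polars as pl  # lazy import so the helpers above work without Polars
    return lazy_frame.collect(engine=pl.GPUEngine(raise_on_fail=True))
```

A test module could skip itself when `has_cudf_polars()` is false, run each query with both engines, and assert `same_rows(cpu_result, gpu_result)`.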

Stacks on #1648 (CPU Polars engine) → #1652 (general opts) → master.

🤖 Generated with Claude Code

GPU execution mode of the native Polars engine: same vectorized ops, but the hot traversal joins materialize on GPU via the RAPIDS cudf_polars backend (LazyFrame.collect(engine=pl.GPUEngine(raise_on_fail=False))). Turns the validated PR4 spike/loop-probe into a formal engine stacked on the CPU Polars engine.

- Engine.POLARS_GPU = 'polars-gpu', explicit opt-in only (AUTO never selects it); frames stay pl.DataFrame (POLARS_ENGINES handled like POLARS in all frame ops).
- GPU intent carried by a context var (engine_polars/gpu.py) set at the chain/hop dispatch boundary, so engine='polars' (CPU) is byte-for-byte unchanged: when GPU is inactive, the join helper is the ordinary eager join.
- raise_on_fail=False keeps GPU-incapable nodes on CPU IN POLARS, not a pandas bridge (still honest/native; NO-CHEATING).
- First slice GPU-routes the hop semi-joins + final edge/node materialization; the row pipeline runs CPU-Polars for now (still correct/native), to extend next.

Validated on dgx (RAPIDS --gpus all): differential parity engine='polars-gpu' == engine='polars' across the cypher conformance corpus + traversals (test_engine_polars_gpu.py, 36 passed; skips with no cudf_polars). Full gfql suite unchanged: 2885 passed, 0 failed (engine='polars' untouched).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

select / where_rows / order_by / group_by.agg / unwind cross-join now run on GPU when POLARS_GPU is active (via gpu.{select,where,sort,group_agg,join} helpers that lazy+collect on pl.GPUEngine), else the ordinary eager op (CPU unchanged).

dgx: parity engine='polars-gpu' == engine='polars' across the corpus exercised THROUGH GPU (test_engine_polars_gpu.py 36 passed); full gfql suite 2921 passed, 0 failed. Entity-text rendering still CPU-Polars (next).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lmeyerov · 2026-06-27T02:06:06Z

Superseded by the lazy-GPU PR (GPU as a target of the lazy engine on #1648). The per-op approach here was a perf regression (repeated H2D); the lazy collect-once version is a 2.84× single-hop GPU win @1m with CPU parity. Branch kept for reference.
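
The difference between the two approaches can be illustrated with a single hop written both ways. This is a sketch under assumptions, not code from either PR: the column names `id`, `src`, `dst` and all function names are hypothetical, and the GPU collector assumes `pl.GPUEngine` with cudf_polars installed.

```python
def cpu_collect(lazy_frame):
    """Materialize a plan on the default CPU engine."""
    return lazy_frame.collect()


def gpu_collect(lazy_frame):
    """Materialize a plan on the GPU backend, falling back to CPU if unsupported."""
    import polars as pl  # lazy import; needs cudf_polars for the GPU path
    return lazy_frame.collect(engine=pl.GPUEngine(raise_on_fail=False))


def hop_per_op(nodes, edges, frontier, collect=cpu_collect):
    """Per-op style: each step materializes, so a GPU run uploads inputs per step."""
    hit_edges = collect(
        edges.lazy().join(frontier.lazy(), left_on="src", right_on="id", how="semi")
    )
    dst_ids = hit_edges.lazy().select("dst").rename({"dst": "id"})
    return collect(nodes.lazy().join(dst_ids, on="id", how="semi"))


def hop_collect_once(nodes, edges, frontier, collect=cpu_collect):
    """Lazy style: build one deferred plan and materialize it a single time."""
    hit_edges = edges.lazy().join(
        frontier.lazy(), left_on="src", right_on="id", how="semi"
    )
    dst_ids = hit_edges.select("dst").rename({"dst": "id"})
    return collect(nodes.lazy().join(dst_ids, on="id", how="semi"))
```

Both functions return the nodes reached from `frontier`; passing `collect=gpu_collect` to `hop_collect_once` gives the backend the whole plan in one call, whereas `hop_per_op` crosses the collect boundary twice.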

Redo of the per-op GPU engine (#1654, a perf regression) as a TARGET of the lazy engine: engine='polars-gpu' runs the same single-deferred-plan + collect-once on the cudf_polars GPU backend. Tiny wiring on top of the lazy engine: the lazy/ framework already does target-aware collect.

- Engine.POLARS_GPU = 'polars-gpu' + POLARS_ENGINES; explicit opt-in (AUTO never picks it); frames stay pl.DataFrame (treated like POLARS in frame ops).
- compute/{hop,chain}.py dispatch: engine in (POLARS, POLARS_GPU) -> wrap the lazy call in target_mode(GPU if POLARS_GPU else CPU). ComputeMixin + gfql_unified same-path WHERE accept POLARS_GPU. engine='polars' (CPU) byte-for-byte unchanged.
- raise_on_fail=False (GPU-incapable nodes stay on CPU in polars; no pandas bridge).

dgx: parity engine='polars-gpu' == engine='polars' (test_engine_polars_gpu.py 36 passed); full gfql suite 2921 passed, 0 failed. Single-hop GPU 2.84x @1m (vs the per-op regression); chain-level GPU win currently dilutes (fwd+bwd 2 collects + eager combine) -> next opt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lmeyerov and others added 2 commits June 26, 2026 16:02

lmeyerov mentioned this pull request Jun 27, 2026

feat(gfql): polars-gpu = GPU target of the lazy Polars engine (collect-once) [redo of #1654] #1655

Open

lmeyerov closed this Jun 27, 2026

lmeyerov wants to merge 2 commits into dev/gfql-polars-engine from dev/gfql-polars-gpu
