feat(gfql): polars-gpu = GPU target of the lazy Polars engine (collect-once) [redo of #1654] by lmeyerov · Pull Request #1655 · graphistry/pygraphistry

lmeyerov · 2026-06-27T02:06:05Z

Summary

This PR adds a GPU execution target for the lazy Polars engine (engine='polars-gpu'). It redoes the per-op GPU PR (#1654, which was a perf regression) as a thin target of the lazy engine on #1648. engine='polars' and engine='polars-gpu' are now one lazy engine with two targets: each builds a single deferred pl.LazyFrame plan per single hop and materializes it once with collect_all, on CPU or GPU.

Why this replaces #1654

The benchmark showed that per-op eager GPU collect was a regression, because each op re-transfers its tables host-to-device (H2D). Collect-once is the fix: single-hop GPU is now a 2.84× win @1m (vs eager) with CPU parity. The win flows from the lazy engine on #1648; this PR just selects the GPU target.

Design (contained)

Engine.POLARS_GPU='polars-gpu' is explicit opt-in only (AUTO never selects it); frames stay pl.DataFrame and are treated like POLARS in all frame ops. POLARS_ENGINES (introduced on #1648) is extended to include the GPU target, so every engine-aware helper covers it for free.
Dispatch wraps the lazy hop/chain in target_mode(GPU); the lazy/ framework's collect/collect_all already run on the active target. engine='polars' (CPU) is byte-for-byte unchanged.
raise_on_fail=False: GPU-incapable nodes stay on CPU in Polars (no pandas bridge; NO-CHEATING). Uses the cudf-polars in-memory executor (executor="in-memory"), which is faster and more stable than the default streaming executor of engine="gpu" for GFQL results that fit in device memory.
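The dispatch described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in this PR: the names Engine.POLARS_GPU, POLARS_ENGINES, and target_mode come from the description, but the Target enum, the ContextVar, resolve_auto (including its pandas fallback), run_hop, and current_target are hypothetical stand-ins.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from enum import Enum


class Engine(Enum):
    PANDAS = "pandas"
    POLARS = "polars"
    POLARS_GPU = "polars-gpu"
    AUTO = "auto"


class Target(Enum):  # hypothetical: where the lazy plan gets collected
    CPU = "cpu"
    GPU = "gpu"


# Both Polars-backed engines share the same lazy code path.
POLARS_ENGINES = frozenset({Engine.POLARS, Engine.POLARS_GPU})

# Hypothetical storage for the active target; CPU is the default.
_active_target = ContextVar("active_target", default=Target.CPU)


@contextmanager
def target_mode(target):
    """Set the collect target for the enclosed block, then restore it."""
    token = _active_target.set(target)
    try:
        yield
    finally:
        _active_target.reset(token)


def current_target():
    return _active_target.get()


def resolve_auto(engine):
    # AUTO must never resolve to the GPU target. The pandas fallback here
    # is an assumption made only to keep the sketch self-contained.
    return Engine.PANDAS if engine is Engine.AUTO else engine


def run_hop(engine, lazy_hop):
    """Run a lazy hop callable under the target implied by the engine."""
    engine = resolve_auto(engine)
    if engine in POLARS_ENGINES:
        target = Target.GPU if engine is Engine.POLARS_GPU else Target.CPU
        with target_mode(target):
            return lazy_hop()
    raise NotImplementedError(f"non-Polars engine: {engine.value}")
```

For example, run_hop(Engine.POLARS_GPU, current_target) returns Target.GPU, and the active target falls back to Target.CPU once the call returns. Because the hop itself only reads the active target at collect time, the CPU path needs no changes to gain a GPU sibling.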

Also here: opt-in CPU streaming collect

GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects on the streaming executor. It is ~1.04–1.11× faster on large multi-hop traversals with parity-identical results, but is off by default because small/interactive sizes regress (~0.86×, from streaming overhead). No change to default behavior.
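A minimal sketch of how such an opt-in flag might be read. Only the environment variable name comes from this PR; the helper name cpu_collect_engine and the choice to return Polars engine strings are assumptions. The returned value is meant to be passed as the engine argument of a LazyFrame collect call in a Polars version that accepts it.

```python
import os


def cpu_collect_engine(environ=os.environ):
    """Pick the CPU collect engine from the opt-in flag.

    Returns "streaming" only when GFQL_POLARS_CPU_STREAMING is exactly "1";
    anything else (including unset) returns "auto", which leaves the
    engine choice to Polars and so keeps default behavior unchanged.
    """
    flag = environ.get("GFQL_POLARS_CPU_STREAMING", "0")
    return "streaming" if flag == "1" else "auto"
```

Usage would look like `lf.collect(engine=cpu_collect_engine())`, set per process, for example `GFQL_POLARS_CPU_STREAMING=1 python batch_job.py` for a large batch traversal (the script name is a placeholder).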

Honest scope

Single-hop GPU wins; the chain-level GPU win is currently diluted, because a chain runs forward + backward passes (2 hop collects) plus an eager _combine_*. Fusing those collects and moving combine onto the target is the next benchmark-driven optimization on this PR.
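One way to reason about the dilution is Amdahl's law: if only part of the chain's runtime benefits from the GPU target, the end-to-end speedup is bounded by the part that does not. The sketch below is a back-of-the-envelope model, not a measurement. The 2.84× figure is the single-hop number quoted above; the fractions are hypothetical inputs, since the PR does not report how chain time splits between hop collects and combine.

```python
def overall_speedup(accelerated_fraction, stage_speedup):
    """Amdahl's law: end-to-end speedup when only part of the work speeds up.

    accelerated_fraction: share of baseline runtime that benefits (0..1).
    stage_speedup: speedup of that share in isolation (e.g. 2.84).
    """
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / stage_speedup)


# Hypothetical fractions, only to show the shape of the curve.
for fraction in (0.25, 0.5, 0.75, 1.0):
    print(f"{fraction:.2f} of runtime accelerated -> "
          f"{overall_speedup(fraction, 2.84):.2f}x overall")
```

Under this model, moving more of the chain (the second collect and the combine step) onto the target raises the accelerated fraction, which is what the proposed fusion aims at.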

Validation (dgx, RAPIDS `--gpus all`)

Parity engine='polars-gpu' == engine='polars': test_engine_polars_gpu.py (skips without cudf_polars).
Full graphistry/tests/compute/gfql/ polars suite green; CPU engine='polars' is byte-for-byte unchanged from #1648 (the GPU target adds only the 2 GPU/streaming commits on top).

Stacks on #1648 (lazy Polars engine) → #1652 (general opts) → master. Supersedes #1654 (per-op GPU).

⚠️ This branch was force-rewritten on 2026-06-28 (restack moved the CPU conformance fixes down to #1648). If you have it checked out: git fetch && git reset --hard origin/dev/gfql-lazy-gpu.

🤖 Generated with Claude Code

Redo of the per-op GPU engine (#1654, a perf regression) as a TARGET of the lazy engine: engine='polars-gpu' runs the same single-deferred-plan + collect-once on the cudf_polars GPU backend. Tiny wiring on top of the lazy engine; the lazy/ framework already does target-aware collect.

- Engine.POLARS_GPU = 'polars-gpu' + POLARS_ENGINES; explicit opt-in (AUTO never picks it); frames stay pl.DataFrame (treated like POLARS in frame ops).
- compute/{hop,chain}.py dispatch: engine in (POLARS, POLARS_GPU) -> wrap the lazy call in target_mode(GPU if POLARS_GPU else CPU). ComputeMixin + gfql_unified same-path WHERE accept POLARS_GPU. engine='polars' (CPU) byte-for-byte unchanged.
- raise_on_fail=False (GPU-incapable nodes stay on CPU in polars; no pandas bridge).

dgx: parity engine='polars-gpu' == engine='polars' (test_engine_polars_gpu.py 36 passed); full gfql suite 2921 passed, 0 failed. Single-hop GPU 2.84x @1m (vs the per-op regression); chain-level GPU win currently dilutes (fwd+bwd 2 collects + eager combine) -> next opt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…(P-B) GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects (hop/chain) on the streaming executor.

Benchmarked (dgx, interleaved A/B, parity-identical): ~1.11x at 10M nodes/80M edges (20.0->18.0s), ~1.04x at 1M, but ~0.86x (slower) at 100K, where the streaming overhead loses on small/interactive sizes. So default OFF (behavior unchanged); opt-in for large batch traversals.

From the blogpost perf-opt handoff item B (polars-CPU heavy-join scaling). The full streaming win in isolation is larger (80M 2-hop semijoin 1669->1040ms, 1.6x); the real chain dilutes it via the forward/backward/combine overhead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lmeyerov force-pushed the dev/gfql-polars-engine branch from dcce5fa to be04687 Compare June 27, 2026 15:11

lmeyerov force-pushed the dev/gfql-lazy-gpu branch from d93976c to a7df2f2 Compare June 27, 2026 15:11

lmeyerov force-pushed the dev/gfql-polars-engine branch from be04687 to aefd073 Compare June 27, 2026 16:35

lmeyerov force-pushed the dev/gfql-lazy-gpu branch 6 times, most recently from d8b2074 to a684248 Compare June 27, 2026 17:35

lmeyerov force-pushed the dev/gfql-polars-engine branch 2 times, most recently from 35f65b5 to e9f29bd Compare June 27, 2026 18:12

lmeyerov force-pushed the dev/gfql-lazy-gpu branch 4 times, most recently from ea0cf19 to 5529138 Compare June 28, 2026 07:34

lmeyerov mentioned this pull request Jun 28, 2026

feat(gfql): native lazy Polars engine — collect-once traversals + cypher row pipeline #1648

Open

lmeyerov and others added 2 commits June 28, 2026 01:03

lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 5529138 to 194671a Compare June 28, 2026 08:09

lmeyerov wants to merge 2 commits into dev/gfql-polars-engine from dev/gfql-lazy-gpu
