feat(gfql): polars-gpu = GPU target of the lazy Polars engine (collect-once) [redo of #1654] #1655
Open
lmeyerov wants to merge 2 commits into
Redo of the per-op GPU engine (#1654, a perf regression) as a TARGET of the lazy engine: engine='polars-gpu' runs the same single-deferred-plan + collect-once on the cudf_polars GPU backend. Tiny wiring on top of the lazy engine; the lazy/ framework already does target-aware collect.

- Engine.POLARS_GPU = 'polars-gpu' + POLARS_ENGINES; explicit opt-in (AUTO never picks it); frames stay pl.DataFrame (treated like POLARS in frame ops).
- compute/{hop,chain}.py dispatch: engine in (POLARS, POLARS_GPU) -> wrap the lazy call in target_mode(GPU if POLARS_GPU else CPU). ComputeMixin + gfql_unified same-path WHERE accept POLARS_GPU. engine='polars' (CPU) byte-for-byte unchanged.
- raise_on_fail=False (GPU-incapable nodes stay on CPU in polars; no pandas bridge).

dgx: parity engine='polars-gpu' == engine='polars' (test_engine_polars_gpu.py 36 passed); full gfql suite 2921 passed, 0 failed. Single-hop GPU 2.84x @1m (vs the per-op regression); chain-level GPU win currently dilutes (fwd+bwd 2 collects + eager combine) -> next opt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(P-B)

GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects (hop/chain) on the streaming executor. Benchmarked (dgx, interleaved A/B, parity-identical): ~1.11x at 10M nodes/80M edges (20.0->18.0s), ~1.04x at 1M, but ~0.86x (slower) at 100K, where the streaming overhead loses on small/interactive sizes. So default OFF (behavior unchanged); opt-in for large batch traversals.

From the blogpost perf-opt handoff item B (polars-CPU heavy-join scaling). The full streaming win in isolation is larger (80M 2-hop semijoin 1669->1040ms, 1.6x); the real chain dilutes it via the forward/backward/combine overhead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
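As a rough illustration of the opt-in flag in this commit, the sketch below reads GFQL_POLARS_CPU_STREAMING and picks an executor name. The helper names are hypothetical, and treating only the exact value "1" as enabled is an assumption based on the commit message. The returned strings are meant to match the values that recent Polars versions accept for `LazyFrame.collect(engine=...)`.

```python
import os

ENV_FLAG = "GFQL_POLARS_CPU_STREAMING"


def cpu_streaming_enabled(environ=None):
    """Return True only when the flag is set to exactly "1" (default: off)."""
    env = os.environ if environ is None else environ
    return env.get(ENV_FLAG, "").strip() == "1"


def cpu_collect_engine(environ=None):
    """Pick the executor name for polars-CPU collects.

    Default stays in-memory so small/interactive traversals keep their
    current behavior; streaming is opt-in for large batch traversals.
    """
    return "streaming" if cpu_streaming_enabled(environ) else "in-memory"
```

A caller could then write something like `lazy_plan.collect(engine=cpu_collect_engine())`, where `lazy_plan` is whatever `pl.LazyFrame` the hop or chain built.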
Summary
GPU execution target of the lazy Polars engine (engine='polars-gpu'): the redo of the per-op GPU PR (#1654, which was a perf regression) as a thin target of the lazy engine on #1648. engine='polars' and engine='polars-gpu' are now one lazy engine, two targets: build a single deferred pl.LazyFrame plan per single-hop and collect_all it once on CPU or GPU.

Why this replaces #1654
Benchmarks showed that per-op eager GPU collect was a regression, because each op re-transfers its tables host-to-device (H2D). Collect-once is the fix: single-hop GPU is now a 2.84× win @1m (vs eager) with CPU parity. The win flows from the lazy engine on #1648; this PR just selects the GPU target.
Design (contained)
- Engine.POLARS_GPU='polars-gpu', explicit opt-in only (AUTO never selects it); frames stay pl.DataFrame (treated like POLARS in all frame ops).
- Extends POLARS_ENGINES (introduced on #1648, the native lazy Polars engine) to include the GPU target, so every engine-aware helper covers it for free.
- Dispatch wraps the lazy call in target_mode(GPU) (the lazy/ framework's collect/collect_all already run on the active target). engine='polars' (CPU) is byte-for-byte unchanged.
- raise_on_fail=False: GPU-incapable nodes stay on CPU in Polars (no pandas bridge; NO-CHEATING).
- Uses the cudf-polars in-memory executor (executor="in-memory"), which is faster and more stable than the default streaming engine="gpu" for in-device-memory GFQL results.

Also here: opt-in CPU streaming collect
GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects on the streaming executor (~1.04–1.11× faster on large multi-hop traversals, parity-identical), default off because small/interactive sizes regress (~0.86× from streaming overhead). No change to default behavior.

Honest scope
Single-hop GPU wins; the chain-level GPU win currently dilutes (a chain runs forward+backward = 2 hop collects + eager _combine_*). Fusing those + moving combine onto the target is the next benchmark-driven opt on this PR.

Validation (dgx, RAPIDS --gpus all)
- Parity engine='polars-gpu' == engine='polars': test_engine_polars_gpu.py (skips without cudf_polars).
- graphistry/tests/compute/gfql/polars suite green; CPU engine='polars' is byte-for-byte unchanged from #1648 (the GPU target adds only the 2 GPU/streaming commits on top).

Stacks on #1648 (lazy Polars engine) → #1652 (general opts) → master. Supersedes #1654 (per-op GPU).
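One way to reason about the chain-level dilution noted under Honest scope is Amdahl's law: if only part of a chain's runtime benefits from the GPU target, the end-to-end speedup is bounded by the remainder. The sketch below is a back-of-the-envelope model, not a measurement. The function name is hypothetical and the accelerated fractions are made-up inputs; only the 2.84× single-hop figure comes from this PR.

```python
def chain_speedup(accelerated_fraction, stage_speedup):
    """Amdahl's law: overall speedup when only a fraction of runtime is sped up.

    accelerated_fraction: share of baseline chain time spent in the stage
        that benefits (for example, the hop collects), between 0 and 1.
    stage_speedup: speedup of that stage alone (for example, 2.84).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if stage_speedup <= 0:
        raise ValueError("stage_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / stage_speedup)


SINGLE_HOP_GPU_SPEEDUP = 2.84  # reported in this PR for single-hop @1m

# Hypothetical fractions, chosen only to show how quickly the win dilutes.
for fraction in (0.25, 0.5, 0.75, 1.0):
    overall = chain_speedup(fraction, SINGLE_HOP_GPU_SPEEDUP)
    print(f"hop share {fraction:.2f} -> chain speedup {overall:.2f}x")
```

Under this model, moving the eager combine step onto the target raises the accelerated fraction, which is consistent with fusing and moving combine being listed as the next optimization.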
🤖 Generated with Claude Code