
feat(gfql): polars-gpu = GPU target of the lazy Polars engine (collect-once) [redo of #1654]#1655

Open
lmeyerov wants to merge 2 commits into dev/gfql-polars-engine from dev/gfql-lazy-gpu

Conversation

lmeyerov (Contributor) commented Jun 27, 2026

Summary

This PR adds the GPU execution target of the lazy Polars engine (engine='polars-gpu'). It redoes the per-op GPU PR (#1654, which was a perf regression) as a thin target of the lazy engine on #1648. engine='polars' and engine='polars-gpu' are now one lazy engine with two targets: each single-hop builds a single deferred pl.LazyFrame plan and collects it once with collect_all, on CPU or GPU.

Why this replaces #1654

Benchmarks showed that per-op eager GPU collect was a regression, because each op re-transfers its tables host-to-device (H2D). Collect-once is the fix: single-hop GPU is now a 2.84× win @1m (vs eager), with parity against engine='polars'. The win comes from the lazy engine on #1648; this PR only selects the GPU target.
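
To make the H2D argument concrete, here is a toy count of table uploads under the two strategies. It is a sketch, not a measurement: the function name, the op count, and the table count are all hypothetical, and it assumes every GPU collect uploads each input table exactly once.

```python
def h2d_table_uploads(n_ops: int, tables_per_collect: int, collect_once: bool) -> int:
    """Toy count of host-to-device table uploads for one hop.

    Assumption (not from the PR): each GPU collect uploads every input
    table once, and per-op eager execution performs one collect per op.
    """
    if n_ops < 0 or tables_per_collect < 0:
        raise ValueError("counts must be non-negative")
    if n_ops == 0:
        return 0
    n_collects = 1 if collect_once else n_ops
    return n_collects * tables_per_collect


# Hypothetical hop: 5 ops over 2 input tables (nodes and edges).
per_op = h2d_table_uploads(5, 2, collect_once=False)
once = h2d_table_uploads(5, 2, collect_once=True)
print(f"per-op eager: {per_op} uploads, collect-once: {once} uploads")
```

Under this model the upload count grows with the number of ops for per-op eager execution and stays constant for collect-once, which is the shape of the regression described above.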

Design (contained)

  • Engine.POLARS_GPU='polars-gpu' is explicit opt-in only (AUTO never selects it); frames stay pl.DataFrame and are treated like POLARS in all frame ops. It extends POLARS_ENGINES (introduced on #1648) to include the GPU target, so every engine-aware helper covers it for free.
  • Dispatch wraps the lazy hop/chain in target_mode(GPU); the lazy/ framework's collect/collect_all already run on the active target. engine='polars' (CPU) is byte-for-byte unchanged.
  • raise_on_fail=False: GPU-incapable nodes stay on CPU in Polars (no pandas bridge; NO-CHEATING). The GPU target uses the cudf-polars in-memory executor (executor="in-memory"), which is faster and more stable than the default streaming engine="gpu" for GFQL results that fit in device memory.

Also here: opt-in CPU streaming collect

GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects on the streaming executor. It is ~1.04–1.11× faster on large multi-hop traversals with parity-identical results, but it is off by default because small/interactive sizes regress (~0.86×, from streaming overhead). Default behavior is unchanged.
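
A default-off environment flag of this kind might be read as follows. The helper name is hypothetical, and the parsing rule (only the exact string "1" enables streaming) is an assumption; the PR description does not show how the variable is parsed.

```python
import os


def cpu_streaming_enabled(environ=None) -> bool:
    """Return True only when GFQL_POLARS_CPU_STREAMING is exactly "1".

    Assumption: any other value, or an unset variable, keeps the default
    (non-streaming) behavior.
    """
    env = os.environ if environ is None else environ
    return env.get("GFQL_POLARS_CPU_STREAMING", "0") == "1"


# Passing a mapping makes the helper easy to exercise without touching os.environ.
print(cpu_streaming_enabled({"GFQL_POLARS_CPU_STREAMING": "1"}))
print(cpu_streaming_enabled({}))
```

In a shell, opting in for one large batch run would then look like `GFQL_POLARS_CPU_STREAMING=1 python my_traversal.py`, where the script name is a placeholder.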

Honest scope

Single-hop GPU wins, but the chain-level GPU win currently dilutes: a chain runs forward + backward, which is 2 hop collects plus an eager _combine_*. Fusing those collects and moving combine onto the target is the next benchmark-driven optimization on this PR.
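
The dilution can be reasoned about with Amdahl's law: if only a fraction of chain time is spent in the accelerated hop collects, the end-to-end speedup is bounded by the part that stays eager. The sketch below reuses the 2.84× single-hop figure purely as an illustrative stage speedup; the fractions are hypothetical, not measured values from this PR.

```python
def overall_speedup(accelerated_fraction: float, stage_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only part of the work is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if stage_speedup <= 0.0:
        raise ValueError("stage_speedup must be positive")
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / stage_speedup)


HOP_SPEEDUP = 2.84  # single-hop figure quoted in the PR description

# Hypothetical shares of chain time spent inside the hop collects.
for fraction in (0.25, 0.5, 0.75, 1.0):
    print(f"hop share {fraction:.2f} -> chain speedup {overall_speedup(fraction, HOP_SPEEDUP):.2f}x")
```

The chain only reaches the full hop speedup when the hop share is 1.0, which is why moving the combine step onto the target matters.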

Validation (dgx, RAPIDS --gpus all)

Stacks on #1648 (lazy Polars engine) → #1652 (general opts) → master. Supersedes #1654 (per-op GPU).

⚠️ This branch was force-rewritten on 2026-06-28 (restack moved the CPU conformance fixes down to #1648). If you have it checked out: git fetch && git reset --hard origin/dev/gfql-lazy-gpu.

🤖 Generated with Claude Code

lmeyerov force-pushed the dev/gfql-polars-engine branch from dcce5fa to be04687 (June 27, 2026 15:11)
lmeyerov force-pushed the dev/gfql-lazy-gpu branch from d93976c to a7df2f2 (June 27, 2026 15:11)
lmeyerov force-pushed the dev/gfql-polars-engine branch from be04687 to aefd073 (June 27, 2026 16:35)
lmeyerov force-pushed the dev/gfql-lazy-gpu branch 6 times, most recently from d8b2074 to a684248 (June 27, 2026 17:35)
lmeyerov force-pushed the dev/gfql-polars-engine branch 2 times, most recently from 35f65b5 to e9f29bd (June 27, 2026 18:12)
lmeyerov force-pushed the dev/gfql-lazy-gpu branch 4 times, most recently from ea0cf19 to 5529138 (June 28, 2026 07:34)
lmeyerov and others added 2 commits June 28, 2026 01:03
Redo of the per-op GPU engine (#1654, a perf regression) as a TARGET of the lazy
engine: engine='polars-gpu' runs the same single-deferred-plan + collect-once on
the cudf_polars GPU backend. Tiny wiring on top of the lazy engine — the lazy/
framework already does target-aware collect.

- Engine.POLARS_GPU = 'polars-gpu' + POLARS_ENGINES; explicit opt-in (AUTO never
  picks it); frames stay pl.DataFrame (treated like POLARS in frame ops).
- compute/{hop,chain}.py dispatch: engine in (POLARS, POLARS_GPU) -> wrap the lazy
  call in target_mode(GPU if POLARS_GPU else CPU). ComputeMixin + gfql_unified
  same-path WHERE accept POLARS_GPU. engine='polars' (CPU) byte-for-byte unchanged.
- raise_on_fail=False (GPU-incapable nodes stay on CPU in polars; no pandas bridge).

dgx: parity engine='polars-gpu' == engine='polars' (test_engine_polars_gpu.py 36
passed); full gfql suite 2921 passed, 0 failed. Single-hop GPU 2.84x @1m (vs the
per-op regression); chain-level GPU win currently dilutes (fwd+bwd 2 collects +
eager combine) -> next opt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(P-B)

GFQL_POLARS_CPU_STREAMING=1 runs the polars-CPU lazy collects (hop/chain) on the
streaming executor. Benchmarked (dgx, interleaved A/B, parity-identical): ~1.11x at
10M nodes/80M edges (20.0->18.0s), ~1.04x at 1M, but ~0.86x (slower) at 100K — the
streaming overhead loses on small/interactive sizes. So default OFF (behavior
unchanged); opt-in for large batch traversals.

From the blogpost perf-opt handoff item B (polars-CPU heavy-join scaling). The full
streaming win in isolation is larger (80M 2-hop semijoin 1669->1040ms, 1.6x); the
real chain dilutes it via the forward/backward/combine overhead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
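
The speedup ratios quoted in the streaming commit message above follow directly from its before/after timings. A short check, using only the numbers given there (the helper name is illustrative):

```python
def speedup(before: float, after: float) -> float:
    """Speedup factor as before/after; values above 1 mean the change is faster."""
    if before <= 0 or after <= 0:
        raise ValueError("timings must be positive")
    return before / after


# Timings quoted in the commit message.
chain_10m = speedup(20.0, 18.0)       # 10M nodes / 80M edges chain, seconds
semijoin_80m = speedup(1669, 1040)    # isolated 80M 2-hop semijoin, milliseconds

print(f"chain at 10M nodes: {chain_10m:.2f}x")
print(f"isolated semijoin:  {semijoin_80m:.2f}x")
```

The gap between the two ratios is the dilution the commit message attributes to forward/backward/combine overhead.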
lmeyerov force-pushed the dev/gfql-lazy-gpu branch from 5529138 to 194671a (June 28, 2026 08:09)