perf(qdp): remove mid-stream D2H norm validation from batch GPU-pointer encoding paths #1370

@ryankert01

Summary

The batch amplitude encoding paths (encode_batch_from_gpu_ptr_f32, encode_batch_from_gpu_ptr) perform a blocking CPU round-trip for norm validation between the norm kernel and the encode kernel. This serializes two GPU kernel launches unnecessarily and is the primary reason Mahout's encode-only throughput falls behind PyTorch's reference implementation.

Current pipeline (batch path)

launch_l2_norm_batch_f32   (GPU, queued)
↓
cudaStreamSynchronize      ← STALL #1
dtoh_sync_copy(norms)      ← PCIe transfer: N floats, device → host
CPU: check norms[i] for zero/NaN
↓
launch_amplitude_encode_batch_f32  (GPU, queued)
↓
cudaStreamSynchronize      ← STALL #2

Two syncs + one D2H copy per batch, even when all norms are valid (which is almost always).
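
As a rough illustration of the host-side step in the pipeline above, the check on the copied-back norms amounts to a linear scan like the one below. This is a minimal Python sketch, not the project's code: the function name `find_invalid_norms` is hypothetical, and it assumes "invalid" means exactly zero or NaN, as the diagram states.

```python
import math

def find_invalid_norms(norms):
    """Return the batch indices whose L2 norm is zero or NaN.

    Hypothetical stand-in for the CPU check that runs after the
    device-to-host copy; the real check lives in the encoder.
    """
    return [i for i, n in enumerate(norms) if n == 0.0 or math.isnan(n)]

# Example: samples 1 and 2 would be rejected.
bad = find_invalid_norms([1.0, 0.0, float("nan"), 2.5])
```

The scan itself is cheap. The cost described in this issue comes from the sync and copy needed to get `norms` onto the host before it can run.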

Benchmark evidence

Running benchmark_pytorch_ref.py --qubits 16 --batches 200 --batch-size 64 (encode-only mode — both frameworks start with data already on GPU):

| Framework | Throughput |
| --- | --- |
| PyTorch GPU | 228,825 vec/s |
| Mahout | 66,615 vec/s |
| Ratio (Mahout / PyTorch) | 0.3x (Mahout 3.4x slower) |

PyTorch's amplitude_encode uses torch.linalg.vector_norm + data / norms.clamp(min=1e-10) — everything stays on GPU, no CPU validation.
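
To make the semantics of the clamp-based approach concrete, here is a CPU reference model in plain Python. It is a sketch under assumptions: `amplitude_encode_clamped` is a hypothetical name, the loops stand in for what the PyTorch expression does on the GPU, and only the `1e-10` floor is taken from the text above.

```python
import math

EPS = 1e-10  # clamp floor quoted above

def amplitude_encode_clamped(batch):
    """Scale each row to unit L2 norm, clamping the norm at EPS.

    No validation is performed: a zero-norm row is divided by EPS
    and therefore stays all-zero instead of raising an error.
    """
    encoded = []
    for row in batch:
        norm = math.sqrt(sum(x * x for x in row))
        inv_norm = 1.0 / max(norm, EPS)
        encoded.append([x * inv_norm for x in row])
    return encoded

encoded = amplitude_encode_clamped([[3.0, 4.0], [0.0, 0.0]])
```

In this model the first row becomes approximately `[0.6, 0.8]`, and the zero row passes through unchanged as zeros. That is the silent-failure behavior the trade-off section weighs against the removed stall.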

In end-to-end mode (data gen + H2D + encode), Mahout is 7.8x faster than PyTorch because the full pipeline cost dominates. The gap only appears when isolating kernel work.

Related prior work

The same pattern was already fixed for the single-sample encode_from_gpu_ptr_f32 path in this branch: a new launch_amplitude_encode_f32_device_norm CUDA kernel reads inv_norm from device memory, letting the norm kernel and encode kernel chain on the same stream with a single sync at the end. That change yielded 1.22x speedup on the single-sample path (39.7 → 32.6 µs/sample at 16 qubits).

The batch path has the same structure and would benefit from the same treatment, but whether to remove the norm validation or make it optional is a policy question for the community.

The trade-off

Removing the D2H validation:

  • Pros: eliminates the mid-stream stall entirely and closes the gap with PyTorch. Norms stay on device as float* inv_norms_d, both kernels chain on the same stream, and there is one sync at the end
  • Cons: zero-norm or NaN inputs silently produce all-zero state vectors instead of returning an error; caller must validate upstream

Making validation optional (e.g. validate_norms: bool flag, default false):

  • Pros: preserves the safety net for development/debug use; hot path is fast
  • Cons: API complexity; easy to forget to enable for debugging
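
The opt-in variant could look roughly like the following at the API level. This is a plain-Python model of the control flow only, not the actual encoder: `encode_batch_model` is a hypothetical name, `validate_norms` follows the flag name suggested above, and the fast path is assumed to map zero or NaN norms to all-zero rows, as described under the cons of removing validation.

```python
import math

def encode_batch_model(batch, validate_norms=False):
    """Normalize each row; inspect the norms only when asked to.

    validate_norms=True models the current behavior (error on a
    zero or NaN norm). The default models the proposed fast path,
    which never looks at the norms on the host.
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in batch]
    if validate_norms:
        bad = [i for i, n in enumerate(norms) if n == 0.0 or math.isnan(n)]
        if bad:
            raise ValueError(f"zero or NaN norm at batch indices {bad}")
    # Fast path: a zero or NaN norm yields an all-zero row (assumed behavior).
    return [
        [x / n if n > 0.0 else 0.0 for x in row]
        for row, n in zip(batch, norms)
    ]
```

With the default, `encode_batch_model([[0.0, 0.0]])` returns a zero row without complaint. With `validate_norms=True`, the same call raises `ValueError`, which is the safety net a debug build would rely on.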

Keeping status quo:

  • Norm validation is a correctness guarantee at the encoder boundary
  • Useful during development when data pipelines are untested

Questions for the community

  1. Is norm validation at the encoder level a hard requirement, or should it be the caller's responsibility?
  2. Should the batch GPU-pointer paths (encode_batch_from_gpu_ptr_*) be treated as "internal / trusted input" paths where validation is skipped, while the host-data paths (encode, encode_batch) retain validation?
  3. Would a strict=False default (skip validation, fast path) with strict=True opt-in (current behavior) be acceptable?
