Summary
The batch amplitude encoding paths (encode_batch_from_gpu_ptr_f32, encode_batch_from_gpu_ptr) perform a blocking CPU round-trip for norm validation between the norm kernel and the encode kernel. This serializes two GPU kernel launches unnecessarily and is the primary reason Mahout's encode-only throughput falls behind PyTorch's reference implementation.
Current pipeline (batch path)
launch_l2_norm_batch_f32 (GPU, queued)
↓
cudaStreamSynchronize ← STALL #1
dtoh_sync_copy(norms) ← PCIe transfer: N floats, device → host
CPU: check norms[i] for zero/NaN
↓
launch_amplitude_encode_batch_f32 (GPU, queued)
↓
cudaStreamSynchronize ← STALL #2
Two syncs + one D2H copy per batch, even when all norms are valid (which is almost always).
Benchmark evidence
Running benchmark_pytorch_ref.py --qubits 16 --batches 200 --batch-size 64 (encode-only mode — both frameworks start with data already on GPU):
| Framework | Throughput |
|---|---|
| PyTorch GPU | 228,825 vec/s |
| Mahout | 66,615 vec/s |
| Ratio | 0.3x (Mahout 3.4x slower) |
PyTorch's amplitude_encode uses torch.linalg.vector_norm + data / norms.clamp(min=1e-10) — everything stays on GPU, no CPU validation.
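To make the behavior of that clamp-based normalization concrete, here is a minimal pure-Python model of the same arithmetic (L2 norm, clamp to 1e-10, divide). It is an illustrative sketch, not the PyTorch code: `clamp_normalize` is a hypothetical helper, and plain lists stand in for GPU tensors.

```python
import math

def clamp_normalize(batch, eps=1e-10):
    """Divide each row by max(L2 norm, eps); there is no validation or error path."""
    out = []
    for row in batch:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / max(norm, eps) for x in row])
    return out

batch = [[3.0, 4.0], [0.0, 0.0]]
encoded = clamp_normalize(batch)
print(encoded[0])  # unit-norm row
print(encoded[1])  # zero-norm row stays all zeros, no error raised
```

A zero-norm row passes through as all zeros rather than raising an error, which is the price of keeping everything on the device.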
In end-to-end mode (data gen + H2D + encode), Mahout is 7.8x faster than PyTorch because the full pipeline cost dominates. The gap only appears when isolating kernel work.
Related prior work
The same pattern was already fixed for the single-sample encode_from_gpu_ptr_f32 path in this branch: a new launch_amplitude_encode_f32_device_norm CUDA kernel reads inv_norm from device memory, letting the norm kernel and encode kernel chain on the same stream with a single sync at the end. That change yielded 1.22x speedup on the single-sample path (39.7 → 32.6 µs/sample at 16 qubits).
The batch path has the same structure and would benefit from the same treatment, but whether to remove the norm validation or make it optional is a policy question for the community.
The trade-off
Removing the D2H validation:
- Pros: eliminates the mid-stream stall entirely; closes the gap with PyTorch; norms stay on device as float* inv_norms_d, both kernels chain on the same stream, one sync at the end
- Cons: zero-norm or NaN inputs silently produce all-zero state vectors instead of returning an error; caller must validate upstream
Making validation optional (e.g. validate_norms: bool flag, default false):
- Pros: preserves the safety net for development/debug use; hot path is fast
- Cons: API complexity; easy to forget to enable for debugging
Keeping status quo:
- Pros: norm validation remains a correctness guarantee at the encoder boundary; useful during development when data pipelines are untested
- Cons: the mid-stream stall and the encode-only throughput gap remain
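The optional-validation variant can be sketched as a host-side model. Everything here is hypothetical: `encode_batch_model` is not a Mahout function, the `validate_norms` parameter name simply follows the example above, and Python lists stand in for device buffers. In the real pipeline, the validation branch corresponds to the D2H copy and CPU check; skipping it is what would let the two kernels chain on one stream.

```python
import math

def encode_batch_model(batch, validate_norms=False):
    """Host-side model of the batch path with optional norm validation."""
    # Stands in for the norm kernel.
    norms = [math.sqrt(sum(x * x for x in row)) for row in batch]

    if validate_norms:
        # Models the blocking round-trip: norms are inspected on the CPU.
        for i, n in enumerate(norms):
            if n == 0.0 or math.isnan(n):
                raise ValueError(f"sample {i}: invalid norm {n}")

    # Stands in for the encode kernel. Mapping zero-norm rows to zeros
    # mirrors the all-zero output described in this issue; how the real
    # kernel derives inv_norm is an assumption of this sketch.
    inv_norms = [1.0 / n if n > 0.0 else 0.0 for n in norms]
    return [[x * inv for x in row] for row, inv in zip(batch, inv_norms)]

good = encode_batch_model([[3.0, 4.0]])
silent = encode_batch_model([[0.0, 0.0]])  # fast path: zeros, no error
try:
    encode_batch_model([[0.0, 0.0]], validate_norms=True)
except ValueError as err:
    print(err)  # strict path reports the bad sample
```

With the flag off, a bad sample is indistinguishable from a valid all-zero output unless the caller checks upstream, which is the crux of the default-value question.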
Questions for the community
- Is norm validation at the encoder level a hard requirement, or should it be the caller's responsibility?
- Should the batch GPU-pointer paths (encode_batch_from_gpu_ptr_*) be treated as "internal / trusted input" paths where validation is skipped, while the host-data paths (encode, encode_batch) retain validation?
- Would a strict=False default (skip validation, fast path) with strict=True opt-in (current behavior) be acceptable?
Related issues / PRs
- encode_from_gpu_ptr_f32 single-sample trait dispatch