perf(qdp): remove mid-stream D2H norm validation from batch GPU-pointer encoding paths

## Summary

The batch amplitude encoding paths (`encode_batch_from_gpu_ptr_f32`, `encode_batch_from_gpu_ptr`) perform a blocking CPU round-trip for norm validation between the norm kernel and the encode kernel. This forces a host-side stall between two kernel launches that could otherwise chain on the same stream, and it is the primary reason Mahout's encode-only throughput falls behind PyTorch's reference implementation.

## Current pipeline (batch path)

```
launch_l2_norm_batch_f32   (GPU, queued)
↓
cudaStreamSynchronize      ← STALL #1
dtoh_sync_copy(norms)      ← PCIe transfer: N floats, device to host
CPU: check norms[i] for zero/NaN
↓
launch_amplitude_encode_batch_f32  (GPU, queued)
↓
cudaStreamSynchronize      ← STALL #2
```

That is two syncs plus one D2H copy per batch, even when every norm is valid, which is almost always the case.

## Benchmark evidence

Running `benchmark_pytorch_ref.py --qubits 16 --batches 200 --batch-size 64` (encode-only mode — both frameworks start with data already on GPU):

| Framework | Throughput |
|---|---|
| PyTorch GPU | 228,825 vec/s |
| Mahout | 66,615 vec/s |
| Ratio (Mahout / PyTorch) | **0.29x** (Mahout 3.4x slower) |

PyTorch's `amplitude_encode` uses `torch.linalg.vector_norm` followed by `data / norms.clamp(min=1e-10)`, so everything stays on the GPU and there is no CPU validation.
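For readers who want to see what the clamp-based approach implies for bad inputs, here is a minimal CPU model in plain Python. It is a sketch, not the PyTorch reference code: `amplitude_encode_ref` and `EPS` are names made up for illustration, and the only thing taken from the reference is the `1e-10` floor on the norm.

```python
import math

EPS = 1e-10  # same floor as the clamp(min=1e-10) quoted above


def amplitude_encode_ref(batch):
    """Normalize each row by max(L2 norm, EPS); never raises on a zero norm."""
    out = []
    for row in batch:
        norm = math.sqrt(sum(x * x for x in row))
        # Clamping instead of validating: a zero-norm row divides by EPS
        # and therefore comes out as all zeros, with no error raised.
        out.append([x / max(norm, EPS) for x in row])
    return out


# One well-formed row and one zero-norm row.
states = amplitude_encode_ref([[3.0, 4.0], [0.0, 0.0]])
```

The zero-norm row is encoded without any signal to the caller, which is the behaviour the batch path would inherit if validation were dropped.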

In end-to-end mode (data gen + H2D + encode), Mahout is **7.8x faster** than PyTorch because the full pipeline cost dominates. The gap only appears when isolating kernel work.

## Related prior work

The same pattern was already fixed for the **single-sample** `encode_from_gpu_ptr_f32` path in this branch: a new `launch_amplitude_encode_f32_device_norm` CUDA kernel reads `inv_norm` from device memory, letting the norm kernel and encode kernel chain on the same stream with a single sync at the end. That change yielded **1.22x** speedup on the single-sample path (39.7 → 32.6 µs/sample at 16 qubits).
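To make the chaining concrete, here is a toy Python model of an in-order stream. It is only an illustration under simplifying assumptions: `ToyStream`, `norm_kernel` and `encode_device_norm_kernel` are hypothetical stand-ins, not the CUDA runtime API or the QDP kernels, and a one-element list plays the role of the device-side `inv_norm` buffer.

```python
import math


class ToyStream:
    """In-order work queue standing in for a CUDA stream (toy model only)."""

    def __init__(self):
        self.queue = []
        self.sync_count = 0

    def launch(self, fn, *args):
        # Work is queued, not executed, until the host synchronizes.
        self.queue.append((fn, args))

    def synchronize(self):
        self.sync_count += 1
        for fn, args in self.queue:  # executes in launch order
            fn(*args)
        self.queue.clear()


def norm_kernel(data, inv_norm_d):
    # Writes 1 / ||data|| into the "device" buffer.
    inv_norm_d[0] = 1.0 / math.sqrt(sum(x * x for x in data))


def encode_device_norm_kernel(data, inv_norm_d, out):
    # Reads inv_norm from the "device" buffer; the host never touches it.
    out[:] = [x * inv_norm_d[0] for x in data]


stream = ToyStream()
data, inv_norm_d, state = [3.0, 4.0], [0.0], []
stream.launch(norm_kernel, data, inv_norm_d)
stream.launch(encode_device_norm_kernel, data, inv_norm_d, state)
stream.synchronize()  # the only host-side wait
```

Because the second kernel takes its scale factor from the buffer the first kernel wrote, stream ordering alone is enough to keep the result correct, and the host waits once.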

The batch path has the same structure and would benefit from the same treatment, but whether to remove the norm validation or make it optional is a policy question for the community.

## The trade-off

**Removing the D2H validation:**
- Pros: eliminates the mid-stream stall entirely and closes the gap with PyTorch; norms stay on device as `float* inv_norms_d`, both kernels chain on the same stream, and there is one sync at the end
- Cons: zero-norm or NaN inputs silently produce all-zero state vectors instead of returning an error; caller must validate upstream

**Making validation optional (e.g. `validate_norms: bool` flag, default `false`):**
- Pros: preserves the safety net for development/debug use; hot path is fast
- Cons: API complexity; easy to forget to enable for debugging
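As a rough picture of what the optional flag could look like from the caller's side, here is a CPU-only sketch in Python. Only the `validate_norms` name comes from the proposal above; `encode_batch_ref`, the `ValueError`, and the zero-inverse-norm handling on the fast path are assumptions made for illustration, not the QDP API.

```python
import math


def encode_batch_ref(batch, validate_norms=False):
    """CPU model of a batch encode with an opt-in norm check (hypothetical)."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in batch]
    if validate_norms:
        # Debug path: plays the role of the D2H copy plus CPU check
        # in the current pipeline.
        for i, n in enumerate(norms):
            if n == 0.0 or not math.isfinite(n):
                raise ValueError(f"sample {i}: invalid norm {n}")
    # Fast path: no check. Assumption: a zero norm maps to a zero inverse
    # norm, so a zero-norm row is encoded as all zeros.
    inv_norms = [1.0 / n if n > 0.0 else 0.0 for n in norms]
    return [[x * inv for x in row] for row, inv in zip(batch, inv_norms)]


# Default call takes the fast path; the zero row passes through silently.
fast = encode_batch_ref([[3.0, 4.0], [0.0, 0.0]])
```

Calling the same function with `validate_norms=True` on that input raises instead of returning, which is the safety net the flag would preserve for debugging.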

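**Keeping the status quo:**
- Pros: norm validation remains a correctness guarantee at the encoder boundary; useful during development when data pipelines are untested
- Cons: keeps the two syncs and the D2H copy per batch, along with the encode-only throughput gap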

## Questions for the community

1. Is norm validation at the encoder level a hard requirement, or should it be the caller's responsibility?
2. Should the batch GPU-pointer paths (`encode_batch_from_gpu_ptr_*`) be treated as "internal / trusted input" paths where validation is skipped, while the host-data paths (`encode`, `encode_batch`) retain validation?
3. Would a `strict=False` default (skip validation, fast path) with `strict=True` opt-in (current behavior) be acceptable?

## Related issues / PRs

- PR #1310 — `encode_from_gpu_ptr_f32` single-sample trait dispatch
- PR #1283 — configurable CUDA kernel build targets
- PR #1275 — F32 support for angle/basis encoders
