From dd866dee5e70f8ad411359290d1d9a236c63702f Mon Sep 17 00:00:00 2001 From: Nicholas Gates Date: Thu, 16 Apr 2026 12:17:43 -0400 Subject: [PATCH] RFC 0033: block-turboquant Co-Authored-By: Claude Opus 4.6 (1M context) --- rfcs/0033-block-turboquant.md | 1438 +++++++++++++++++++++++++++++++++ 1 file changed, 1438 insertions(+) create mode 100644 rfcs/0033-block-turboquant.md diff --git a/rfcs/0033-block-turboquant.md b/rfcs/0033-block-turboquant.md new file mode 100644 index 0000000..44a5bb7 --- /dev/null +++ b/rfcs/0033-block-turboquant.md @@ -0,0 +1,1438 @@ +# Block-Decomposed TurboQuant with PDX Layout + +**Authors:** @lwwmanning, @connortsui20 +**Status:** Proposal +**Date:** 2026-04-02 (updated 2026-04-06) + +## Summary + +We propose evolving the [TurboQuant vector quantization encoding][current-impl] +in stages: + +1. **MSE-only TurboQuant** (in progress — [PR #7269][current-impl]): a complete, + self-contained building block. 8-bit default, internal zero-padding for + non-power-of-2 dimensions, `FixedSizeListArray` rotation signs supporting + variable SORF rounds. +2. **Block decomposition**: for dimensions where a valid B exists + (greatest power-of-2 ≥ 64 dividing d), split into blocks of size B. For + power-of-2 dimensions, B = d (single block). Dimensions with no qualifying + B fall back to internal zero-padding to power-of-2. Per-block norms stored as internal + children. +3. **PDX layout** (later): transpose codes into dimension-major order within + groups of 64 vectors for SIMD scan performance. + +QJL correction is deferred to a later stage and may ultimately be dropped. +Community findings from multiple independent TurboQuant implementations +often show that MSE-only outperforms MSE+QJL for KV-cache attention [8]. +For ANN ranking and vector-search workloads, the evidence is currently less +complete, so QJL should remain an empirical question rather than a settled +conclusion. 
+ +[current-impl]: https://github.com/spiraldb/vortex/pull/7269 +[original-impl]: https://github.com/spiraldb/vortex/pull/7167 + +## Background + +### TurboQuant + +TurboQuant [1] is a lossy vector quantization algorithm for high-dimensional +embeddings. It works by: + +1. Randomly rotating a unit-norm vector so that each coordinate follows a known + marginal distribution — specifically `(1 - x²)^((d-3)/2)` on [-1, 1], a + concentrated Beta distribution (Lemma 1 in [1]). +2. Applying an MSE-optimal scalar quantizer (Max-Lloyd centroids) independently + to each coordinate. +3. Optionally adding a 1-bit QJL (Quantized Johnson-Lindenstrauss) correction + on the residual for unbiased inner product estimation (Theorem 2 in [1]). + +The paper prescribes a full random orthogonal rotation (QR decomposition of a +matrix with i.i.d. N(0,1) entries, yielding a Haar-uniform orthogonal matrix) +for the MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the +paper uses a random Gaussian projection matrix S with i.i.d. N(0,1) entries (not +an orthogonal rotation); this distinction matters for the unbiasedness proof. + +**Comparison to Product Quantization.** TurboQuant's block decomposition (Stage +2 of this RFC) is structurally similar to Product Quantization (PQ) [9]: both +partition a vector into sub-vectors and quantize each independently. 
The key
differences are:

| | TurboQuant | PQ |
| ---------------------- | --------------------------------------------------------------- | -------------------------------------------------------- |
| Quantization type | Scalar (per-coordinate, after rotation) | Vector (per-sub-vector, learned codebook) |
| Codebook | Analytically derived from Beta distribution; **data-oblivious** | Learned via k-means on training data; **data-dependent** |
| Rotation | Random orthogonal within each sub-vector | Typically none (OPQ [10] adds a learned rotation) |
| Theoretical guarantees | Provable data-oblivious MSE bound (Theorem 1 [1]) | No comparable data-oblivious bound |
| Codebook training | None (centroids derived from theory) | Requires training pass over data |
| Bits per sub-vector | Scalar: b bits per coordinate | Vector: typically 8 bits per sub-vector (256 codewords) |

TurboQuant trades PQ's flexibility (data-dependent codebooks can exploit
structure) for data-obliviousness: no training, provable bounds, and no offline
index-training phase, though encode-time work (rotation + quantization) still
applies. Conversely, PQ and OPQ retain a major advantage in expressivity: they
learn sub-vector codebooks from data rather than applying an analytic scalar
quantizer. In practice, TurboQuant is attractive when training-free operation,
simple deployment, and theoretical guarantees matter most, while PQ or OPQ may
still win empirically when a learned vector codebook can exploit
dataset-specific structure.
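The marginal claim in step 1 above can be sanity-checked numerically: by
symmetry, each coordinate of a uniformly random unit vector has mean 0 and
variance exactly 1/d, so the marginal concentrates as d grows. The following is
an illustrative pure-Python sketch (the helper name is ours; the Vortex
implementation is in Rust):

```python
import math
import random

def random_unit_vector(d, rng):
    # Uniform sample from the unit sphere: normalize i.i.d. N(0,1) draws.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

# Each coordinate follows the marginal (1 - x^2)^((d-3)/2) on [-1, 1];
# by symmetry its mean is 0 and its variance is exactly 1/d.
rng = random.Random(7)
d, n = 64, 5000
coords = [random_unit_vector(d, rng)[0] for _ in range(n)]
mean = sum(coords) / n
var = sum(x * x for x in coords) / n
```

At d = 64 the empirical variance lands near 1/64 ≈ 0.0156, consistent with the
concentration argument used throughout this RFC.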
+ +### Comparison to HIGGS + +HIGGS [12] (Malinovskii et al., 2024) is a data-free quantization method for LLM +weight matrices that shares TurboQuant's core idea — Hadamard rotation followed by +MSE-optimal grid quantization — but targets a different application domain and makes +different design trade-offs: + +| | TurboQuant | HIGGS | +| -------------------- | ------------------------------------------------------------------------ | -------------------------------------------------------------------------- | +| Application domain | ANN embedding search (per-vector, online) | LLM weight quantization (per-layer, offline) | +| Rotation | 3-round SORF (HD₃·HD₂·HD₁): high-quality random orthogonal approximation | Single RHT (H·D): one Hadamard × random diagonal signs | +| Target distribution | Beta marginal (1-x²)^((d-3)/2) on unit sphere | Approximate Gaussian N(0,1) | +| Quantization grid | Max-Lloyd centroids (scalar, p=1), analytically derived for Beta | CLVQ grids (Pagès & Printems 2003), supports vector quantization p∈{1,2,4} | +| Error metric | Pure MSE (reconstruction error) | MSE + Hessian-weighted per-layer coefficients αₗ (Linearity Theorem) | +| Calibration data | None | None for quantization; small calibration set for αₗ estimation | +| Non-uniform bitwidth | No (uniform across all vectors) | Yes (DP solver for per-layer bit allocation) | +| Distance computation | Quantized-domain scan kernel (PDX layout, SIMD over 64 vectors) | GPU matrix multiply (FLUTE kernel) | +| Norm storage | Explicit per-block norms for distance computation | Per-group scales folded into weight reconstruction | + +**Key design differences explained:** + +- **Rotation depth.** TurboQuant normalizes to the unit sphere first, so + coordinates must follow the specific Beta marginal for Max-Lloyd centroids to + be optimal — this requires a high-quality random orthogonal approximation + (3-round SORF). 
HIGGS operates on raw (group-normalized) weights and only + needs approximate Gaussianity, so a single RHT suffices. +- **VQ dimension.** HIGGS's CLVQ grids support multi-dimensional vector + quantization (p>1), where groups of p coordinates are quantized jointly to an + optimal multi-dimensional grid. At 3-4 bits, p=2 or p=4 achieves better + rate-distortion than scalar (p=1) quantization by exploiting residual + correlations between coordinates. TurboQuant is currently scalar-only (p=1); + p>1 would require changes to the PDX scan kernel (per-subvector codebook + lookup instead of per-coordinate). See Future work for discussion. +- **Error metric.** HIGGS's Linearity Theorem (perplexity increase ≈ Σ αₗ·tₗ²) + enables Hessian-aware optimization specific to LLM inference. For ANN search, + MSE is the natural metric — it directly bounds distance distortion — and + non-uniform bit allocation has no analogue (all vectors share the same + encoding). +- **Beta vs. Gaussian at high d.** As d grows, the Beta distribution + (1-x²)^((d-3)/2) concentrates and becomes approximately Gaussian with + variance ~1/d. At d=256+, the practical difference between Beta-optimal and + Gaussian-optimal grids shrinks. Whether Gaussian grids (simpler: one grid per + bitwidth, no dimension dependence) match Beta Max-Lloyd for ANN recall is an + empirical question — see Experimental plan. + +**Domain mismatch.** Comparisons of TurboQuant vs. HIGGS on LLM perplexity +benchmarks are misleading: HIGGS's Hessian-aware optimization naturally dominates +for that task, but TurboQuant was never designed for LLM weight quantization. The +relevant comparison is ANN recall@k on embedding datasets, where TurboQuant's +block decomposition, PDX scan layout, and per-vector encode/decode are the +critical features. 
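The HD building block that SORF and RHT share is compact: a random ±1 diagonal
followed by a normalized fast Walsh–Hadamard transform. Below is a minimal
pure-Python sketch (function names are ours; the dimension must be a power of
2). One round corresponds to HIGGS's RHT, three rounds to TurboQuant's SORF:

```python
import math
import random

def hd_round(x, signs):
    # One HD round: random +-1 diagonal (D), then a normalized in-place
    # fast Walsh-Hadamard transform (H). Both factors are orthogonal, so
    # the round preserves L2 norms.
    y = [xi * s for xi, s in zip(x, signs)]
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def sorf(x, rounds_signs):
    # R-round SORF HD_R . ... . HD_1; each round has its own sign diagonal.
    for signs in rounds_signs:
        x = hd_round(x, signs)
    return x

# Demo: 3 rounds at d=128 (the stored signs would be 3 x 128 bits).
rng = random.Random(1)
n = 128
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
signs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(3)]
y = sorf(x, signs)
```

Because the normalized H is its own inverse and D² = I, each round inverts as
D·H with the same signs, so decode replays the stored signs in reverse round
order.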
+ +### Comparison to RotorQuant / IsoQuant + +RotorQuant [13] replaces TurboQuant's full-dimension SORF with Clifford algebra +rotors in Cl(3,0), chunking vectors into 3-dimensional groups and applying SO(3) +sandwich products. IsoQuant extends this to SO(4) via quaternions, and PlanarQuant +uses SO(2) Givens rotations. All three are block-diagonal rotation strategies with +very small blocks (2-4 dimensions). + +On real KV-cache tensors (Qwen2.5-3B), these small-block rotations showed severe +quality regressions: RotorQuant at 3-bit measured 3.843 MSE vs. TurboQuant's +0.354 (10.8× worse), and IsoQuant at 4-bit incurred +36% perplexity impact vs. +TurboQuant's +11.7% [13]. Independent analysis attributed this to the fundamental +decorrelation limitation: block-diagonal rotations in SO(2)/SO(3)/SO(4) provide +no cross-group coordinate mixing, while WHT/SORF mixes all coordinates +simultaneously. Real embedding vectors exhibit full-dimension correlations that +small-block rotations cannot break. + +| | TurboQuant (SORF) | RotorQuant (SO(3)) | IsoQuant (SO(4)) | +| ---------------------- | --------------------------------------------- | -------------------------- | --------------------------- | +| Decorrelation | Full dimension (3-round SORF, all coords mix) | Block-diagonal (3D groups) | Block-diagonal (4D groups) | +| Params (d=128) | 384 sign bits (3 × 128) | 186 rotor params | ~500 quaternion params | +| MSE at 3-bit (Qwen KV) | 0.354 | 3.843 (10.8× worse) | Not reported at 3-bit | +| Speed vs. WHT | Baseline (896 FMAs at d=128) | 2,408 FMAs (2.7× slower) | ~3.6× slower (CUDA prefill) | + +**Relevance to our design.** RFC 0033's Stage 2 block decomposition is also +block-diagonal — each B-dim block has an independent SORF with no cross-block +mixing. The critical difference is block size: B=256 with 3-round SORF provides +24 butterfly stages of within-block mixing (comparable to the current B=1024's +30 stages), vs. 
RotorQuant's 3-4 coordinate groups with no structured mixing at +all. The RotorQuant/IsoQuant data provides empirical evidence that the quality +cliff for block-diagonal rotations is steep at very small B and validates the +RFC's minimum B ≥ 64 constraint. Whether B=256 is large enough to avoid +meaningful decorrelation loss is an empirical question addressed in the +Experimental plan. + +### Current Vortex implementation + +The [current implementation][current-impl] (Rust, in the `vortex-tensor` crate, +merged via [PR #7269][current-impl]) implements MSE-only TurboQuant as a Vortex +array encoding that compresses `FixedSizeList` arrays — the storage +format of `Vector` and `FixedShapeTensor` extension types. The +[original QJL-inclusive PR][original-impl] was closed in favor of this MSE-only +approach. Key design choices and characteristics: + +**Rotation.** Instead of the paper's O(d²) QR rotation, we use a 3-round +Structured Orthogonal Random Features (SORF) transform `HD₃·HD₂·HD₁` [5], +giving O(d) storage (3d sign bits, bitpacked) and O(d log d) per-vector. The +rotation signs are stored as a bitpacked child array rather than recomputed from +a seed at decode time. The 3-round SORF was introduced for kernel approximation +[5] and approximates a random orthogonal matrix. It is distinct from the +single-round SRHT (`R·H·D`) analyzed by Tropp [3] and the FJLT (`P·H·D`) of +Ailon-Chazelle [2], both of which are dimensionality-reducing projections rather +than rotation approximations. + +**Centroids.** Max-Lloyd centroids are computed via numerical integration +(trapezoid rule, 1000 points per interval) of the marginal Beta distribution at +the padded dimension, using the `HalfIntExponent` type for exact integer/half- +integer exponent arithmetic. Centroids are cached in a global `DashMap` keyed by +`(dimension, bit_width)` and stored as a shared `PrimitiveArray` child. 
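The centroid construction reduces to Lloyd iteration against that marginal. The
sketch below uses a plain trapezoid rule (1000 points per cell, as above) and is
illustrative only — the crate's exact-exponent arithmetic and caching are
omitted, and the function names are ours. At d = 3 the exponent (d−3)/2 is zero,
so the marginal is uniform on [−1, 1] and the fixed point is the uniform
quantizer, which makes a convenient correctness check:

```python
import math

def beta_density(x, d):
    # Marginal density of one coordinate of a random unit vector in d
    # dimensions, up to normalization: (1 - x^2)^((d-3)/2) on [-1, 1].
    t = max(0.0, 1.0 - x * x)  # clamp float rounding at the endpoints
    return t ** ((d - 3) / 2.0)

def cell_mean(a, b, d, n=1000):
    # Conditional mean of the marginal over [a, b], trapezoid rule.
    xs = [a + (b - a) * i / n for i in range(n + 1)]
    ws = [0.5 if i in (0, n) else 1.0 for i in range(n + 1)]
    ps = [w * beta_density(x, d) for w, x in zip(ws, xs)]
    den = sum(ps)
    if den == 0.0:
        return 0.5 * (a + b)  # cell carries no mass (underflow)
    return sum(p * x for p, x in zip(ps, xs)) / den

def max_lloyd(d, bits, iters=50):
    # Max-Lloyd: boundaries at centroid midpoints, centroids at cell means.
    k = 1 << bits
    c = [-1.0 + 2.0 * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        bounds = [-1.0] + [0.5 * (c[i] + c[i + 1]) for i in range(k - 1)] + [1.0]
        c = [cell_mean(bounds[i], bounds[i + 1], d) for i in range(k)]
    return c

c_uniform = max_lloyd(3, 2)   # uniform marginal -> uniform quantizer
c64 = max_lloyd(64, 3)        # concentrated marginal -> centroids near 0
```

At d = 64 the centroids are symmetric about zero and concentrate well inside
[−1, 1], reflecting the ~1/d coordinate variance after rotation.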
+ +**Array structure.** The `TurboQuantArray` stores 4 child slots: codes +(`FixedSizeListArray`, one per vector, list_size = padded_dim), norms +(`PrimitiveArray`), centroids (`PrimitiveArray`, shared), and MSE +rotation signs (`PrimitiveArray`, shared, bitpacked). Codes are stored as +u8 centroid indices; the cascade compressor (BitPacked encoding) handles packing +to the actual bit width on disk. + +**Compute pushdowns.** Slice and take propagate to per-row children (codes, +norms) while sharing rotation signs and centroids. Quantized cosine similarity +and dot product operate directly on codes and centroids without decompression. +L2 norm returns the stored norm directly (O(1) readthrough). + +**Compression scheme.** `TurboQuantScheme` implements the `Scheme` trait for the +BtrBlocks cascading compressor. It matches `Vector` and `FixedShapeTensor` +extension arrays with non-nullable float elements and dimension ≥ 128, +using 8-bit MSE-only as the default (256 centroids, near-lossless with +normalized MSE ~4e-5, achieving ~4× compression on f32). + +**Input handling.** All float types (f16, f32, f64) are converted to f32 before +quantization. Per-vector L2 norms are computed and stored as f32. Non-power-of-2 +dimensions are zero-padded to the next power of 2 for SORF compatibility. The +minimum dimension for scheme auto-selection is 128; the array-level minimum +remains 3 (at d=2 the marginal is the arcsine distribution, which is U-shaped +and unsuitable for Max-Lloyd centroids designed for concentrated distributions). + +**Metadata.** Currently serialized as a raw single byte (bit_width). This lacks +framing and versioning and cannot be extended backward-compatibly; migrating to +a structured/extensible format is a Stage 1 item (the upcoming vtable refactor +may eliminate the need for separate serialized metadata entirely). 
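The quantized dot-product readthrough can be illustrated in a few lines. This
sketch is not the crate's API: the SORF rotation is omitted (it is orthogonal,
so it preserves the inner products being estimated), and a uniform 8-bit grid
stands in for the Max-Lloyd centroids:

```python
import math
import random

def encode(vec, centroids):
    # Store the L2 norm plus a nearest-centroid index per coordinate of the
    # unit direction. (The SORF rotation is omitted in this sketch; being
    # orthogonal, it does not change the inner products estimated below.)
    norm = math.sqrt(sum(x * x for x in vec))
    unit = [x / norm for x in vec]
    codes = [min(range(len(centroids)), key=lambda i: abs(centroids[i] - u))
             for u in unit]
    return codes, norm

def quantized_dot(codes_a, norm_a, codes_b, norm_b, centroids):
    # <a, b> estimated directly from codes, stored norms, and the shared
    # centroid table -- no decompression of the original floats.
    unit_dot = sum(centroids[ca] * centroids[cb]
                   for ca, cb in zip(codes_a, codes_b))
    return norm_a * norm_b * unit_dot

# Demo: 8-bit uniform grid standing in for Max-Lloyd centroids.
rng = random.Random(42)
centroids = [-1.0 + 2.0 * (i + 0.5) / 256 for i in range(256)]
a = [rng.gauss(0.0, 1.0) for _ in range(256)]
b = [rng.gauss(0.0, 1.0) for _ in range(256)]
ca, na = encode(a, centroids)
cb, nb = encode(b, centroids)
approx = quantized_dot(ca, na, cb, nb, centroids)
exact = sum(x * y for x, y in zip(a, b))
```

Only the codes, the two stored norms, and the shared centroid table are
touched, which is what makes the pushdown cheap.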
+ +The Eviox corrections study [7] identified several bugs in the paper's reference +Python implementation; none affect our implementation (see Appendix A). There is +also a notational ambiguity in the MSE bound constant; we use `√3·π/2 ≈ 2.72` +(see Appendix A for the full analysis). + +Multiple independent TurboQuant implementations report that MSE-only often +outperforms MSE+QJL for KV-cache attention at the same bit budget [8], likely +due to softmax amplifying QJL variance. For ANN ranking the evidence is less +settled; MSE-only is the default pending dedicated benchmarks (see Appendix B +for details). + +### Current limitations + +The SORF requires power-of-2 input dimension. The TQ array handles this by +zero-padding non-power-of-2 dimensions to the next power of 2 internally +(e.g., 768 → 1024). For non-power-of-2 dimensions, this means: + +- **33% storage overhead** for 768-d vectors: 1024 codes stored vs. 768 useful + (equivalently, 25% of stored codes are wasted on zero-padded dimensions). +- **No scan-optimized layout**: row-major code storage prevents SIMD-over-vectors + distance computation. + +Stage 2's block decomposition eliminates this padding for dimensions with a +qualifying B (e.g., 768 → 3×256 blocks), since each block is natively +power-of-2. + +### PDX + +PDX [4] is a data layout for vector similarity search. The paper (SIGMOD '25) +describes a dimension-major layout within fixed-size blocks of 64 vectors, +enabling the compiler to auto-vectorize the inner distance loop over vectors +rather than dimensions. The paper reports an average 2× speedup for +auto-vectorized PDX distance kernels vs. explicitly SIMD-optimized row-major +baselines (SimSIMD, FAISS) across four architectures, with larger gains at low +dimensionality (5.5× at D ≤ 32) and ~1.5× at D > 32 [4, Table 4]. +Dimension-pruning methods (ADSampling, BSA) recover much larger end-to-end +gains (2-7×) when paired with the PDX layout [4]. 
The block size of 64 is +empirically optimal across AVX-512, AVX2, and NEON architectures [4, Table 5]. + +**PDX open-source implementation.** The [open-source implementation][pdx-impl] +has evolved beyond the paper in several ways relevant to this RFC. _Note: the +following describes the code repository, not the paper — the paper operates +exclusively on float32 and does not discuss int8 layouts._ + +- **8-bit scalar quantization** (`IndexPDXIVFTreeSQ8`): Maps floats to 0-255 via + linear min-max scaling. The int8 layout differs from float32: dimensions are + packed in groups of 4 ("4 dims × 16 vecs") to leverage hardware dot-product + instructions (VPDPBUSD on x86, UDOT/SDOT on ARM) that process 4 byte pairs + per operation. This is a different tiling than the paper's "1 dim × 64 vecs." +- **ADSampling with random rotation**: The pruner applies a random orthogonal + rotation to the entire collection as a preprocessing step. This makes + coordinates approximately independent, enabling dimension-by-dimension + hypothesis testing for early pruning. The rotation serves a similar purpose + to TurboQuant's rotation — making the coordinate distribution known — but for + pruning rather than quantization. +- **Dimension zones**: Consecutive dimensions are grouped into zones; at query + time, zones are ranked by "distance-to-means" and the most discriminative + zones are scanned first, enabling faster pruning (~30% faster than + per-dimension pruning [4]). + +**Implications for our design.** The PDX paper's float32 layout ("1 dim × 64 +vecs") maps cleanly to our quantized-code scan kernel, where the inner loop +gathers from a centroid-product distance table over 64 vectors. However, if we +pursue direct int8 arithmetic (b_mse=8 with linear centroids, see GPU section), +the "4 dims × 16 vecs" int8 layout from the PDX implementation may be more +appropriate, as it enables hardware dot-product instructions. 
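The paper's "1 dim × 64 vecs" tiling is a blocked transpose of the row-major
codes. An illustrative sketch (the function name is ours; real kernels operate
on contiguous buffers, and the int8 variant tiles "4 dims × 16 vecs" instead):

```python
def to_pdx(codes, block=64):
    # Blocked transpose into dimension-major order: within each group of
    # `block` vectors, all codes for dimension 0 come first, then
    # dimension 1, and so on, so the inner distance loop runs over 64
    # contiguous vectors per dimension.
    out = []
    for g in range(0, len(codes), block):
        group = codes[g:g + block]
        for dim in range(len(codes[0])):
            out.extend(v[dim] for v in group)
    return out

# Demo: 128 vectors x 4 dims; each value encodes (vector, dim) as v*100+dim.
codes = [[v * 100 + dim for dim in range(4)] for v in range(128)]
flat = to_pdx(codes)
```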
+ +Additionally, ADSampling's dimension-pruning approach is complementary to +TurboQuant's block structure: when scanning with block decomposition, the pruner +could skip entire TQ blocks (B dimensions at a time) if the partial distance +already exceeds the candidate threshold. This combines the storage efficiency of +quantization with the computational savings of early termination. + +[pdx-impl]: https://github.com/cwida/PDX "specific files: `include/pdx/quantizers/scalar.hpp` for SQ8, `include/pdx/pruners/adsampling.hpp` for ADSampling, `include/pdx/layout.hpp` for int8 interleaving, `include/pdx/distance_computers/avx512_computers.hpp` for VPDPBUSD kernels" + +## Proposal + +### Block size strategy + +For each dimension d, choose B = the greatest power-of-2 ≥ 64 that evenly +divides d. If no such B exists (e.g., d=96), the TQ array falls back to +internal zero-padding (single padded block, as in Stage 1). For common embedding +dimensions, this rule always produces a valid B and avoids padding entirely: + +| Dimension d | Block size B | Blocks k | Notes | +| ----------- | ------------ | -------- | ---------------------------- | +| 512 | 512 | 1 | Single block (= current TQ) | +| 768 | 256 | 3 | Greatest dividing power-of-2 | +| 1024 | 1024 | 1 | Single block | +| 1536 | 512 | 3 | | +| 2048 | 2048 | 1 | Single block | +| 3072 | 1024 | 3 | | +| 4096 | 4096 | 1 | Single block | + +**Key observations:** + +- **Power-of-2 dimensions** (512, 1024, 2048, 4096) use B = d — a single block, + identical to the current implementation except with PDX underneath (Stage 3). + No block decomposition overhead, no per-block norms. These dimensions are + already well-served by the current design. +- **Non-power-of-2 dimensions** (768, 1536, 3072) decompose into k=3 blocks at + B=256 or B=512. No padding waste. + Each block has its own SORF rotation and shares a single centroid set. +- **No qualifying B is rare** for common embedding dimensions. 
Dimensions where + no power-of-2 ≥ 64 divides d (e.g., 96, 100) fall back to internal + zero-padding. A future straggler-block extension could handle these + without padding (see Stage 2: Straggler blocks). These dimensions are uncommon + in modern model architectures. +- **The SORF approximation at B=256+ is expected to be adequate**: 3 rounds at + B=256 provides 24 butterfly stages, and at B=512 provides 27 — both comparable + to the current B=1024 (30 stages). This needs empirical validation; see + Experimental plan. + +### Minimum dimension + +The compression scheme should only select TurboQuant for vectors with +dimension ≥ 128. Below this threshold, several factors degrade quality and +efficiency: + +- **SORF mixing quality:** 3-round SORF at d=64 provides only 18 butterfly + stages (vs. 21 at d=128, 30 at d=1024). The coordinate distribution deviates + more from the analytical Beta, making Max-Lloyd centroids less optimal. + Stage 1's variable-round rotation signs (see Stage 1) may allow compensating with + additional SORF rounds at lower dimensions — this should be benchmarked. +- **Practical MSE:** At smaller d, the SORF mixing quality and coordinate- + independence approximations are weaker, potentially worsening practical + quantization quality beyond what the dimension-free theoretical bound + captures. The actual MSE at each d is an empirical question. +- **Overhead ratio:** Per-vector norm (32 bits) is a larger fraction of the + compressed representation at small d. At d=32, b=5: codes=160 bits, + norm=32 bits, total=192 — norm is ~17% of compressed size. At d=768: <1%. +- **Diminishing returns for high bit widths:** With fewer coordinates, the + fine-grained centroid structure of high-b quantization has less to exploit. + +The threshold of 128 is conservative: + +- d=128 (SIFT) is the smallest dimension in our recommended benchmark table. +- SORF at d=128 has 21 butterfly stages — tested and adequate in the current + implementation. 
+- The block-size rule produces B=128 for d=128 (single block, no decomposition). + +Whether TQ works well at all below d=64 is an open question — SORF mixing +quality degrades rapidly at small dimensions, and the overhead ratio makes TQ +increasingly uncompetitive vs. simpler scalar quantization. The scheme minimum +of 128 is conservative; the experimental plan should determine the true +minimum (likely in the 64-128 range). Padding modest amounts (e.g., 96 → 128) +is probably acceptable; padding large fractions (e.g., 32 → 64) is not. + +The exact threshold should be validated experimentally — see Experimental plan. + +### Stage 1: MSE-only TurboQuant (in progress — [PR #7269][current-impl]) + +Stage 1 delivers MSE-only TurboQuant as a complete, self-contained building +block. The [initial implementation][current-impl] is merged; the +[original QJL-inclusive PR][original-impl] was closed in favor of this MSE-only +approach. Work remaining to complete Stage 1 is described below. + +The goal is to arrive at a wire format that we believe is ready for +backward-compatibility guarantees — one we would be comfortable freezing — without +formally committing to stability until confirmed by Stage 2 implementation and +benchmarking. + +**Target properties:** + +- **MSE-only, no QJL.** 4 child slots: codes, norms, centroids, rotation_signs. + QJL code can be resurrected from the [original PR][original-impl] if Phase 4 + is pursued. +- **8-bit default** (256 centroids). Near-lossless: normalized MSE ~4e-5, + ~4× compression on f32. Lower bit widths available via `TurboQuantConfig`. +- **Power-of-2 block size with internal padding.** The TQ array requires + `block_size` to be a power of 2. Non-power-of-2 dimensions are zero-padded + internally to the next power of 2 (e.g., 768 → 1024), so `codes.list_size` + (= `padded_dim`) may exceed `dimension`. 
Stage 2's block decomposition + eliminates this padding for dimensions with a qualifying B (e.g., 768 → + 3×256 blocks, each natively power-of-2). +- **Variable-round SORF rotation.** Rotation signs are stored as a + `FixedSizeListArray` where each element is a + `FixedSizeList(u8, padded_dim, NonNullable)` — one bitpacked diagonal per + SORF round. The array length R equals the number of rounds (default 3). This + makes the round count a property of the array shape rather than a hard-coded + constant. More rounds may improve mixing quality at lower dimensions or lower + bit widths (see Experimental plan: "Test 3, 4, 5 SORF rounds at each B"). + Signs are stored in inverse-friendly (read-optimized) order. +- **Scheme auto-selection** for dimension ≥ 128 (see Minimum dimension). + Smaller power-of-2 dimensions remain available via explicit construction. +- **Compute pushdowns**: slice/take/scalar_at, quantized cosine similarity and + dot product, compression scheme integration. +- **Dtype-matching norms**: f64 for f64 input, f32 for f32/f16. +- **Codes and centroids remain separate children.** The codes + (`FixedSizeListArray`) and centroids (`PrimitiveArray`) are + independent child slots. Operations that need a unified view (e.g., + `canonicalize`) can construct a `DictArray` from codes and centroids and + apply the inverse rotation to produce a canonical decoded form. + +**Forward-compatible metadata:** `dimension: u32`, `block_size: u32` (= +padded_dim in Stage 1), `num_blocks: u32` (always = 1 in Stage 1), +`num_rounds: u32` (= R, default 3). These fields are inert in Stage 1 but +enable Stage 2 decoders to read Stage 1 +files. The serialization format is TBD — the upcoming vtable refactor may make +the current raw-byte metadata unnecessary by encoding these fields directly in +the vtable. If the refactor does not land first, a structured format (e.g., +protobuf) is needed. (PDX is handled via the codes child type, not a metadata +flag — see Stage 3.) 
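The invariants tying these four fields together can be captured directly. The
sketch below is a hypothetical Python mirror of the metadata, illustrative
only — the actual serialization format is TBD, as noted above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TurboQuantMetadata:
    # Hypothetical mirror of the four forward-compatible fields.
    dimension: int    # logical vector dimension d
    block_size: int   # B (= padded_dim in Stage 1)
    num_blocks: int   # k (always 1 in Stage 1)
    num_rounds: int   # R SORF rounds (default 3)

    def validate(self):
        assert self.block_size & (self.block_size - 1) == 0, "B must be a power of 2"
        if self.num_blocks == 1:
            # Stage 1: a single, possibly padded block covers the dimension.
            assert self.block_size >= self.dimension
        else:
            # Stage 2: blocks tile the dimension exactly, no padding.
            assert self.num_blocks * self.block_size == self.dimension

TurboQuantMetadata(768, 1024, 1, 3).validate()  # Stage 1: padded single block
TurboQuantMetadata(768, 256, 3, 3).validate()   # Stage 2: 3 x 256 tiling
```

A Stage 2 decoder reading a Stage 1 file sees `num_blocks = 1` and
`block_size >= dimension`, i.e., the padded single-block case.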
+ +**Remaining work** (relative to the [initial implementation][current-impl]): + +- Restructure rotation signs from flat `PrimitiveArray` to + `FixedSizeListArray` (variable SORF rounds, as described above). +- Dtype-matching norms (currently always f32). +- Structured metadata (currently a raw single byte). +- Restrict `new_unchecked` visibility to `pub(crate)`. +- f64-to-f32 truncation in encode path: add comment or checked cast. +- CENTROID_CACHE: document intentional unbounded-ness. +- Note MSE bound caveat: Theorem 1 is proved for Haar matrices, not SORF. + +### Stage 2: Block decomposition + +Block decomposition splits a `FixedSizeListArray` vertically by dimension into +fixed-size blocks, each encoded independently. This is structurally analogous +to `ChunkedArray` (which splits horizontally by rows) — both are general-purpose +structural transforms over arrays, not specific to any particular encoding. Like +PDX (Stage 3), block decomposition is a layout concern that can wrap arbitrary +child encodings. + +In the initial implementation, block decomposition is embedded inside +`TurboQuantArray` — all blocks use TQ MSE-only encoding with independent SORF +rotations, and TQ-specific children (centroids, rotation signs) are stored +alongside the blocks. However, the _concept_ of block decomposition is +encoding-agnostic: a future refactor could extract it into a general-purpose +`BlockDecomposedFSLArray` that wraps k independently-encoded child arrays. This +matters for straggler-block support (see below), where the straggler may use a +different encoding than the main blocks. + +For dimensions where the block-size rule produces a valid B (see table above), +the scheme splits the input into k = d/B blocks of size B. Each block is a +power-of-2 TQ array with an independent B-dim SORF rotation. + +**Changes vs. 
Stage 1 (with TQ blocks):** + +| Aspect | Stage 1 | Stage 2 | +| --------------------- | ---------------------------------------- | ---------------------------------------------------------------------------- | +| Block count | k = 1 (single power-of-2 block) | **k = d/B** (multiple blocks) | +| SORF dimension | padded_dim (next power-of-2 ≥ dim) | **B** (e.g., 256 for d=768) | +| Rotation signs | `FSL`, len = R, element dim = padded_dim | **`FSL`, len = k × R**, element dim = B | +| Centroids | Computed for padded_dim distribution | **Computed for B-dim distribution** (different codebook!) | +| Norms child | `PrimitiveArray`, 1 per vector | **`PrimitiveArray` (k=1) or `FixedSizeListArray` (k>1)**, same dtype F | +| Codes list_size | padded_dim | **k × B** (= d) | +| Scheme compress() | Single SORF → quantize | **Choose B → split → per-block normalize/rotate/quantize** | +| Quantized dot product | Single sum over padded_dim centroids | **Per-block weighted sum** (Σ_k norm_a_k · norm_b_k · unit_dot_k) | +| L2 norm readthrough | O(1) — return stored norm | **O(k)** — compute √(Σ_k norm_k²) | + +**Unchanged from Stage 1:** SORF construction (R-round HD, default R=3), +Max-Lloyd algorithm, f32 internal quantization, slice/take semantics (per-row +data sliced, shared data cloned), `FixedSizeListArray` rotation sign storage, +compression scheme trait. + +**For power-of-2 dimensions**: B = d, k = 1. The encoding produces an identical +wire format to Stage 1 (single norm, single SORF, single codes block). A +Stage 2 encoder writing k=1 data is fully backward-compatible with Stage 1 +decoders. + +**Key design properties:** + +- **Structural, not encoding-specific.** The block decomposition itself is a + vertical split of a `FixedSizeListArray` by dimension. Each block is an + independently-encoded child. In the initial implementation all blocks are TQ + MSE-only, but the structure allows heterogeneous child encodings in future. 
+- **One shared centroid set** for all TQ blocks at the same B-dim distribution. +- **Per-block SORF rotation signs.** Each block's SORF is independent (different + seed). Signs are R × B bits per block (R = number of SORF rounds, default 3), + stored as a `FixedSizeListArray` with len = k × R. + +#### Straggler blocks (future work) + +The current block-size rule requires B to evenly divide d, so dimensions with no +qualifying power-of-2 B ≥ 64 (e.g., d=96) fall back to internal zero-padding +(single padded block, as in Stage 1). +A natural extension is **straggler blocks**: allow k blocks where k-1 are +full-size B and the final block covers the remaining d - (k-1)×B dimensions. + +Because the block decomposition is encoding-agnostic (each block is an +independently-encoded child array), the straggler block need not use the same +encoding as the main blocks. For example, d=800 could be decomposed as 3×256 += 768 TQ-encoded dimensions plus a 32-dimension straggler. SORF is unlikely +to be effective at such small straggler dimensions (see Minimum dimension), +so the straggler would use a different strategy: + +- **Uncompressed**: store the straggler dimensions as raw floats. Simplest; + the overhead is modest (32 × 4 = 128 bytes per vector for a 32-dim + straggler). +- **Padded TQ**: pad the straggler to the next power-of-2 (e.g., 32 → 64), + encode with standard TQ. Only viable if the padded dimension is large enough + for SORF to be effective (≥ 64, probably ≥ 128). +- **Exact-rotation TQ**: use a dense random orthogonal matrix (QR of Gaussian) + instead of SORF for the straggler block. Eliminates the power-of-2 constraint + at the cost of O(B_s²) rotation, where B_s is the straggler size. +- **Scalar quantization or PQ**: the block decomposition structure supports + heterogeneous child encodings. 
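Both the block-size rule and the straggler split reduce to a few lines of
arithmetic (an illustrative sketch; `choose_block_size` and `straggler_blocks`
are our names, and the policy for picking the straggler's base block size is
left open by this RFC):

```python
def choose_block_size(d, min_block=64):
    # Block-size rule: the greatest power of 2 >= min_block that evenly
    # divides d; None means no qualifying B exists, and the encoder falls
    # back to internal zero-padding as in Stage 1.
    best = None
    p = min_block
    while p <= d:
        if d % p == 0:
            best = p  # keep the largest qualifying power of 2
        p *= 2
    return best

def straggler_blocks(d, block):
    # Hypothetical straggler extension: full-size blocks plus one trailing
    # block covering the remainder (encoded by a different strategy).
    k, rem = divmod(d, block)
    return [block] * k + ([rem] if rem else [])
```

`choose_block_size(768)` returns 256 (the 3×256 decomposition in the table),
while `choose_block_size(96)` returns `None`, triggering the zero-padding
fallback; `straggler_blocks(800, 256)` yields the 3×256 + 32 layout from the
d=800 example.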
+ +Note that for some dimensions (e.g., d=800), padding the entire vector to the +next power-of-2 (1024) may be preferable to block decomposition with a +straggler, depending on the overhead tradeoff. This is an empirical question. + +This is deferred: the block-size rule already handles all common embedding +dimensions (768, 1024, 1536, etc.) without stragglers, and the rare +no-qualifying-B case (d=96) is adequately served by internal zero-padding for +now. + +#### Norm architecture + +Per-block norms are stored as an **internal child** of the TurboQuant array: + +- For k = 1 (power-of-2 dims): `PrimitiveArray` with len = num_rows + (identical to Stage 1's single-norm layout). +- For k > 1: `FixedSizeListArray` with list_size = k, len = num_rows. + +The norm dtype `F` matches or widens the input element type: + +| Input dtype | Norm dtype | Rationale | +| ----------- | ---------- | ---------------------------------------------- | +| f16 | f32 | f16 has insufficient range/precision for norms | +| f32 | f32 | Same type | +| f64 | f64 | Preserve full precision | + +Norms are stored as plain child arrays; the cascading compressor handles +secondary encoding (ALP, Pco, etc.). + +Note: centroids and quantization always operate in f32 internally (the +[current implementation][current-impl] converts all input to f32 before +quantization). For f64 input, decode produces f32 unit-direction reconstructions +scaled by f64 norms — a mixed-precision multiply that preserves norm precision. + +#### Zero-norm sub-vectors + +When splitting a vector into B-dim blocks, some blocks may have zero norm. The +encoding handles ‖xₖ‖ = 0 explicitly: skip rotation and quantization, store +norm = 0, decode as all zeros. 
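The per-block norm layout, the zero-norm rule, and the O(k) norm readthrough
from the Stage 2 table can be sketched together (illustrative Python; the
function names are ours):

```python
import math

def block_norms(vec, block):
    # Per-block L2 norms for a vector split into `block`-dim sub-vectors.
    # A zero-norm block is stored as norm = 0: rotation and quantization
    # are skipped, and decode emits all zeros for that block.
    norms = []
    for i in range(0, len(vec), block):
        sub = vec[i:i + block]
        norms.append(math.sqrt(sum(x * x for x in sub)))
    return norms

def total_norm(norms):
    # O(k) L2-norm readthrough for k > 1: ||x|| = sqrt(sum_k ||x_k||^2).
    return math.sqrt(sum(n * n for n in norms))

# Demo: d=6, B=2, with a zero-norm middle block.
vec = [3.0, 4.0, 0.0, 0.0, 0.0, 5.0]
norms = block_norms(vec, block=2)
```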
+
+#### Theoretical MSE bound
+
+The paper's MSE bound (Theorem 1 in [1]) is:
+
+```
+E[‖x - x̂‖² / ‖x‖²] ≤ (√3 · π / 2) / 4^b ≈ 2.72 / 4^b
+```
+
+**Crucially, Theorem 1 is proved for true random orthogonal matrices (QR of
+Gaussian), not SORF.** Our SORF is an approximation: the bound is guaranteed
+only under a true random orthogonal rotation, so with SORF it must be
+validated empirically (see Experimental plan).
+
+Assuming the per-block MSE bound holds, for a vector split into blocks the
+first line is an **algebraic** identity (exact); the inequality on the second
+line applies Theorem 1's **probabilistic** bound to each block and should be
+read as holding in **expectation** over independent per-block rotations, not
+almost surely:
+
+```
+‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²)    (exact)
+E[‖x - x̂‖² / ‖x‖²] ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) = MSE_bound  (in expectation)
+```
+
+The conclusion: `E[‖x - x̂‖² / ‖x‖²] ≤ MSE_bound` assuming independent
+per-block rotations. (Theorem 1 applies because each block is normalized to
+unit norm before rotation and quantization; the per-block encoding pipeline is:
+split → normalize → rotate → quantize, matching the theorem's unit-sphere
+assumption.) Note that TurboQuant's original analysis uses a single
+global rotation in high-d where coordinates are nearly independent; with
+smaller block dimension B, within-block coordinate dependence after rotation may
+be stronger even when marginals are correct — this is an additional motivation
+for the experimental plan's comparison of block sizes.
+
+**Empirical evidence from small-block rotations.** The RotorQuant/IsoQuant
+experiments [13] provide direct evidence of this decorrelation failure mode:
+block-diagonal rotations in SO(3) (3-dim groups) and SO(4) (4-dim groups)
+caused 10× MSE regressions on real KV-cache vectors, attributed to complete
+absence of cross-group coordinate mixing. 
Our Stage 2 design operates at a +fundamentally different scale — B=256 blocks with 3-round SORF provide 24 +butterfly mixing stages within each block, vs. RotorQuant's 3-4 raw coordinates +with no structured mixing — so the decorrelation loss should be far less severe. +Nevertheless, the experimental plan includes explicit cross-block correlation +measurement on real embeddings to quantify any residual decorrelation gap +between block-decomposed (B=256) and single-block (B=d) SORF. + +The actual MSE may depend on block dimension B: at larger B the coordinate +distribution is more concentrated (variance ~1/B), giving the Max-Lloyd +quantizer more to exploit. See Experimental plan. + +**SORF approximation.** The R-round SORF `HD_R·...·HD₂·HD₁` [5] provides +log₂(B) butterfly stages per round × R rounds = R·log₂(B) total. At R=3 +(default): 18 at B=64, 24 at B=256, 27 at B=512. At R=5: 30 at B=64, 40 at +B=256. Counting butterfly stages is a rough heuristic for mixing quality with +no theoretical backing: [5] proves near-unbiasedness for kernel approximation +(Theorem 3) and pairwise near-orthogonality (Theorem 4), but does **not** prove +distributional closeness to Haar measure, does not analyze convergence rate as +a function of rounds × dimension, and leaves tight variance bounds for SORF as +an open problem. The variable-round rotation signs (Stage 1) enable testing +more rounds at smaller B or lower bit widths where mixing quality matters more. +Empirical validation is needed. + +**Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a +B × B random orthogonal matrix (QR of Gaussian). Storage at B=256: 256 KB per +block. For d=768 with k=3: 768 KB total. Amortizes for large columns (100K+ +vectors). Each block must have an **independent** rotation matrix. + +DCT and other fixed structured transforms are not suitable for TurboQuant's +rotation (they do not produce the required Beta marginal). 
Sharing a rotation
+with ADSampling-style pruning is a speculative future direction. See Appendix C
+for details on both.
+
+#### Quantized-domain operations
+
+All quantized operations read per-block norms from the internal child array:
+
+- **L2 distance**: `‖a-b‖² = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2·Σ_k ‖aₖ‖·‖bₖ‖·
+unit_dotₖ`. Primary ANN metric; reuses per-block dot product and norms.
+- **Dot product**: `⟨a, b⟩ ≈ Σ_k ‖aₖ‖·‖bₖ‖ · Σ_j centroids[code_aₖ[j]] ·
+centroids[code_bₖ[j]]`.
+- **Cosine similarity**: `cos(a,b) ≈ dot(a,b) / (‖a‖·‖b‖)` where
+  `‖a‖ = √(Σ_k ‖aₖ‖²)`.
+- **L2 norm**: `√(Σ_k ‖xₖ‖²)`. O(k) per vector — a regression from the
+  current O(1) single-norm readthrough, but modest.
+
+#### Encoding algorithm
+
+```
+Input: x ∈ ℝ^d, b_mse bits per coordinate, block_size B
+k = d / B  (exact division, no straggler for chosen B)
+num_centroids = 2^b_mse
+
+# Block split and normalize
+for i in 0..k:
+    xᵢ = x[i*B .. (i+1)*B]
+    nᵢ = ‖xᵢ‖
+    if nᵢ > 0:
+        ûᵢ = xᵢ / nᵢ
+    else:
+        ûᵢ = zeros(B)
+
+# MSE stage (per block, SORF rotation)
+for i in 0..k:
+    if nᵢ > 0:
+        rᵢ = SORFᵢ(ûᵢ)
+        cᵢ[j] = nearest_centroid(rᵢ[j])    # for each coordinate j
+    else:
+        cᵢ[j] = 0                          # for each coordinate j
+
+Store (all as internal children):
+    codes (k × B per vector), norms (k per vector),
+    centroids (2^b_mse, shared), SORF signs (k × R × B, shared; R = SORF rounds)
+```
+
+#### Decoding algorithm
+
+```
+for i in 0..k:
+    r̂ᵢ[j] = centroids[cᵢ[j]]              # for each coordinate j
+    ûᵢ = SORF⁻¹ᵢ(r̂ᵢ)
+    x̂ᵢ = nᵢ × ûᵢ                          # nᵢ read from internal norms child
+x̃ = concat(x̂₀, ..., x̂ₖ₋₁)
+```
+
+### Stage 3: PDX dimension-major layout
+
+Introduce a new `PDXArray` encoding type that wraps any `FixedSizeListArray`
+with a dimension-major layout within groups of 64 vectors [4]. Like block
+decomposition (Stage 2), PDXArray is a **structural transform** over
+`FixedSizeListArray`, not specific to any particular encoding — it is a
+general-purpose layout optimization for any FixedSizeList of scalar elements
+(raw float vectors, scalar-quantized vectors, TurboQuant codes, etc.).
+
+**Changes vs. 
Stage 2:** + +| Aspect | Stage 2 | Stage 3 | +| ---------------- | ------------------------------------------------ | ------------------------------------------------------------------------------- | +| Codes child type | `FixedSizeListArray` | **`PDXArray`** (wraps FSL with transposed layout) | +| Codes detection | N/A (codes always FSL) | **TQ checks child type**: FSL → row-major decode, PDXArray → un-transpose first | +| Distance kernel | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** | +| Decode path | Direct inverse SORF per vector | **PDXArray.to_fsl() first**, then inverse SORF | + +**Unchanged from Stage 2:** Block size B, centroid computation, norm storage, +SORF rotation, all encoding logic. The encode path produces row-major codes +(FSL), then the compressor wraps them in a PDXArray; the decode path converts +PDXArray back to FSL then decodes. + +**PDXArray design:** + +``` + +PDXArray (general-purpose dimension-major layout for FixedSizeList) +├── metadata: { list_size, chunk_size (= 64) } +├── elements: PrimitiveArray # transposed: 64 values per dim, contiguous +├── validity: ... # same as FSL validity + +``` + +- `PDXArray::try_new(fsl)` — transposes a FixedSizeListArray into PDX layout +- `PDXArray::to_fsl()` — un-transposes back to row-major FSL (for decode, + scalar_at, or non-aligned slice/take) +- `PDXArray::elements_for_dim(dim, chunk)` — O(1) access to a contiguous slice + of 64 values for one dimension within one chunk +- Slice/take: un-transpose to FSL (simplest). Un-transpose cost is + O(rows × list_size) per operation; consider 64-row-aligned fast paths for + hot scan workloads. Preserving PDX layout is possible only for + 64-vector-aligned ranges. +- The cascade compressor treats PDXArray as a valid encoding of FSL-typed data. 
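A minimal sketch of the transpose behind `PDXArray::try_new` (written as a free function with hypothetical names; `chunk` is fixed at 64 in the proposal but parameterized here so a small example is checkable):

```rust
/// Row-major FSL elements -> dimension-major PDX layout within fixed-size
/// chunks of `chunk` vectors. The inverse (`to_fsl`) is the symmetric
/// transpose.
fn to_pdx(elems: &[u8], list_size: usize, chunk: usize) -> Vec<u8> {
    let rows = elems.len() / list_size;
    assert!(rows % chunk == 0, "sketch assumes a chunk-aligned row count");
    let mut out = Vec::with_capacity(elems.len());
    for chunk_start in (0..rows).step_by(chunk) {
        for dim in 0..list_size {
            for v in 0..chunk {
                out.push(elems[(chunk_start + v) * list_size + dim]);
            }
        }
    }
    out
}
```

For four 2-dim vectors `[1,2] [3,4] [5,6] [7,8]` with `chunk = 2`, the output is `[1,3, 2,4, 5,7, 6,8]`: dimension 0 of the first chunk, then dimension 1, then the second chunk.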
+
+**Benefits of PDXArray as a separate type:**
+
+- PDX logic tested and maintained independently of TurboQuant
+- Other encodings (raw float vectors, scalar quantization, future encodings)
+  get PDX scan performance for free
+- TurboQuant doesn't need an `is_pdx` metadata flag — it checks its codes
+  child's type at runtime
+- The distance kernel operates on PDXArray's dimension-contiguous slices
+
+Within each 64-vector chunk, codes are stored dimension-major:
+
+```
+TQ block 0, dim 0:       [v0 v1 v2 ... v63]
+TQ block 0, dim 1:       [v0 v1 v2 ... v63]
+...
+TQ block 0, dim (B - 1): [v0 v1 v2 ... v63]
+TQ block 1, dim 0:       [v0 v1 v2 ... v63]
+...
+```
+
+The inner SIMD loop (64 vectors) has no inter-vector dependencies. TQ block
+boundaries only affect where norm weighting occurs — they don't affect the
+transpose.
+
+**Quantized distance kernel (dot product):**
+
+```rust
+let dist_table = precompute_product_table(&centroids);
+// At b_mse=4: 16×16 = 256 floats = 1KB, fits in L1
+
+let mut distances = [0.0f32; 64];
+let mut unit_dots = [0.0f32; 64];
+let mut offset = 0;
+
+for tq_block in 0..k {
+    for dim in 0..B {
+        let qd = query_codes[tq_block * B + dim];
+        let row = &dist_table[qd as usize];
+        for v in 0..64 { // SIMD-friendly: no inter-vector deps
+            unit_dots[v] += row[codes[offset] as usize];
+            offset += 1;
+        }
+    }
+    // Weight per-block unit-norm dot product by both vectors' block norms
+    for v in 0..64 {
+        distances[v] += query_norms[tq_block] * data_norms[v][tq_block]
+            * unit_dots[v];
+        unit_dots[v] = 0.0; // reset for next TQ block
+    }
+}
+```
+
+**Int8 layout variant.** The PDX implementation [pdx-impl] uses a different
+tiling for int8 data: "4 dims × 16 vecs" to leverage VPDPBUSD/UDOT hardware
+dot-product instructions (which process 4 unsigned×signed byte pairs per
+operation). For TurboQuant codes at b_mse ≤ 8, codes are uint8 centroid indices,
+so VPDPBUSD doesn't apply directly — we need the distance-table-lookup path
+shown above. 
However, at b_mse=8 with high B, the Max-Lloyd centroids are +near-uniformly spaced (see GPU section), potentially enabling direct hardware +dot-product on the codes. Whether this requires a separate linear quantization +mode or works with the existing Max-Lloyd centroids is an empirical question. The +"4 dims × 16 vecs" layout would be a Stage 3 optimization to evaluate alongside +the "1 dim × 64 vecs" float-style layout. + +**ADSampling integration.** The PDX dimension-pruning approach (ADSampling [4]) +is complementary to TurboQuant's block structure. During a scan, the pruner +could evaluate partial distances after each TQ block (B dimensions) and skip +remaining blocks if the partial L2 distance already exceeds the candidate +threshold. This requires the per-block norm weighting to happen at block +boundaries (as shown in the kernel above), which our design already provides. + +**Open design questions:** + +- Should PDXArray live in `vortex-array` (general infrastructure) or + `vortex-tensor` (vector-specific)? +- Should the cascade compressor automatically PDX-transpose FSL children when + it detects a scan-heavy workload, or should PDX be opt-in? +- Should we support the "4 dims × 16 vecs" uint8 layout variant (for hardware + dot-product) alongside the "1 dim × 64 vecs" float-style layout? + +### QJL correction (deferred — experimental) + +Based on community findings [8], QJL is deferred to after the MSE stages are +validated. + +**Changes vs. 
MSE-only (if pursued):** + +| Aspect | MSE-only | MSE + QJL | +| ---------------------- | -------------------------------- | --------------------------------------------------------------- | +| Bit budget | All b bits → MSE (2^b centroids) | b-1 bits MSE + 1 bit QJL (2^(b-1) centroids) | +| Inner product estimate | Biased (MSE quantization noise) | Unbiased (QJL correction; see TurboQuant_prod in [1]) | +| Additional children | None | QJL signs, QJL residual norms, QJL projection params | +| Encode cost | SORF only | SORF + QJL projection (O(B²) for Gaussian, O(B log B) for SORF) | +| Decode cost | Inverse SORF only | Inverse SORF + QJL inverse projection | + +If pursued, four strategies should be compared: + +| Strategy | Theoretical | Speed | Storage | +| ------------------ | --------------------- | ---------------- | ------------ | +| Per-block Gaussian | Correct (Lemma 4 [1]) | O(B²)/block | k×B²×4 bytes | +| Per-block SORF | Approximate | O(B log B)/block | k×R×B bits | +| Full-dim SORF | Approximate | O(d log d) total | R×d bits | +| MSE-only (no QJL) | N/A | 0 | None | + +The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically +for Gaussian. SORF for QJL is an additional approximation (the +[original QJL implementation][original-impl] used SORF for QJL). Per-block QJL can +incur up to d/B times larger variance bound than full-dimension QJL (Lemma 4 +[1]), depending on how query and residual energy are distributed across blocks. + +Community reports indicate MSE-only often wins for KV-cache attention at all +tested bit widths [8]. Whether this extends to ANN ranking is an empirical +question (see Experimental plan); QJL may not be worth the complexity. Note: +the [original QJL PR][original-impl] flagged a known SORF-related QJL bias for +non-power-of-2 padded dimensions (#7245); the merged MSE-only encoding avoids +this path. 
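Since both the MSE stage and the (deferred) QJL variants lean on SORF, a reference sketch of a single round may help. One round is H·D — a random sign diagonal followed by a fast Walsh-Hadamard transform, scaled by 1/√B so the round is orthogonal. This is illustrative only (the actual kernel bitpacks signs and stores them in inverse-friendly order):

```rust
/// One SORF round: apply a sign diagonal D, then an in-place Walsh-Hadamard
/// transform H with log2(B) butterfly stages, scaled so the round is
/// orthogonal. B must be a power of two.
fn sorf_round(x: &mut [f32], signs: &[bool]) {
    let b = x.len();
    assert!(b.is_power_of_two() && signs.len() == b);
    for (v, &s) in x.iter_mut().zip(signs) {
        if s {
            *v = -*v;
        }
    }
    // Fast Walsh-Hadamard butterflies.
    let mut h = 1;
    while h < b {
        for i in (0..b).step_by(h * 2) {
            for j in i..i + h {
                let (a, c) = (x[j], x[j + h]);
                x[j] = a + c;
                x[j + h] = a - c;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (b as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}
```

R rounds compose R such applications with independent sign diagonals; because each round is orthogonal, norms are preserved, which decode relies on when rescaling by the stored block norm.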
+ +## Array layout + +### Stage 1 (MSE-only single block) + +``` +TurboQuantArray +├── metadata: { dimension, b_mse, +│ block_size (= padded_dim, next power-of-2 ≥ dimension), +│ num_blocks (= 1), num_rounds (= R, default 3) } +│ +│ # Per-row children +├── codes: FixedSizeListArray # list_size = padded_dim +│ (or PDXArray after Stage 3) +├── norms: PrimitiveArray # len = num_rows (F = f64 for f64, f32 otherwise) +│ +│ # Shared children +├── centroids: PrimitiveArray # len = 2^b_mse +├── mse_rotation_signs: FixedSizeListArray # len = R (default 3) +│ element dtype: FixedSizeList(u8, padded_dim, NonNullable) +│ # each element = one bitpacked sign diagonal, inverse-friendly order +``` + +For power-of-2 dimensions, `padded_dim = dimension` (no waste). For +non-power-of-2 (e.g., d=768), `padded_dim = 1024` (33% overhead, eliminated +by Stage 2 block decomposition). + +The codes child is `FixedSizeListArray` in Stages 1-2 and may be swapped to +`PDXArray` in Stage 3 — TurboQuant checks the child type at runtime, not via +a metadata flag. 
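The metadata fields above come with decoder-side invariants (spelled out in the Migration and compatibility section); a sketch of the checks, with illustrative field and function names rather than the Vortex API:

```rust
/// TurboQuant metadata relevant to decoder validation (sketch).
struct TqMeta {
    dimension: usize,  // logical input dimension
    block_size: usize, // B, always a power of two
    num_blocks: usize, // k (1 in Stage 1)
    num_rounds: usize, // R SORF rounds
}

/// Decoder-side sanity checks against the codes and rotation-signs children.
fn validate(m: &TqMeta, codes_list_size: usize, signs_len: usize) -> bool {
    m.block_size.is_power_of_two()
        // codes child holds num_blocks × block_size coordinates per vector
        && codes_list_size == m.num_blocks * m.block_size
        // rotation-signs child holds one diagonal per (block, round)
        && signs_len == m.num_blocks * m.num_rounds
        // internal padding may make codes wider than the logical dimension
        && m.dimension <= codes_list_size
}
```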
+ +### Stage 2 (block decomposition) + +``` +TurboQuantArray (self-contained, handles blocks internally) +├── metadata: { dimension, b_mse, block_size, num_blocks, +│ num_rounds } +│ +│ # Per-row children (sliced/taken on row operations) +├── codes: FixedSizeListArray # list_size = k × B +│ (or PDXArray after Stage 3) +├── norms: PrimitiveArray # len = num_rows (k=1) +│ or FixedSizeListArray # list_size = k (k>1) +│ +│ # Shared children (cloned on row operations, not sliced) +├── centroids: PrimitiveArray # len = 2^b_mse +├── mse_rotation_signs: FixedSizeListArray # len = k × R +│ element dtype: FixedSizeList(u8, B, NonNullable) +│ # k blocks × R rounds, each element = one bitpacked sign diagonal +``` + +## Compression ratio + +For f32 input, b_mse bits MSE, k = d/B blocks, N vectors (for f64 input, +replace 32 with 64 in the norms row — ratios decrease accordingly): + +| Component | Bits per vector | +| ----------- | --------------- | +| MSE codes | k × B × b_mse | +| Block norms | k × 32 | + +| Component | Shared bits | +| ---------- | ------------ | +| Centroids | 2^b_mse × 32 | +| SORF signs | k × R × B | + +### Worked examples (f32, N=1000) + +**At b_mse=8 (default, near-lossless):** + +| d | B | k | Per-vec bits | Ratio | Notes | +| ------------ | ---- | --- | --------------------- | ----- | ------------------------ | +| 768 | 256 | 3 | 3×256×8 + 3×32 = 6240 | 3.9× | Block decomp; no padding | +| 1024 | 1024 | 1 | 1024×8 + 32 = 8224 | 4.0× | Single block (= current) | +| 768 (padded) | 1024 | 1 | 1024×8 + 32 = 8224 | 3.0× | Padded; 33% overhead | + +**At b_mse=5 (32 centroids):** + +| d | B | k | Per-vec bits | Ratio | Notes | +| ------------ | ---- | --- | --------------------- | ----- | ------------------------ | +| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; no padding | +| 1024 | 1024 | 1 | 1024×5 + 32 = 5152 | 6.4× | Single block (= current) | +| 768 (padded) | 1024 | 1 | 1024×5 + 32 = 5152 | 4.8× | Padded; 33% overhead | + +Block 
decomposition improves the compression ratio at both bit widths. At b=8 +for d=768: from ~3.0× (padded) to ~3.9× (block decomp). At b=5 for d=768: from +~4.8× to ~6.2×. For d=1024, the encoding is identical to current (single block). + +**Shared overhead note:** centroids and SORF signs are amortized over N vectors; +for small N, per-column shared metadata is significant — report totals with and +without amortization when publishing ratios. + +## Performance analysis + +### Encode/decode throughput + +SORF at B dimensions (heuristic — real cost is dominated by memory bandwidth +and constant factors): R × B × log₂(B) butterflies + R × B sign applications +per block (R = SORF rounds, default 3; plus B normalization multiplies, +omitted). For k blocks, R=3: + +| B | SORF FLOPs/block | k (d=768) | Total MSE FLOPs | +| -------------- | ------------------------- | --------- | --------------- | +| 256 | 3×256×8 + 768 = 6,912 | 3 | 20,736 | +| 512 | 3×512×9 + 1536 = 15,360 | — | — | +| 1024 (current) | 3×1024×10 + 3072 = 33,792 | 1 | 33,792 | + +Block decomposition at d=768 is ~40% fewer FLOPs than the padded single-block +approach, despite more blocks, because each block is smaller. + +### Benchmarking plan + +1. Encode/decode throughput: block TQ vs. current TQ at d=128, 768, 1024 +2. Quantized cosine similarity: block vs. current +3. L2 norm readthrough: O(k) vs. O(1) +4. PDX scan throughput vs. row-major (Stage 3) + +## Experimental plan + +### Minimum dimension threshold + +Test TurboQuant quality at d ∈ {32, 64, 96, 128, 256} to validate the scheme +minimum of 128: + +- Compare TurboQuant MSE distortion and ANN recall@k against scalar + quantization matched on **total compressed bits per vector** (codes + norm + + amortized shared metadata), not just bits-per-coordinate — this is critical + at small d where norm overhead is significant +- Plot the crossover point: at what d does TurboQuant's recall@k drop below + the rate-matched scalar baseline? 
+- Test SORF coordinate distribution quality at each d (histogram vs. Beta) +- Measure overhead ratio (norm bits / total compressed bits) at each d + +The scheme minimum should be set at the smallest d where TurboQuant reliably +beats the scalar baseline on recall@k across the benchmarking datasets. Default +scalar baseline: per-dimension linear min-max quantization at b bits per +coordinate plus an f32 norm (matching TurboQuant's norm overhead). Report +results at a reference N (e.g., N=100K vectors) where shared metadata is +amortized; optionally show sensitivity to small N where shared costs dominate. +The current proposal of 128 is conservative; experiments may justify lowering +to 64 or raising to 256. + +### MSE quality and scan performance vs. block size + +- Compare actual normalized MSE at B ∈ {64, 128, 256, 512} vs. single-block at + full power-of-2 dimension, at bit widths b ∈ {2, 3, 4, 5, 8} +- Compare ANN recall@k and scan throughput at fixed d (e.g., d=3072) across + B ∈ {256, 512, 1024} — smaller B gives more pruning checkpoints for + ADSampling-style early termination but increases norm overhead +- Test SORF coordinate distribution at each B: histogram vs. analytical Beta +- Test 3, 4, 5 SORF rounds at each B +- Determine if the practical MSE constant is worse at smaller B +- Measure cross-block coordinate correlation on real embeddings (Contriever, + OpenAI) before and after per-block SORF rotation: compute the average + absolute Pearson correlation between coordinates in different blocks. Compare + block-decomposed (B=256, k=3) vs. single-block (B=d) SORF at d=768 to + quantify how much cross-block dependence survives block decomposition. 
The + RotorQuant/IsoQuant experiments [13] showed that very small block-diagonal + rotations (3-4 dims) leave full-dimension correlations intact; this test + determines where on the block-size spectrum the decorrelation gap becomes + negligible + +The block-size rule ("greatest qualifying B") is a starting heuristic that +maximizes per-block quality and minimizes norm count. Experiments may show that +smaller B with more pruning checkpoints yields better end-to-end scan +performance despite higher per-block overhead. + +### Gaussian-optimal vs. Beta-optimal grids + +HIGGS [12] demonstrates that Gaussian-optimal grids (computed via CLVQ for N(0,1)) +work well after a single Hadamard rotation. Since the Beta marginal converges to +Gaussian at high d, test whether Gaussian grids can replace Beta Max-Lloyd centroids +for ANN search: + +- **Grid comparison**: At B ∈ {64, 128, 256, 512} and b ∈ {2, 3, 4, 5, 8}, + compare ANN recall@k and normalized MSE for (a) Beta Max-Lloyd centroids at + B-dim, (b) Gaussian-optimal scalar grids (Normal Float style), and + (c) CLVQ-computed Gaussian grids. Report the crossover point where the grids + become practically equivalent. +- **Rotation depth**: If Gaussian grids match Beta Max-Lloyd at a given B, test + whether 1-round RHT (H·D with random signs) achieves comparable quality to + 3-round SORF. A single round would reduce rotation cost by ~3× and simplify + the transform. Test at B ∈ {64, 128, 256, 512} on the benchmarking datasets. +- **Simplification potential**: If Gaussian grids + 1-round RHT match quality at + B ≥ 256, this eliminates the dimension-dependent centroid computation (one grid + per bitwidth, shared across all block sizes) and reduces rotation overhead. + This would be a significant implementation simplification for Stage 2+. + +The expectation is that at B=256+ the difference is negligible, but at B=64-128 +the Beta-optimal grids may still win due to stronger non-Gaussian effects. 
Results +should inform whether the centroid computation strategy changes in Phase 2. + +### QJL strategy comparison (if pursued) + +- Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim SORF QJL + vs. MSE-only +- Key metric: ANN recall@k on the datasets above (Contriever, OpenAI, SIFT) +- Per community findings for attention, MSE-only is expected to win [8]; ANN + ranking is the key open question + +### Benchmarking datasets + +The current test suite uses i.i.d. Gaussian vectors as a theory anchor and +sanity check: for isotropic data, a random orthogonal transform is +distributionally neutral, which cleanly validates theoretical bounds. This is +not a universal "worst case" for all production workloads — heavy-tailed or +clustered embeddings can behave differently. Recent work +(VIBE [11]) argues that traditional benchmarks (SIFT, GloVe) are no longer +representative of modern ANN workloads. + +**Recommended datasets:** + +| Dataset | Dim | Size | Source | Why | +| ----------------------------- | ------ | ------ | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | +| Contriever | 768 | ~1M | PDX paper [4] | Key non-power-of-2 target; real embeddings | +| OpenAI text-embedding-3-large | 1536 | ~1M | Common in RAG | High-d production embeddings | +| SIFT | 128 | 1M | Classic | Low-d power-of-2 baseline, well-studied recall numbers | +| arXiv embeddings | 768 | 2.25M | PDX paper [4] | Same dim as Contriever, larger scale | +| DEEP | 96 | 10M | Image embeddings | Large scale; d=96 < scheme min (128) and has no B ≥ 64 — requires explicit TurboQuantArray construction or benchmark-only scheme override | +| Synthetic Gaussian | varies | varies | Internal | Theory anchor / sanity check; not universal worst case | + +**Metrics** (at b_mse ∈ {2, 3, 4, 5, 8}): + +- Recall@10, Recall@100 (ANN ranking quality) +- Normalized MSE distortion (reconstruction quality) 
+- Inner product mean signed relative error (bias measurement) +- Encode/decode throughput (vectors/sec) + +The Gaussian baseline validates that theoretical bounds hold. The real-embedding +datasets measure practical quality — which may be **better** than Gaussian +(structured data benefits more from rotation) or **worse** (if the data has +adversarial properties for the specific rotation). + +### Dimensions with no qualifying B + +Rare for common embedding dimensions (e.g., d=96). Currently these fall back to +internal zero-padding to the next power-of-2 (single padded block). See +"Straggler blocks (future work)" in Stage 2 for a potential alternative using +heterogeneous per-block encodings. + +## Phasing + +**Phase 1** (in progress) — MSE-only single-block TurboQuant: Initial +implementation merged as [PR #7269][current-impl]. Remaining: +`FixedSizeListArray` rotation signs (variable SORF rounds), dtype-matching +norms, structured metadata, and review items (see Stage 1: Remaining work). + +**Phase 2** — Block decomposition: Add block splitting for dimensions where a +valid B exists (greatest power-of-2 ≥ 64 dividing d). Per-block norms stored as +internal children. The `TurboQuantScheme::compress()` method must be updated to: +(a) choose B based on d, (b) split input into blocks, (c) normalize per-block, +(d) encode each block, and (e) store per-block norms as an internal child array. + +**Phase 3** — PDXArray + scan kernels: Introduce `PDXArray` as a general-purpose +dimension-major layout for `FixedSizeListArray`. TurboQuant's codes child is +swapped from FSL to PDXArray by the compressor. Distance computation kernels +operate on PDXArray's dimension-contiguous slices. + +**Phase 4** (experimental) — QJL: If the experimental plan shows QJL improves +recall@k beyond MSE-only, add per-block Gaussian or SORF QJL. Based on +KV-cache community reports [8], this may not be pursued. 
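As a sanity check, the per-vector bit arithmetic behind the compression-ratio tables earlier in this RFC reduces to a small helper (f32 input; shared centroid/sign overhead excluded, as in the worked examples; names are illustrative):

```rust
/// Per-vector compressed bits: k blocks × B coordinates × b_mse bits of
/// codes, plus one f32 norm (32 bits) per block. Shared children excluded.
fn bits_per_vector(k: usize, b: usize, b_mse: usize) -> usize {
    k * b * b_mse + k * 32
}

/// Compression ratio vs. uncompressed f32 at the original dimension d.
fn ratio_f32(d: usize, k: usize, b: usize, b_mse: usize) -> f64 {
    (d * 32) as f64 / bits_per_vector(k, b, b_mse) as f64
}
```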
+ +## Practical recommendations + +For common model dimensions, the most promising configurations are: + +| Dimension | Recommendation | Rationale | +| ---------------------- | --------------------------- | -------------------------------------------------------------------------- | +| 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout. | +| 768, 1536, 3072 | 3-block MSE-only + PDX | B=256 or 512. No padding waste. 3 blocks, shared centroids. | +| No qualifying B (rare) | Padded single-block | Internal zero-padding to next power-of-2, single SORF. | + +In all cases, MSE-only is the recommended starting point. QJL should only be +added if experiments demonstrate clear recall@k improvements for the target +workload. + +## Future work: Multi-dimensional vector quantization (p>1) + +HIGGS [12] demonstrates that vector quantization with dimension p>1 (quantizing +groups of p coordinates jointly to an optimal multi-dimensional grid) achieves +better rate-distortion than scalar quantization (p=1) at the same bit budget. For +TurboQuant, this would mean replacing the per-coordinate Max-Lloyd centroid lookup +with a per-subvector codebook lookup, where each group of p rotated coordinates +maps to one of n codewords in a p-dimensional CLVQ grid. + +**Benefits:** + +- Improved rate-distortion: at 3-4 bits, p=2 or p=4 captures residual + correlations between coordinates that scalar quantization misses. +- Simpler centroid computation: CLVQ grids for Gaussian inputs are computed once + per (n, p) pair and reused across all block sizes (no dimension dependence). + +**Costs and constraints:** + +- **Distance kernel redesign.** The PDX scan kernel (Stage 3) is built around + per-coordinate centroid lookups with a (2^b)²-entry distance table. 
At p=2 + with b=4 bits per coordinate, the codebook has 2^(4×2)=256 entries, and the + distance table becomes 256×256=64K entries (256 KB) — still fits in L1/L2 but + much larger than the current 1 KB at b=4 scalar. At p=4 the table is + infeasible; alternative distance strategies (asymmetric distance computation, + partial codebook scans) would be needed. +- **GPU shared memory.** HIGGS notes total grid points 2^(b×p) must fit GPU + shared memory (~2^10 points practical limit), constraining (b, p) pairs. +- **PDX layout interaction.** The current "1 dim × 64 vecs" PDX layout assumes + per-coordinate independence. At p>1, the layout would need to group p + consecutive dimensions together per lookup, changing the transpose structure. + +**Recommendation:** Evaluate p=2 VQ experimentally after Stage 3 (PDX) is +validated. Compare ANN recall@k at matched bit budgets: p=1 at b bits vs. p=2 at +b bits. If p=2 shows meaningful recall improvement (>2% recall@10), design the +kernel changes as a Stage 4 extension. CLVQ grids for p=2 can be precomputed +offline using the Pagès & Printems (2003) algorithm [12]. + +## Future work: GPU decode and fused distance computation + +The B-dim block structure maps naturally to GPU tile sizes and tensor cores. +For a single block (k=1; Stage 2 generalizes to k independent per-block GEMMs) +with a batch of N vectors sharing the same rotation matrix R⁻¹: + +``` +decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes) + ↑ B×N matrix + ↑ B×B × B×N = GEMM +``` + +The codebook gather + inverse rotation + norm scaling can be fused into a single +kernel using an IO-aware streaming pattern analogous to Flash-KMeans [6] — not +the same algorithm (Flash-KMeans is GPU k-means), but a similar systems goal: +reduce HBM traffic and avoid full materialization. +For distance computation without full decode, a precomputed (2^b_mse)²-entry +distance table fits in shared memory at low bit widths (1 KB at b_mse=4, 4 KB +at b_mse=5). 
At the default b_mse=8, the table is 256² × 4 = 256 KB, which +exceeds typical GPU shared memory (48-228 KB); the distance-table approach is +therefore practical only at b ≤ 5 on GPU, or requires tiling/streaming for +b=8. On CPU, the table fits in L2 at all bit widths. The kernel streams code +bytes from HBM with gather-reduce accumulation, using 4-8× less bandwidth +than full float vectors. + +At b_mse=8, codes are uint8 indices (0-255). Direct low-precision GEMM on +hardware accelerators (tensor cores on GPU, byte-dot-product instructions on +CPU) requires approximately linear +centroids — but at high B the Max-Lloyd centroids are already near-uniform +(the Beta distribution is highly concentrated, approaching Gaussian, for which +high-resolution optimal quantization is approximately uniform). Whether the +existing Max-Lloyd centroids are "linear enough" for hardware dot-product +instructions is an empirical question worth testing before introducing a +separate linear quantization mode. + +## Integration with Vortex scan engine + +TurboQuant's quantized-domain operations must integrate with Vortex's expression +evaluation and scan pushdown infrastructure. The current implementation provides +this via `ScalarFnVTable` implementations in `vortex-tensor`. + +**Current integration path.** The `CosineSimilarity`, `DotProduct`, and `L2Norm` +scalar functions check whether their input storage arrays are TurboQuant-encoded +(via `TurboQuant::try_match()`). If both operands are TurboQuant and the +`ApproxOptions::Approximate` flag is set, the scalar function dispatches to the +quantized-domain kernel (e.g., `cosine_similarity_quantized_column`), bypassing +full decompression. Otherwise, it falls back to the exact path (decompress → +compute on floats). 
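Concretely, once per-block unit-norm dot products have been accumulated from the codes and centroid table, the block-weighted cosine path reduces to the following sketch (a hypothetical helper, not the `vortex-tensor` function signature):

```rust
/// Block-weighted quantized cosine: dot(a,b) / (‖a‖·‖b‖), where the dot
/// product is Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ and full norms come from per-block
/// norms. `unit_dots` holds the per-block unit-norm dot products.
fn cosine_from_blocks(norms_a: &[f32], norms_b: &[f32], unit_dots: &[f32]) -> f32 {
    let dot: f32 = norms_a
        .iter()
        .zip(norms_b)
        .zip(unit_dots)
        .map(|((na, nb), ud)| na * nb * ud)
        .sum();
    let na = norms_a.iter().map(|n| n * n).sum::<f32>().sqrt();
    let nb = norms_b.iter().map(|n| n * n).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}
```

With k = 1 this degenerates to the current single-norm path: the norms cancel and the cosine is just the stored unit-norm dot product.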
+ +**Stage 2 changes.** With block decomposition, the quantized kernels must be +updated to iterate over TQ blocks, weighting by per-block norms: + +- `cosine_similarity_quantized_column`: currently computes a single unit-norm + dot product per row pair. Must change to `Σ_k norm_a_k · norm_b_k · +unit_dot_k / (‖a‖ · ‖b‖)` with `‖a‖ = √(Σ_k norm_a_k²)`. +- `dot_product_quantized_column`: same per-block weighting. +- `l2_norm`: currently returns the stored norm directly (O(1)). Must change to + `√(Σ_k norm_k²)` — read the norms child (`PrimitiveArray` for k=1, + `FixedSizeListArray` for k>1) and compute. +- Both operands must have the **same block size B**, compatible centroids (same + `b_mse` and B-dim codebook), and **bit-identical MSE rotation parameters** + (`mse_rotation_signs` and same SORF construction) for the quantized + inner-product path to be valid. Two stored columns with different rotations + must **fall back to exact** (decompress → float). The common **column vs. + constant query** path avoids this: the query is re-encoded with the column's + rotation and centroids at query time. + +**Stage 3 changes.** The PDX distance kernel (shown in Stage 3 pseudocode) is a +new execution path that operates on `PDXArray`-typed codes. It should be exposed +as an alternative `ScalarFnVTable` implementation that activates when the codes +child is a `PDXArray` and the scan is over a contiguous 64-vector-aligned range. +For non-aligned ranges or single-vector access (`scalar_at`), the PDXArray is +converted to FSL first via `PDXArray::to_fsl()`. + +**Expression tree integration.** The typical ANN scan expression is: + +``` +top_k(cosine_similarity(column, constant_query), k=10) +``` + +The `constant_query` is broadcast to match the column length. The +`CosineSimilarity` scalar function receives both the column (TurboQuant-encoded) +and the query (ConstantArray wrapping a single vector). 
For the quantized path,
the query is first encoded with the column's rotation and centroids to produce
query codes and query block norms, then the PDX kernel runs over the column's
codes without decompressing them.

## Migration and compatibility

TurboQuant has not been included in a release yet, so the wire format can still
change freely. The Stage 1 wire format is designed to be ready for
backward-compatibility guarantees, but we do not formally commit to stability
until Stage 2 implementation and benchmarking confirm it.

**Strategy: single array ID, versioned metadata.** All stages use the same array
ID (`vortex.turboquant`). The metadata includes `block_size`, `num_blocks`, and
`num_rounds` fields. Stage 1 always writes `num_blocks=1`, but the field exists
so that Stage 2 decoders can read Stage 1 files without migration.

**Decoder invariant:** `block_size` is always a power of 2.
`codes.list_size` = `num_blocks × block_size`. Note that `dimension` (the
original input dimension) may differ from `codes.list_size` in Stage 1 when
internal padding applies (e.g., dimension=768, block_size=1024, list_size=1024).
In Stage 2, `dimension = num_blocks × block_size` (no padding, since B is
chosen to divide d exactly). The decoder **validates** that
`codes.list_size == num_blocks × block_size`, rejecting files where this does
not hold. `num_rounds` must equal `rotation_signs.len / num_blocks`.

**Norms are always internal children.** The TurboQuant array is self-contained —
it stores norms as a child slot, not in a parent encoding. This means:

- Stage 1: norms child is `PrimitiveArray`, one norm per vector (F = f64
  for f64 input, f32 otherwise).
- Stage 2 with k=1 (power-of-2 dims): same as Stage 1, identical wire format.
- Stage 2 with k>1: norms child is `FixedSizeListArray`, k norms per vector.

The decoder distinguishes k=1 from k>1 by reading `num_blocks` from metadata.
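The decoder invariants above can be captured in a small validation routine. This is a sketch: the struct and field names mirror the RFC's metadata, not actual Vortex types, `rotation_signs_len` stands in for the length of the `rotation_signs` child, and `num_rounds = 3` in the examples is an arbitrary illustrative value:

```rust
/// Metadata fields from the RFC (illustrative, not the actual Vortex structs).
struct TqMetadata {
    block_size: usize, // always a power of two
    num_blocks: usize, // 1 in Stage 1; k in Stage 2
    num_rounds: usize, // SORF rounds per block
}

/// Reject files that violate the wire-format invariants.
fn validate(
    meta: &TqMetadata,
    codes_list_size: usize,
    rotation_signs_len: usize,
) -> Result<(), &'static str> {
    if !meta.block_size.is_power_of_two() {
        return Err("block_size must be a power of two");
    }
    if codes_list_size != meta.num_blocks * meta.block_size {
        return Err("codes.list_size != num_blocks × block_size");
    }
    if rotation_signs_len != meta.num_rounds * meta.num_blocks {
        return Err("num_rounds != rotation_signs.len / num_blocks");
    }
    Ok(())
}

fn main() {
    // Stage 1: dimension=768 internally padded to a single 1024-wide block.
    let stage1 = TqMetadata { block_size: 1024, num_blocks: 1, num_rounds: 3 };
    assert!(validate(&stage1, 1024, 3).is_ok());

    // Stage 2: 768 = 3 × 256, so B=256 (greatest power of 2 ≥ 64 dividing 768).
    let stage2 = TqMetadata { block_size: 256, num_blocks: 3, num_rounds: 3 };
    assert!(validate(&stage2, 768, 9).is_ok());
    assert!(validate(&stage2, 1024, 9).is_err()); // list_size mismatch rejected
}
```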
A k=1 decoder is backward-compatible with Stage 1 files. A k>1 decoder is a
new code path that only applies to files written by Stage 2+.

**Stage 3 (PDXArray) is additive.** PDX is not a TurboQuant metadata flag — it's
a separate array type (`PDXArray`) that wraps the codes child. Stage 1/2 files
have `FixedSizeListArray` codes; Stage 3 files have `PDXArray` codes. The
TurboQuant decoder checks the child type and un-transposes PDXArray on decode if
needed. `PDXArray` itself is registered as a new encoding, independent of
TurboQuant.

**Incremental shipping:**

| Stage | Ships to users? | Reads prior stage files? | Notes |
| ---------- | --------------- | -------------------------- | ---------------------------------- |
| 1 (MSE) | Yes | N/A (first stable version) | Single block, variable SORF rounds |
| 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder |
| 3 (PDX) | Yes | Yes (FSL codes still work) | PDX codes need PDXArray registered |

Each stage is independently shippable. Users can upgrade incrementally. Files
written by earlier stages are always readable by later decoders.

## References

_All lemma, theorem, and definition numbers for [1] refer to arXiv:2504.19874v1.
The ICLR 2026 camera-ready proceedings may use different numbering._

[1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online
Vector Quantization with Near-optimal Distortion Rate." ICLR 2026.
arXiv:2504.19874, April 2025.

[2] Ailon, N. and Chazelle, B. "The Fast Johnson-Lindenstrauss Transform and
Approximate Nearest Neighbors." SIAM J. Comput. 39(1):302-322, 2009.

[3] Tropp, J.A. "Improved Analysis of the Subsampled Randomized Hadamard
Transform." Adv. Adaptive Data Analysis 3(1-2):115-126, 2011.

[4] Kuffo, L., Krippner, E. and Boncz, P. "PDX: A Data Layout for Vector
Similarity Search." SIGMOD '25. arXiv:2503.04422, March 2025.

[5] Yu, F.X., Suresh, A.T., Choromanski, K., Holtmann-Rice, D. and Kumar, S.
"Orthogonal Random Features." NeurIPS 2016. arXiv:1610.09072.

[6] Yang, S. et al. "Flash-KMeans: Fast and Memory-Efficient Exact K-Means."
arXiv:2603.09229, March 2026.

[7] Pathare, T. et al. "TurboQuant: Implementation Corrections, Production
Hardening, and Deployment Infrastructure." Eviox Tech Report v1.2.0,
March 2026. https://eviox.tech/nexus/eviox_turboquant_corrections_study.pdf
_(Note: this URL may require Eviox account access; not publicly indexed.)_

[8] Community TurboQuant implementation reports (primarily KV-cache attention):

- https://github.com/tonbistudio/turboquant-pytorch — MSE-only (V3) vs
  MSE+QJL (V2); reports MSE-only wins for attention and generation quality.
- https://github.com/ggml-org/llama.cpp/discussions/20969 — TurboQuant
  discussion; quantized attention analysis and MSE vs Prod comparison.
- https://github.com/0xSero/turboquant — Triton kernels; paper validation.
- https://github.com/scos-lab/turboquant — Reference reproduction; MSE vs
  Prod/QJL comparison.

Multiple groups report MSE-only beating MSE+QJL for attention metrics at tested
bit widths. ANN ranking conclusions remain preliminary pending dedicated
benchmarks.

[9] Jégou, H., Douze, M. and Schmid, C. "Product Quantization for Nearest
Neighbor Search." IEEE Trans. PAMI 33(1):117-128, 2011.

[10] Ge, T., He, K., Ke, Q. and Sun, J. "Optimized Product Quantization."
IEEE Trans. PAMI 36(4):744-755, 2014.

[11] Jääsaari, E., Hyvönen, V., Ceccarello, M., Roos, T. and Aumüller, M.
"VIBE: Vector Index Benchmark for Embeddings." arXiv:2505.17810, May 2025.

[12] Malinovskii, V., Panferov, A., Ilin, I., Guo, H., Richtárik, P. and
Alistarh, D. "Pushing the Limits of Large Language Model Quantization via the
Linearity Theorem." arXiv:2411.17525, November 2024.

[13] johndpope et al. "RotorQuant: Clifford algebra vector quantization." PR #34,
TheTom/turboquant_plus, March-April 2026.
https://github.com/TheTom/turboquant_plus/pull/34
Explores SO(2)/SO(3)/SO(4) block-diagonal rotations as alternatives to
full-dimension SORF. Rejected due to 10×+ MSE regressions on real KV-cache
tensors, attributed to insufficient cross-group decorrelation.

## Appendix A: Reference implementation bugs and Theorem 1 constant

### Reference implementation bugs

The Eviox corrections study [7] identified six material bugs in the paper's
reference Python implementation. The most critical is a mathematical error in
the QJL scale factor: the reference code used `√(π/(2d))` instead of
`√(π/2)/d` (Definition 1 in [1]), differing by a factor of √d (≈11× at d=128).
Our [current implementation][current-impl] uses the correct formula
(`sqrt(FRAC_PI_2) / padded_dim` in Rust), so this bug does **not** affect us.

Other notable Eviox findings: (a) the reference code recomputes codebooks at
every instantiation (we cache in a `DashMap`); (b) the reference uses float16
for codebook distance computation, causing misassignment at small centroid
spacings (we cast to f32 before quantization). See [7] for the full list.

### Theorem 1 constant

There is an ambiguity in the paper's notation for the MSE bound constant. The
formal proof gives `(√3 · π / 2) · 4^{-b}`, where the constant √3·π/2 ≈ 2.72.
The Eviox report [7] (Item 7) deliberately adopts the alternative parsing
`√(3π)/2 ≈ 1.535`, claiming it is "consistent with the formal proof." We treat
`√3·π/2 ≈ 2.72` as the theorem constant because: (a) the paper's prose
describes the constant as "≈ 2.7," which matches 2.72, not 1.535; and (b) the
paper's reported distortion values (b=2: 0.117, b=3: 0.03) exceed the
1.535-based bound (b=2: 0.096, b=3: 0.024), ruling out `√(3π)/2` as a valid
**upper** bound on the measured quantity. The definitive resolution requires
checking the exact LaTeX grouping in the ICLR 2026 camera-ready proof.
The paper's "explicit values" (0.36, 0.117, 0.03, 0.009) are the actual computed
distortion of the optimal quantizer, not the bound itself — they are well below
the 2.72/4^b bound.

## Appendix B: Community findings on QJL

Multiple independent TurboQuant implementations have repeatedly reported a
practical finding for **KV-cache attention**: MSE-only often outperforms MSE+QJL
at the same bit budget. The likely mechanism is a variance-bias tradeoff: QJL
removes bias in raw inner-product estimation but adds variance, and the softmax
nonlinearity amplifies variance more than it penalizes bias. In that setting,
allocating all bits to MSE (more centroids, lower quantization variance) can beat
splitting the budget between MSE and QJL. This behavior has been reported by
multiple groups across Python, C, and Rust implementations [8].

For ANN search, cosine ranking, and other non-softmax vector-search workloads,
the evidence is currently less settled. MSE-only is still a reasonable default
because it is simpler and better supported by the current implementation work,
but the ANN question should be treated as empirical until evaluated on ANN
datasets with recall@k and ranking metrics (see Experimental plan).

## Appendix C: Alternative rotation strategies

### Why not DCT?

DCT is O(B log B) and invertible, but it is a **fixed structured transform**,
not a random rotation — it does not produce the Beta marginal distribution
`(1-x²)^((B-3)/2)` (in block dimension B) that TurboQuant's Max-Lloyd centroids
are optimized for. ADSampling only needs approximate coordinate independence
(for hypothesis-testing pruning), so a fixed orthogonal transform like DCT
suffices there. TurboQuant needs a specific known marginal distribution, so only
random orthogonal rotations (QR or SORF) are suitable.

### Shared rotation with ADSampling (speculative)

Both TurboQuant and ADSampling apply a random orthogonal rotation to make
coordinates independent.
If we integrate ADSampling-style dimension pruning +(see Stage 3), the same rotation could in principle serve both purposes. +However, this is not automatic under the Stage 2 block-decomposed design: +ADSampling is formulated around a single full-dimensional random projection +whose coordinates can be sequentially sampled, whereas Stage 2 introduces +per-block rotations and per-block norm weighting. Reusing one rotation across +both systems should be treated as a **future research direction** that requires +new analysis or direct empirical validation. If it proves viable, it would avoid +rotating the data twice. The query would also need to be rotated at query time +with the same stored transform.
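Both properties come from the same geometric fact: a randomly rotated unit-norm vector is uniformly distributed on the sphere, so each coordinate follows the concentrated Beta marginal with E[xᵢ²] = 1/d. The sketch below checks the second moment numerically using a tiny deterministic LCG and Box-Muller sampling (illustrative code with no external crates, not part of Vortex):

```rust
use std::f64::consts::TAU;

/// Minimal deterministic PRNG (64-bit LCG) — illustrative only.
struct Lcg(u64);

impl Lcg {
    fn uniform(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 11) as f64) / (1u64 << 53) as f64
    }

    /// Standard normal sample via Box-Muller.
    fn gaussian(&mut self) -> f64 {
        let u1 = self.uniform().max(1e-12);
        let u2 = self.uniform();
        (-2.0 * u1.ln()).sqrt() * (TAU * u2).cos()
    }
}

fn main() {
    let (d, n) = (64, 4000);
    let mut rng = Lcg(42);
    let mut sum_sq = 0.0;
    for _ in 0..n {
        // A normalized Gaussian vector is uniform on the unit sphere; its
        // first coordinate follows the concentrated Beta marginal.
        let v: Vec<f64> = (0..d).map(|_| rng.gaussian()).collect();
        let norm = v.iter().map(|x| x * x).sum::<f64>().sqrt();
        sum_sq += (v[0] / norm).powi(2);
    }
    let mean_sq = sum_sq / n as f64;
    // E[x₀²] = 1/d: the marginal is tightly concentrated around zero,
    // which is what TurboQuant's Max-Lloyd centroids are fitted to.
    assert!((mean_sq - 1.0 / d as f64).abs() < 5e-3);
    println!("E[x0^2] ≈ {mean_sq:.5} (expected {:.5})", 1.0 / d as f64);
}
```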