
Reduce CKKS rotation-key footprint via single-giant-step Horner BSGS #1209

Open: pascoec wants to merge 3 commits into dev from transform-optimizations


pascoec (Collaborator) commented Jun 24, 2026
Summary

The baby-step/giant-step (BSGS) linear transforms used throughout CKKS -- the
bootstrapping homomorphic FFT (CoeffsToSlots / SlotsToCoeffs) and the CKKS<->FHEW
scheme-switching transforms -- generate one rotation (automorphism) key per giant
step. These keys dominate the memory footprint of a bootstrapping / scheme-
switching context. This PR reformulates the giant-step accumulation into Horner
form, which needs only a single giant-step key per level instead of (b-1) distinct
keys, then drops the now-unneeded keys from the matching key generation. The
underlying math is unchanged; results are equivalent.

What changed

  • Horner giant-step accumulation: sum_i Aut_{i*t}(block_i) becomes
    block_0 + Aut_t(block_1 + Aut_t(...)), using one giant-step key (t = g*scale)
    instead of the b-1 keys {t, 2t, ..., (b-1)t}. Applied to every BSGS transform
    in both subsystems.
  • Zero-based inner indices: the j=0 baby term maps to rotation 0 (handled by
    KeySwitchExt, no key); the per-level offset is pushed offline into the
    precomputed plaintexts.
  • Folded per-level corrections into a single accumulated rotation applied once at
    the end of SlotsToCoeffs, replacing the previous O(levels) EvalAtIndex calls;
    split so CoeffsToSlots still emits correctly-ordered slots (fixes FBT_CONSECLEV).
  • Sparse-packing index reduction modulo min(2*slots, M/4), keeping automorphisms
    consistent with the period-(2*slots) plaintext pre-rotations (no-op for full
    and half packing).
  • Key-gen reduction: FindCoeffsToSlotsRotationIndices,
    FindSlotsToCoeffsRotationIndices, FindLinearTransformRotationIndices,
    FindLTRotationIndicesSwitch, and FindLTRotationIndicesSwitchArgmin no longer
    emit the redundant giant-step keys {2t, 3t, ..., (b-1)t}.
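
The Horner regrouping in the first bullet can be checked on plain vectors. The following is a minimal Python sketch, not OpenFHE code: `rot` is a cyclic left rotation standing in for Aut_k (the sign convention is an assumption), and the helper names, slot count, and block values are hypothetical.

```python
def rot(v, k):
    """Cyclic left rotation by k slots (plaintext stand-in for Aut_k)."""
    k %= len(v)
    return v[k:] + v[:k]


def add(a, b):
    """Slot-wise addition."""
    return [x + y for x, y in zip(a, b)]


def giant_forward(blocks, t):
    """Forward form: sum_i Aut_{i*t}(block_i).

    Returns the result and the set of nonzero rotation amounts used,
    i.e. the rotation keys a ciphertext evaluation would need.
    """
    n = len(blocks[0])
    acc = blocks[0]
    keys = set()
    for i in range(1, len(blocks)):
        acc = add(acc, rot(blocks[i], i * t))
        keys.add((i * t) % n)
    return acc, keys


def giant_horner(blocks, t):
    """Horner form: block_0 + Aut_t(block_1 + Aut_t(...)), innermost first."""
    n = len(blocks[0])
    acc = blocks[-1]
    keys = set()
    for blk in reversed(blocks[:-1]):
        acc = add(blk, rot(acc, t))  # always the same rotation amount t
        keys.add(t % n)
    return acc, keys


# Hypothetical sizes: 16 slots, b = 4 blocks, giant-step stride t = 3.
n, b, t = 16, 4, 3
blocks = [[100 * i + s for s in range(n)] for i in range(b)]
fwd, fwd_keys = giant_forward(blocks, t)
hor, hor_keys = giant_horner(blocks, t)
print(fwd == hor, sorted(fwd_keys), sorted(hor_keys))
```

In this toy setting both forms perform b-1 rotations and give the same vector, but the forward form rotates by three different amounts while the Horner form reuses t every time.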

Results

Rotation-key count, dev vs. this PR, CKKS bootstrapping (ring 2^16, UNIFORM,
budget {3,3}):

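| packing | keys (dev) | keys (new) | reduction |
|---------|------------|------------|-----------|
| full    | 54         | 49         | 1.10x     |
| 1/4     | 88         | 63         | 1.40x     |
| 1/16    | 81         | 57         | 1.42x     |
| 1/64    | 50         | 29         | 1.72x     |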

The reduction grows with sparsity (more giant steps relative to the work).
Scheme switching uses the identical technique for the same per-transform saving.
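
As a quick arithmetic check, the reduction column can be recomputed from the two key counts. This is only a sketch over the numbers reported above; the dictionary layout and variable names are illustrative.

```python
# (keys on dev, keys with this PR), copied from the results above
counts = {
    "full": (54, 49),
    "1/4": (88, 63),
    "1/16": (81, 57),
    "1/64": (50, 29),
}

for packing, (dev, new) in counts.items():
    print(f"{packing:>5}: {dev - new:2d} keys dropped, {dev / new:.2f}x")

# Ratios in table order, used to confirm the reduction grows with sparsity.
ratios = [dev / new for dev, new in counts.values()]
print(ratios == sorted(ratios))
```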

Not included

The SlotsToCoeffs decode-layer re-partition (moving the remainder group to scale 1
to mirror CoeffsToSlots) is intentionally left out -- the StC remainder still sits
at the largest-scale position, so the additional key sharing that enables is a
follow-up.

pascoec added 2 commits June 24, 2026 11:52
Reformulate the baby-step/giant-step (BSGS) evaluation of the CoeffsToSlots
and SlotsToCoeffs linear transforms so that the same slot permutations are
realized with far fewer distinct rotation (automorphism) keys. The keys these
transforms need dominate the memory cost of a bootstrapping context, and
EvalBootstrapKeyGen now generates a smaller, more heavily shared set. The
underlying math is unchanged.

Core changes to EvalCoeffsToSlots / EvalSlotsToCoeffs and their precompute:

- Horner single giant-step. Replace the forward outer sum
  sum_i Aut_{i*t}(block_i) with the nested Horner form
  block_0 + Aut_t(block_1 + Aut_t(...)). Both use the same rotation count, but
  Horner needs only one giant-step key per level (stride t = g*scale) instead
  of the b-1 distinct keys {t, 2t, ..., (b-1)t}.

- Zero-based hoisted inner rotations. Replace the centered inner baby-step
  indices {(j-offset)*sigma} with zero-based {j*sigma}. The j=0 term is now
  always rotation 0 and is handled directly by KeySwitchExt (no key), and the
  per-level offset delta_s = offset*scale is pushed offline into the
  precomputed plaintexts as a pre-rotation.

- Folded per-level corrections. The per-level zero-basing corrections (an
  O(levels) set of runtime EvalAtIndex calls in both transforms) are commuted
  forward and absorbed into the precomputed plaintexts, leaving a single
  accumulated rotation applied once at the end of SlotsToCoeffs. EvalMod is
  equivariant under slot rotations, so the CoeffsToSlots correction passes
  through it unchanged.

- Split, not combined, accumulated correction. Apply Aut_{-(slots-1)} at the
  end of CoeffsToSlots so it always outputs correctly-ordered slots, rather
  than deferring a single combined correction to the end of SlotsToCoeffs.
  This fixes EvalFBTNoDecoding + EvalHomDecoding (FBT_CONSECLEV), where a user
  operation between the two transforms previously saw a residual rotation.

- Sparse-packing index reduction. Reduce all BSGS rotation indices modulo
  min(2*slots, M/4). Under sparse packing the precomputed-plaintext vector has
  cyclic period 2*slots (the concatenated real/imaginary blocks), not M/4, so
  indices reduced only to [0, M/4) could otherwise be inconsistent with the
  period-2*slots plaintext pre-rotations. For full and half packing the modulus
  equals M/4 and behavior is unchanged.
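
The zero-basing step in the list above can be illustrated on plain vectors. This is a hedged Python sketch rather than the OpenFHE implementation: `rot` models a cyclic left rotation, `sigma` plays the role of the per-level scale, the direction of the plaintext pre-rotation follows from that assumed sign convention, and all function names and test values are hypothetical.

```python
def rot(v, k):
    """Cyclic left rotation by k slots (plaintext stand-in for Aut_k)."""
    k %= len(v)
    return v[k:] + v[:k]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def mul(a, b):
    return [x * y for x, y in zip(a, b)]


def baby_centered(x, diags, sigma, offset):
    """Centered inner sum: sum_j diag_j * Aut_{(j-offset)*sigma}(x)."""
    acc = [0] * len(x)
    for j, d in enumerate(diags):
        acc = add(acc, mul(d, rot(x, (j - offset) * sigma)))
    return acc


def baby_zero_based(x, diags, sigma, offset):
    """Zero-based inner sum with the offset moved into the plaintexts.

    Returns the accumulated vector and delta = offset*sigma; the result
    is the centered result rotated by +delta.
    """
    delta = offset * sigma
    pre_rotated = [rot(d, delta) for d in diags]  # done once, offline
    acc = [0] * len(x)
    for j, d in enumerate(pre_rotated):
        acc = add(acc, mul(d, rot(x, j * sigma)))  # j = 0 is rotation 0
    return acc, delta


# Hypothetical sizes: 16 slots, 5 diagonals, stride 2, offset 2.
x = list(range(1, 17))
diags = [[(j + 2) * (s + 1) % 7 for s in range(16)] for j in range(5)]
centered = baby_centered(x, diags, sigma=2, offset=2)
shifted, delta = baby_zero_based(x, diags, sigma=2, offset=2)
print(centered == rot(shifted, -delta))
```

In this model the two forms differ only by a single rotation by -delta, which is the kind of per-level correction the commit describes absorbing into later precomputed plaintexts instead of applying at runtime.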

The single-level linear transform (EvalLinearTransform, used when the level
budget is 1) is converted to the same single-giant-step Horner form, and
FindLinearTransformRotationIndices no longer emits the giant-step keys
{2g, 3g, ..., (h-1)g} that the forward form required.
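
The effect on the single-level index list can be sketched as follows. This is an illustrative Python model, not the body of FindLinearTransformRotationIndices: it assumes baby steps {1, ..., g-1} and h giant-step blocks, ignores any modular reduction of the indices, and the function name and `horner` flag are hypothetical.

```python
def lt_rotation_indices(g, h, horner=True):
    """Rotation indices for a g-by-h BSGS split (toy model).

    Baby steps are assumed to be 1..g-1. The forward form needs every
    giant step g, 2g, ..., (h-1)g; the Horner form needs only g.
    """
    baby = list(range(1, g))
    if horner:
        giant = [g] if h > 1 else []
    else:
        giant = [i * g for i in range(1, h)]
    return sorted(set(baby + giant))


forward = lt_rotation_indices(4, 4, horner=False)
single = lt_rotation_indices(4, 4, horner=True)
dropped = sorted(set(forward) - set(single))
print(forward, single, dropped)
```

For g = h = 4 the dropped indices are exactly 2g and 3g, matching the set {2g, 3g, ..., (h-1)g} named above.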

Supporting cleanups (no behavior change):
- Inline the precomputed rot_in index tables in EvalCoeffsToSlots /
  EvalSlotsToCoeffs (compute per level, drop the 2D allocation and a redundant
  scale pass).
- Fix the over-large reserve() in the Find*RotationIndices helpers (they
  reserved ~M entries for a list of a few hundred).
- Hoist a redundant KeySwitchExt out of the EvalLinearTransform giant-step
  loop, and hoist repeated GetParams()/GetElementAtIndex() calls in
  ExtendCiphertext.
- Take crypto parameters by const reference and compute the bootstrap scale
  factor with std::ldexp.
- Add Doxygen for the transform functions and a note in CKKS_BOOTSTRAPPING.md.

Not included: the SlotsToCoeffs decode-layer re-partition that moves the
remainder group to scale 1 to mirror CoeffsToSlots. The StC remainder still
sits at the last (largest-scale) position, so the additional key sharing that
re-partition enables (e.g. dropping the StC-specific large-scale remainder key)
is not realized here.
pascoec added this to the Release 1.6.0 milestone Jun 24, 2026
pascoec self-assigned this Jun 24, 2026
pascoec added the cleanup and optimization labels Jun 24, 2026
pascoec requested a review from yspolyakov June 25, 2026 02:13
Development

Successfully merging this pull request may close these issues.

Implement CKKS bootstrapping linear transform optimizations from FIDESlib
