Reduce CKKS rotation-key footprint via single-giant-step Horner BSGS #1209
Open
pascoec wants to merge 3 commits into
Reformulate the baby-step/giant-step (BSGS) evaluation of the CoeffsToSlots
and SlotsToCoeffs linear transforms so that the same slot permutations are
realized with far fewer distinct rotation (automorphism) keys. The keys these
transforms need dominate the memory cost of a bootstrapping context, and
EvalBootstrapKeyGen now generates a smaller, more heavily shared set. The
underlying math is unchanged.
Core changes to EvalCoeffsToSlots / EvalSlotsToCoeffs and their precompute:
- Horner single giant-step. Replace the forward outer sum
sum_i Aut_{i*t}(block_i) with the nested Horner form
block_0 + Aut_t(block_1 + Aut_t(...)). Both use the same rotation count, but
Horner needs only one giant-step key per level (stride t = g*scale) instead
of the b-1 distinct keys {t, 2t, ..., (b-1)t}.
- Zero-based hoisted inner rotations. Replace the centered inner baby-step
indices {(j-offset)*sigma} with zero-based {j*sigma}. The j=0 term is now
always rotation 0 and is handled directly by KeySwitchExt (no key), and the
per-level offset delta_s = offset*scale is pushed offline into the
precomputed plaintexts as a pre-rotation.
- Folded per-level corrections. The per-level zero-basing corrections (an
O(levels) set of runtime EvalAtIndex calls in both transforms) are commuted
forward and absorbed into the precomputed plaintexts, leaving a single
accumulated rotation applied once at the end of SlotsToCoeffs. EvalMod is
equivariant under slot rotations, so the CoeffsToSlots correction passes
through it unchanged.
- Split, not combined, accumulated correction. Apply Aut_{-(slots-1)} at the
end of CoeffsToSlots so it always outputs correctly-ordered slots, rather
than deferring a single combined correction to the end of SlotsToCoeffs.
This fixes EvalFBTNoDecoding + EvalHomDecoding (FBT_CONSECLEV), where a user
operation between the two transforms previously saw a residual rotation.
- Sparse-packing index reduction. Reduce all BSGS rotation indices modulo
min(2*slots, M/4). Under sparse packing the precomputed-plaintext vector has
cyclic period 2*slots (the concatenated real/imaginary blocks), not M/4, so
indices reduced only to [0, M/4) could otherwise be inconsistent with the
period-2*slots plaintext pre-rotations. For full and half packing the modulus
equals M/4 and behavior is unchanged.
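The Horner rewrite in the first bullet can be checked on plain vectors. The following is a minimal sketch, not the library code: it assumes a toy model in which Aut_k is a cyclic rotation of an integer vector by k, and the names Rotate, Add, GiantStepForward, and GiantStepHorner are hypothetical. Each call records the rotation index it uses, standing in for the rotation key a real evaluation would need.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using Vec = std::vector<long>;

// Toy stand-in for Aut_k: cyclic left rotation by k slots.
// Every nonzero index is recorded in `keys`, mimicking a required rotation key.
Vec Rotate(const Vec& v, std::size_t k, std::set<std::size_t>& keys) {
    const std::size_t n = v.size();
    assert(n > 0);
    k %= n;
    if (k == 0) return v;  // rotation 0 needs no key
    keys.insert(k);
    Vec out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = v[(i + k) % n];
    return out;
}

// Slot-wise addition of two equally sized vectors.
Vec Add(Vec a, const Vec& b) {
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += b[i];
    return a;
}

// Forward form: sum_i Aut_{i*t}(block_i). Uses indices t, 2t, ..., (b-1)t.
Vec GiantStepForward(const std::vector<Vec>& blocks, std::size_t t,
                     std::set<std::size_t>& keys) {
    assert(!blocks.empty());
    Vec acc = blocks[0];
    for (std::size_t i = 1; i < blocks.size(); ++i)
        acc = Add(acc, Rotate(blocks[i], i * t, keys));
    return acc;
}

// Horner form: block_0 + Aut_t(block_1 + Aut_t(...)). Uses index t only.
Vec GiantStepHorner(const std::vector<Vec>& blocks, std::size_t t,
                    std::set<std::size_t>& keys) {
    assert(!blocks.empty());
    Vec acc = blocks.back();
    for (std::size_t i = blocks.size() - 1; i-- > 0;)
        acc = Add(blocks[i], Rotate(acc, t, keys));
    return acc;
}
```

With b blocks both functions perform b-1 rotations, but the forward form touches b-1 distinct indices while the Horner form reuses the single stride t. The sketch relies only on rotations composing additively and distributing over addition, which is the property the commit message appeals to.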
The single-level linear transform (EvalLinearTransform, used when the level
budget is 1) is converted to the same single-giant-step Horner form, and
FindLinearTransformRotationIndices no longer emits the giant-step keys
{2g, 3g, ..., (h-1)g} that the forward form required.
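To make the effect on the emitted index list concrete, here is a rough sketch of a rotation-index helper for a single-level transform with g baby steps and h giant steps. BsgsRotationIndices is a hypothetical function written for illustration, not the body of FindLinearTransformRotationIndices, and the baby-step set {1, ..., g-1} is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: rotation indices needed by a single-level BSGS
// transform with g baby steps and h giant steps.
// horner == false models the forward form, horner == true the Horner form.
std::vector<std::int32_t> BsgsRotationIndices(std::int32_t g, std::int32_t h, bool horner) {
    assert(g >= 1 && h >= 1);
    std::vector<std::int32_t> indices;
    indices.reserve(static_cast<std::size_t>(g + h));  // tight upper bound on the list size

    // Baby steps 1 .. g-1 (assumed layout; rotation 0 needs no key).
    for (std::int32_t j = 1; j < g; ++j) indices.push_back(j);

    if (h > 1) {
        indices.push_back(g);  // the giant-step stride, needed by both forms
        if (!horner) {
            // Extra giant-step keys 2g .. (h-1)g required only by the forward form.
            for (std::int32_t i = 2; i < h; ++i) indices.push_back(i * g);
        }
    }
    return indices;
}
```

Under these assumptions, g = h = 8 gives 14 indices for the forward form and 8 for the Horner form, and the gap widens as h grows.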
Supporting cleanups (no behavior change):
- Inline the precomputed rot_in index tables in EvalCoeffsToSlots /
EvalSlotsToCoeffs (compute per level, drop the 2D allocation and a redundant
scale pass).
- Fix the over-large reserve() in the Find*RotationIndices helpers (they
reserved ~M entries for a list of a few hundred).
- Hoist a redundant KeySwitchExt out of the EvalLinearTransform giant-step
loop, and hoist repeated GetParams()/GetElementAtIndex() calls in
ExtendCiphertext.
- Take crypto parameters by const reference and compute the bootstrap scale
factor with std::ldexp.
- Add Doxygen for the transform functions and a note in CKKS_BOOTSTRAPPING.md.
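As a small illustration of the std::ldexp cleanup, the sketch below scales a value by a power of two. ScaleByPowerOfTwo and its exponent parameter are hypothetical; the actual expression used for the bootstrap scale factor is not shown in this description.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: multiply `value` by 2^exponent.
// std::ldexp adjusts the binary exponent directly, so the result is exact
// whenever it is representable, with no call to std::pow.
double ScaleByPowerOfTwo(double value, int exponent) {
    assert(std::isfinite(value));
    return std::ldexp(value, exponent);
}
```

For example, ScaleByPowerOfTwo(1.0, 10) is 1024.0 and ScaleByPowerOfTwo(3.0, -2) is 0.75.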
Not included: the SlotsToCoeffs decode-layer re-partition that moves the
remainder group to scale 1 to mirror CoeffsToSlots. The StC remainder still
sits at the last (largest-scale) position, so the additional key sharing that
re-partition enables (e.g. dropping the StC-specific large-scale remainder key)
is not realized here.
Summary
The baby-step/giant-step (BSGS) linear transforms used throughout CKKS -- the
bootstrapping homomorphic FFT (CoeffsToSlots / SlotsToCoeffs) and the CKKS<->FHEW
scheme-switching transforms -- generate one rotation (automorphism) key per giant
step. These keys dominate the memory footprint of a bootstrapping / scheme-
switching context. This PR reformulates the giant-step accumulation into Horner
form, which needs only a single giant-step key per level instead of (b-1) distinct
keys, then drops the now-unneeded keys from the matching key generation. The
underlying math is unchanged; results are equivalent.
What changed
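- Horner single giant-step. The giant-step accumulation is evaluated as
block_0 + Aut_t(block_1 + Aut_t(...)), using one giant-step key (t = g*scale)
instead of the b-1 keys {t, 2t, ..., (b-1)t}. Applied to every BSGS transform in
both subsystems.
- Zero-based inner rotations. The j=0 baby step is always rotation 0 (handled by
KeySwitchExt, no key); the per-level offset is pushed offline into the
precomputed plaintexts.
- Folded per-level corrections. The per-level corrections are absorbed into the
precomputed plaintexts, leaving an accumulated rotation applied at the end of
SlotsToCoeffs and replacing the previous O(levels) EvalAtIndex calls; the
correction is split so CoeffsToSlots still emits correctly-ordered slots (fixes
FBT_CONSECLEV).
- Sparse-packing index reduction. BSGS rotation indices are reduced modulo
min(2*slots, M/4), which keeps them consistent with the period-(2*slots)
plaintext pre-rotations (no-op for full and half packing).
- Key generation. [...]Indices, FindLinearTransformRotationIndices,
FindLTRotationIndicesSwitch, and FindLTRotationIndicesSwitchArgmin no longer
emit the giant-step keys.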
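The sparse-packing index reduction can be sketched as a single helper. This is an illustration under stated assumptions rather than the PR's code: ReduceRotationIndex is a hypothetical name, M is taken to be the cyclotomic order (so M/4 is the full-packing period), and negative indices are assumed to be mapped into [0, modulus).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: reduce a BSGS rotation index modulo min(2*slots, M/4).
// Under sparse packing the modulus is 2*slots; for full and half packing
// 2*slots >= M/4, so the modulus stays M/4.
std::uint32_t ReduceRotationIndex(std::int64_t index, std::uint32_t slots, std::uint32_t M) {
    const std::int64_t modulus =
        std::min<std::int64_t>(2 * static_cast<std::int64_t>(slots), M / 4);
    assert(modulus > 0);
    std::int64_t r = index % modulus;  // may be negative for negative index
    if (r < 0) r += modulus;           // normalize into [0, modulus)
    return static_cast<std::uint32_t>(r);
}
```

With M = 2^17 and 8 slots the modulus is 16, so index 20 reduces to 4 and index -3 to 13. With 32768 or 16384 slots the modulus is M/4 = 32768, matching the previous behavior.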
Results
Rotation-key count, dev vs. this PR, CKKS bootstrapping (ring 2^16, UNIFORM,
budget {3,3}):
The reduction grows with sparsity (more giant steps relative to the work).
Scheme switching uses the identical technique for the same per-transform saving.
Not included
The SlotsToCoeffs decode-layer re-partition (moving the remainder group to scale 1
to mirror CoeffsToSlots) is intentionally left out -- the StC remainder still sits
at the largest-scale position, so the additional key sharing that re-partition
would enable is left to a follow-up.