fix(p2p)!: resolve checkpoint tips from stored ids and fail loudly on corruption#23968
Open
spalladino wants to merge 7 commits into
Open
fix(p2p)!: resolve checkpoint tips from stored ids and fail loudly on corruption#23968spalladino wants to merge 7 commits into
spalladino wants to merge 7 commits into
Conversation
af77cdd to
2cbffe8
Compare
Base automatically changed from
spl/fix-checkpoint-zero-l2-tips-store
to
merge-train/spartan-v5
June 9, 2026 20:47
2cbffe8 to
b02c3e0
Compare
7386e22 to
32f2b37
Compare
spalladino
commented
Jun 10, 2026
Comment on lines
+746
to
+748
| // The local tips store never leads the checkpointed tip, so target-slot preparation (gated on a | ||
| // proposed checkpoint ahead of checkpointed) was already unreachable here; always prepare the current slot. | ||
| const { currentSlot: slot } = this.epochCache.getCurrentAndNextSlot(); |
PhilWindle
pushed a commit
that referenced
this pull request
Jun 10, 2026
…all clock (#23978) The prepare-for-slot loop in p2p client was **not** synced with the `L2BlockStream` events, meanining the `unprotect` call could trigger before the blocks-added flagged the txs as mined. One solution could've been to add a new event to the blockstream on `slot-synced`, but it's easier to just remove the polling, and unprotect slots when a block proposal that protected the txs fails. As a safeguard, we still call unprotect based on slot numbers on mined blocks. ## Problem The tx pool **protects** txs referenced by an in-flight block proposal: on gossip receipt, the proposal's txs are keyed to its slot and removed from the pending indices, so the local builder cannot re-select them and eviction cannot drop them while the proposal may still land. `prepareForSlot(S)` releases protections from slots before `S`, revalidates the txs, and returns them to pending. Release was driven by a wall-clock slot monitor polling the epoch cache every tick. Three problems: - **Race against mined-marking.** The monitor can fire after a proposal's checkpoint lands on L1 but before the block stream delivers `blocks-added`. The just-landed txs are unprotected into pending, where eviction or nullifier-conflict resolution can delete them; when the block then syncs there is nothing left to mark mined, and after a later reorg `handlePrunedBlocks` has nothing to restore — the tx is lost to the pool. - **Clock dependency.** The epoch cache is wall-clock derived and explicitly depends on system clock sync; unprotection correctness should depend on observed chain state instead. - **Pipelining blind spot.** Gossiped proposals carry future target slots during proposer pipelining, so wall-clock release frees them late. (The old target-slot branch that tried to address this read `proposedCheckpoint` from the local tips store, where it can never lead the checkpointed tip — removed in #23968 as dead code.) ## Change The protection lifecycle becomes fully event-driven and the slot monitor is deleted: - **Protect** on gossip receipt of a block proposal (unchanged). - **Release on local validation failure**: a proposal that fails validation immediately releases the protections it created — only entries still keyed to that proposal's slot, so a tx also referenced by a live proposal at another slot stays protected. - **Resolve via chain events** (unchanged): `blocks-added` marks txs mined, superseding protection; `chain-pruned` un-mines them back to pending. Proposals that landed as proposed-but-unconfirmed checkpoints and are later orphaned are fully handled by this existing lifecycle. - **Collect silent deaths via synced block slots**: `prepareForSlot` now runs inside the `blocks-added` handler with the slot of the last synced block, after mined-marking. Any block landing at slot S releases protections from all earlier slots — covering proposals that never reached L1 at all (no quorum, proposer crash, dropped L1 tx), for which no chain event can ever fire. Because it is ordered after mined-marking in the same handler, the unprotect-before-mined race is impossible by construction. - **Proposers are unaffected**: the sequencer already calls `prepareForSlot(targetSlot)` directly before building, which remains the one legitimate ahead-of-chain preparation. ## Trade-off accepted During a multi-slot stall with no blocks landing anywhere, non-proposer pools retain protections until the first block lands (the wall clock used to release them mid-stall). There is no user-visible cost — there is no chain to include txs in during a stall — and the memory held is bounded and self-healing on the first `blocks-added`. Proposers self-serve via the direct sequencer call. Protections are in-memory only, so a restart clears them. Fixes A-1173
Collaborator
Flakey Tests🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry. |
…corruption No L2Tips provider may silently report checkpoint zero for a cursor that points at a real (non-genesis) block. getCheckpointId now throws for all four checkpoint-bearing cursors (checkpointed, proposedCheckpoint, proven, finalized) when a real block has no resolvable checkpoint, but only on genuine store corruption — never on a legitimate skipped-history prune. - Carry a checkpoint id in chain-proven and chain-finalized events (populated from sourceTips), so a cursor that legitimately leads the locally-checkpointed frontier can still resolve its checkpoint. - Store a checkpoint id per cursor (new l2_tip_checkpoints map in the KV store; a Map in the memory store). It is written wherever a checkpoint-bearing cursor advances: chain-checkpointed (checkpointed and, when advanced, proposedCheckpoint), chain-proven, chain-finalized. - In handleChainPruned, for each cursor clamped backward, resolve a consistent id from the local block->checkpoint mapping; when the prune target is a synced-but-unmapped block (skipped history) clamp that cursor to genesis instead, so it re-syncs rather than sitting on an id-less block. The prune handler never throws. - getCheckpointId order: genesis -> stored per-cursor id -> block->checkpoint mapping (back-compat for stores written before per-cursor ids) -> throw. Because the writers never leave a cursor on an id-less real block, the throw is reachable only on genuine corruption. The pxe l2_tip_checkpoints sub-store is read-defaultable: existing on-disk DBs start with it empty, getTipCheckpoint returns undefined, and resolution falls back to the block->checkpoint mapping, so no PXE_DATA_SCHEMA_VERSION bump is needed. World-state is intentionally left unchanged: its merkle trees live in a native store with no shared TS transaction, so checkpoint tracking cannot be updated atomically with merkle commits/unwinds without risking drift, and no consumer reads world-state checkpoint ids for correctness.
…ct store from the tips store Now that every checkpoint-bearing tip in L2TipsStoreBase resolves its CheckpointId from a per-tip stored id (getTipCheckpoint/setTipCheckpoint), the block->checkpoint number mapping and the checkpoint-object store are redundant: they existed only to feed getCheckpointId. Remove both backing maps from the KV and memory stores along with their abstract methods, and collapse getCheckpointId to return the genesis id for block 0 and the stored per-tip id otherwise, throwing on a real block with no recorded id. The chain-pruned event now carries the source's confirmed checkpointed tip (checkpointed: L2TipId) instead of a bare CheckpointId. The prune handler clamps any checkpoint-bearing cursor that leads that tip down to it, always landing on a block with a valid recorded id; there is no longer a genesis-clamp or mapping lookup. handleChainFinalized prunes only block hashes below the lowest live tip. This is a breaking storage change (two PXE sub-stores removed); PXE_DATA_SCHEMA_VERSION is bumped so DatabaseVersionManager wipes pre-existing DBs rather than reading stale layouts.
…ider The local L2-block-stream tips providers (world-state, l2-tips-store) no longer carry proposedCheckpoint. In those local stores it is degenerate (always equal to checkpointed) and no consumer reads it: the only reader, in p2p_client's maybeCallPrepareForSlot, was a dead branch that was always false because the local store never leads the checkpointed tip. L2TipsProvider.getL2Tips now returns LocalL2Tips (Omit<L2Tips, 'proposedCheckpoint'>), the L2TipsStoreBase stops maintaining the cursor, and the dead p2p reader is removed. L2BlockSource.getL2Tips is a separate interface and keeps the full L2Tips with proposedCheckpoint, which the sequencer and node still read from the archiver.
… layout The L2 tips store dropped the persisted block->checkpoint mapping and now resolves checkpoint tips exclusively from per-tip checkpoint ids. An upgraded store that kept its old tips with an empty l2_tip_checkpoints map would make getL2Tips throw on every read with no way to self-heal, failing P2PClient startup. Bumping the schema version resets the store on upgrade.
The chain-pruned event carried only the source checkpointed tip, so a prune that rolled back the proven chain clamped the local proven cursor onto the checkpointed tip, transiently reporting unproven blocks as proven until the corrective chain-proven event landed at the end of the same sync iteration. Carry the source proven tip in the event and clamp each cursor to its own tip.
…tate tips provider The block stream's local data provider demanded full checkpoint-bearing tips, forcing world-state to fabricate genesis checkpoint ids and a checkpointed tip at block zero that violated the finalized <= checkpointed invariant. Narrow the provider contract to the tips the stream actually reads (checkpointed is only required when emitting checkpoint events) and have the stream fail loudly if checkpoint emission is enabled without one. World-state now reports only the proposed, proven, and finalized blocks it genuinely tracks; the full LocalL2Tips shape remains on L2TipsProvider for the p2p and pxe tips stores.
…aring events The CheckpointStore redesign landed on the merge train a prover-node consumer of chain-pruned written against the old event shape while this branch reshaped it: chain-pruned now carries checkpointed/proven tips instead of a bare checkpoint id, and chain-proven/chain-finalized carry checkpoint ids. Read the pruned checkpoint from event.checkpointed.checkpoint (same semantics as the old field) and update the test event constructors, including a hash() on the fake Checkpoint now that the tips store records checkpoint ids on checkpoint events.
32f2b37 to
15cc304
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
#23967 stopped the checkpoint-replay storm at its source (a prune must not advance checkpoint-bearing cursors onto an uncheckpointed block). This PR makes the local tips store never silently report checkpoint zero for a real block, removes the machinery and degenerate fields that made the silent path possible, and closes the gaps found in review: a store-upgrade path that would brick p2p nodes, a transient proven-tip overshoot on prune, and world-state fabricating checkpoint ids it never had.
Fixes A-1174
Commits
1.
fix(p2p): resolve checkpoint tips from stored ids and fail loudly on corruptionchain-proven/chain-finalizedcarry theirCheckpointId; the store records a checkpoint id per cursor.getCheckpointIdresolves genesis → stored id → (back-compat)block→checkpointmapping → throw, so a real-block cursor with no resolvable checkpoint fails loudly instead of degrading to a replay.proven/finalizedcursor can legitimately lead the locally-checkpointed frontier (batch lag /startingBlock); carrying the id lets it resolve without a local mapping, so the throw only fires on genuine corruption (never on a legitimate skipped-history prune).2.
refactor(p2p): drop the block→checkpoint mapping and checkpoint-object storeblock→checkpointmapping and the checkpoint-object store are dead weight (they existed only to feedgetCheckpointId). Both backing maps are removed from the KV and memory stores.chain-prunednow carriescheckpointed: L2TipId(the source's confirmed checkpointed tip) instead of a bareCheckpointId. The prune handler clamps any checkpoint-bearing cursor that leads that tip down to it, always landing on a block with a recorded id — no genesis-clamp, no mapping lookup.handleChainFinalizedcollapses to pruning block hashes below the lowest live tip.3.
refactor(stdlib): drop proposedCheckpoint from the local L2 tips providerproposedCheckpointis degenerate in the local stores (always equal tocheckpointed) and no consumer reads it — the one reader,p2p_client'smaybeCallPrepareForSlot, was a dead branch (always false).L2TipsProvider.getL2Tipsnow returnsLocalL2Tips = Omit<L2Tips, 'proposedCheckpoint'>; the local stores stop maintaining the cursor; the dead p2p branch is removed.L2BlockSource.getL2Tipsis a separate interface and keeps the fullL2TipswithproposedCheckpoint, which the sequencer and node still read from the archiver.4.
fix(p2p): bump p2p store schema version for the per-tip checkpoint id layoutgetL2Tipsthrow on every read with no way to self-heal — failingP2PClient.start()outright. Bumping the store schema version (3 → 4) resets the store on upgrade instead.5.
fix(p2p): clamp the proven tip to the source proven tip on prunechain-prunedcarried only the checkpointed tip, so a prune that rolled back the proven chain clamped the local proven cursor onto the (higher) checkpointed tip, transiently reporting unproven blocks as proven until the correctivechain-provenevent landed at the end of the same sync iteration. The event now also carriesproven: L2TipIdand each cursor clamps to its own source tip.6.
refactor(world-state): stop fabricating checkpoint ids in the world-state tips providercheckpointedtip at block zero (violatingfinalized ≤ checkpointed) that nothing ever read. The provider contract is narrowed toLocalChainTips— the tips the stream actually consumes, withcheckpointedrequired only when emitting checkpoint events — and the stream fails loudly if checkpoint emission is enabled without one. World-state now reports only the proposed/proven/finalized blocks it genuinely tracks;L2TipsProviderkeeps the fullLocalL2Tipsshape for the p2p/pxe tips stores.7.
fix(prover-node): consume the reshaped chain-pruned and checkpoint-bearing eventschain-pruned(written against the old event shape) on the merge train while this branch was in flight. The prune handler now readsevent.checkpointed.checkpoint(same semantics as the old field) and the test event constructors are updated to the new shapes.Breaking / operational notes
PXE_DATA_SCHEMA_VERSIONbumped: existing PXE DBs are wiped and resync on first open.L2BlockStreamEventshape changes (internal API):chain-proven/chain-finalizedgaincheckpoint;chain-prunedcarriescheckpointed/proventips instead of a barecheckpoint.Deferred
maybeCallPrepareForSlot's target-slot preparation (prepare the target slot when a proposed checkpoint exists) never worked, because the local store cannot represent a proposed checkpoint ahead of the checkpointed tip. This is handled in #23978.