Skip to content

fix(p2p)!: resolve checkpoint tips from stored ids and fail loudly on corruption#23968

Open
spalladino wants to merge 7 commits into
merge-train/spartan-v5from
spl/fix-tips-checkpoint-ids-fail-loud
Open

fix(p2p)!: resolve checkpoint tips from stored ids and fail loudly on corruption#23968
spalladino wants to merge 7 commits into
merge-train/spartan-v5from
spl/fix-tips-checkpoint-ids-fail-loud

Conversation

@spalladino

@spalladino spalladino commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Motivation

#23967 stopped the checkpoint-replay storm at its source (a prune must not advance checkpoint-bearing cursors onto an uncheckpointed block). This PR makes the local tips store never silently report checkpoint zero for a real block, removes the machinery and degenerate fields that made the silent path possible, and closes the gaps found in review: a store-upgrade path that would brick p2p nodes, a transient proven-tip overshoot on prune, and world-state fabricating checkpoint ids it never had.

Fixes A-1174

Commits

1. fix(p2p): resolve checkpoint tips from stored ids and fail loudly on corruption

  • chain-proven/chain-finalized carry their CheckpointId; the store records a checkpoint id per cursor. getCheckpointId resolves genesis → stored id → (back-compat) block→checkpoint mapping → throw, so a real-block cursor with no resolvable checkpoint fails loudly instead of degrading to a replay.
  • A proven/finalized cursor can legitimately lead the locally-checkpointed frontier (batch lag / startingBlock); carrying the id lets it resolve without a local mapping, so the throw only fires on genuine corruption (never on a legitimate skipped-history prune).

2. refactor(p2p): drop the block→checkpoint mapping and checkpoint-object store

  • With per-cursor stored ids, the block→checkpoint mapping and the checkpoint-object store are dead weight (they existed only to feed getCheckpointId). Both backing maps are removed from the KV and memory stores.
  • chain-pruned now carries checkpointed: L2TipId (the source's confirmed checkpointed tip) instead of a bare CheckpointId. The prune handler clamps any checkpoint-bearing cursor that leads that tip down to it, always landing on a block with a recorded id — no genesis-clamp, no mapping lookup. handleChainFinalized collapses to pruning block hashes below the lowest live tip.

3. refactor(stdlib): drop proposedCheckpoint from the local L2 tips provider

  • proposedCheckpoint is degenerate in the local stores (always equal to checkpointed) and no consumer reads it — the one reader, p2p_client's maybeCallPrepareForSlot, was a dead branch (always false). L2TipsProvider.getL2Tips now returns LocalL2Tips = Omit<L2Tips, 'proposedCheckpoint'>; the local stores stop maintaining the cursor; the dead p2p branch is removed.
  • L2BlockSource.getL2Tips is a separate interface and keeps the full L2Tips with proposedCheckpoint, which the sequencer and node still read from the archiver.

4. fix(p2p): bump p2p store schema version for the per-tip checkpoint id layout

  • An upgraded p2p store would keep its old tips with an empty per-cursor id map, making getL2Tips throw on every read with no way to self-heal — failing P2PClient.start() outright. Bumping the store schema version (3 → 4) resets the store on upgrade instead.

5. fix(p2p): clamp the proven tip to the source proven tip on prune

  • chain-pruned carried only the checkpointed tip, so a prune that rolled back the proven chain clamped the local proven cursor onto the (higher) checkpointed tip, transiently reporting unproven blocks as proven until the corrective chain-proven event landed at the end of the same sync iteration. The event now also carries proven: L2TipId and each cursor clamps to its own source tip.

6. refactor(world-state): stop fabricating checkpoint ids in the world-state tips provider

  • The stream's local-data-provider contract demanded full checkpoint-bearing tips, forcing world-state to hardcode genesis checkpoint ids and a checkpointed tip at block zero (violating finalized ≤ checkpointed) that nothing ever read. The provider contract is narrowed to LocalChainTips — the tips the stream actually consumes, with checkpointed required only when emitting checkpoint events — and the stream fails loudly if checkpoint emission is enabled without one. World-state now reports only the proposed/proven/finalized blocks it genuinely tracks; L2TipsProvider keeps the full LocalL2Tips shape for the p2p/pxe tips stores.

7. fix(prover-node): consume the reshaped chain-pruned and checkpoint-bearing events

  • The CheckpointStore redesign landed a prover-node consumer of chain-pruned (written against the old event shape) on the merge train while this branch was in flight. The prune handler now reads event.checkpointed.checkpoint (same semantics as the old field) and the test event constructors are updated to the new shapes.

Breaking / operational notes

  • PXE_DATA_SCHEMA_VERSION bumped: existing PXE DBs are wiped and resync on first open.
  • p2p store schema version bumped: existing p2p data dirs (including the tx pool) are wiped and resync on upgrade.
  • L2BlockStreamEvent shape changes (internal API): chain-proven/chain-finalized gain checkpoint; chain-pruned carries checkpointed/proven tips instead of a bare checkpoint.

Deferred

maybeCallPrepareForSlot's target-slot preparation (prepare the target slot when a proposed checkpoint exists) never worked, because the local store cannot represent a proposed checkpoint ahead of the checkpointed tip. This is handled in #23978.

@spalladino spalladino force-pushed the spl/fix-tips-checkpoint-ids-fail-loud branch from af77cdd to 2cbffe8 Compare June 9, 2026 20:24
Base automatically changed from spl/fix-checkpoint-zero-l2-tips-store to merge-train/spartan-v5 June 9, 2026 20:47
@spalladino spalladino force-pushed the spl/fix-tips-checkpoint-ids-fail-loud branch from 2cbffe8 to b02c3e0 Compare June 9, 2026 21:11
@spalladino spalladino force-pushed the spl/fix-tips-checkpoint-ids-fail-loud branch from 7386e22 to 32f2b37 Compare June 10, 2026 15:52
Comment on lines +746 to +748
// The local tips store never leads the checkpointed tip, so target-slot preparation (gated on a
// proposed checkpoint ahead of checkpointed) was already unreachable here; always prepare the current slot.
const { currentSlot: slot } = this.epochCache.getCurrentAndNextSlot();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going away in #23978 anyway

@spalladino spalladino changed the title fix(p2p): resolve checkpoint tips from stored ids and fail loudly on corruption fix(p2p)!: resolve checkpoint tips from stored ids and fail loudly on corruption Jun 10, 2026
PhilWindle pushed a commit that referenced this pull request Jun 10, 2026
…all clock (#23978)

The prepare-for-slot loop in p2p client was **not** synced with the
`L2BlockStream` events, meanining the `unprotect` call could trigger
before the blocks-added flagged the txs as mined.

One solution could've been to add a new event to the blockstream on
`slot-synced`, but it's easier to just remove the polling, and unprotect
slots when a block proposal that protected the txs fails. As a
safeguard, we still call unprotect based on slot numbers on mined
blocks.

## Problem

The tx pool **protects** txs referenced by an in-flight block proposal:
on gossip receipt, the proposal's txs are keyed to its slot and removed
from the pending indices, so the local builder cannot re-select them and
eviction cannot drop them while the proposal may still land.
`prepareForSlot(S)` releases protections from slots before `S`,
revalidates the txs, and returns them to pending.

Release was driven by a wall-clock slot monitor polling the epoch cache
every tick. Three problems:

- **Race against mined-marking.** The monitor can fire after a
proposal's checkpoint lands on L1 but before the block stream delivers
`blocks-added`. The just-landed txs are unprotected into pending, where
eviction or nullifier-conflict resolution can delete them; when the
block then syncs there is nothing left to mark mined, and after a later
reorg `handlePrunedBlocks` has nothing to restore — the tx is lost to
the pool.
- **Clock dependency.** The epoch cache is wall-clock derived and
explicitly depends on system clock sync; unprotection correctness should
depend on observed chain state instead.
- **Pipelining blind spot.** Gossiped proposals carry future target
slots during proposer pipelining, so wall-clock release frees them late.
(The old target-slot branch that tried to address this read
`proposedCheckpoint` from the local tips store, where it can never lead
the checkpointed tip — removed in #23968 as dead code.)

## Change

The protection lifecycle becomes fully event-driven and the slot monitor
is deleted:

- **Protect** on gossip receipt of a block proposal (unchanged).
- **Release on local validation failure**: a proposal that fails
validation immediately releases the protections it created — only
entries still keyed to that proposal's slot, so a tx also referenced by
a live proposal at another slot stays protected.
- **Resolve via chain events** (unchanged): `blocks-added` marks txs
mined, superseding protection; `chain-pruned` un-mines them back to
pending. Proposals that landed as proposed-but-unconfirmed checkpoints
and are later orphaned are fully handled by this existing lifecycle.
- **Collect silent deaths via synced block slots**: `prepareForSlot` now
runs inside the `blocks-added` handler with the slot of the last synced
block, after mined-marking. Any block landing at slot S releases
protections from all earlier slots — covering proposals that never
reached L1 at all (no quorum, proposer crash, dropped L1 tx), for which
no chain event can ever fire. Because it is ordered after mined-marking
in the same handler, the unprotect-before-mined race is impossible by
construction.
- **Proposers are unaffected**: the sequencer already calls
`prepareForSlot(targetSlot)` directly before building, which remains the
one legitimate ahead-of-chain preparation.

## Trade-off accepted

During a multi-slot stall with no blocks landing anywhere, non-proposer
pools retain protections until the first block lands (the wall clock
used to release them mid-stall). There is no user-visible cost — there
is no chain to include txs in during a stall — and the memory held is
bounded and self-healing on the first `blocks-added`. Proposers
self-serve via the direct sequencer call. Protections are in-memory
only, so a restart clears them.

Fixes A-1173
@AztecBot

Copy link
Copy Markdown
Collaborator

Flakey Tests

🤖 says: This CI run detected 1 tests that failed, but were tolerated due to a .test_patterns.yml entry.

\033FLAKED\033 (8;;http://ci.aztec-labs.com/e7a837fba96bbcaa�e7a837fba96bbcaa8;;�):  yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_p2p/sentinel_status_slash.parallel.test.ts "slashes the proposer with INACTIVITY when checkpoint validation records invalid" (203s) (code: 0) group:e2e-p2p-epoch-flakes

…corruption

No L2Tips provider may silently report checkpoint zero for a cursor that
points at a real (non-genesis) block. getCheckpointId now throws for all
four checkpoint-bearing cursors (checkpointed, proposedCheckpoint, proven,
finalized) when a real block has no resolvable checkpoint, but only on
genuine store corruption — never on a legitimate skipped-history prune.

- Carry a checkpoint id in chain-proven and chain-finalized events
  (populated from sourceTips), so a cursor that legitimately leads the
  locally-checkpointed frontier can still resolve its checkpoint.
- Store a checkpoint id per cursor (new l2_tip_checkpoints map in the KV
  store; a Map in the memory store). It is written wherever a
  checkpoint-bearing cursor advances: chain-checkpointed (checkpointed and,
  when advanced, proposedCheckpoint), chain-proven, chain-finalized.
- In handleChainPruned, for each cursor clamped backward, resolve a
  consistent id from the local block->checkpoint mapping; when the prune
  target is a synced-but-unmapped block (skipped history) clamp that cursor
  to genesis instead, so it re-syncs rather than sitting on an id-less
  block. The prune handler never throws.
- getCheckpointId order: genesis -> stored per-cursor id -> block->checkpoint
  mapping (back-compat for stores written before per-cursor ids) -> throw.
  Because the writers never leave a cursor on an id-less real block, the
  throw is reachable only on genuine corruption.

The pxe l2_tip_checkpoints sub-store is read-defaultable: existing on-disk
DBs start with it empty, getTipCheckpoint returns undefined, and resolution
falls back to the block->checkpoint mapping, so no PXE_DATA_SCHEMA_VERSION
bump is needed.

World-state is intentionally left unchanged: its merkle trees live in a
native store with no shared TS transaction, so checkpoint tracking cannot be
updated atomically with merkle commits/unwinds without risking drift, and no
consumer reads world-state checkpoint ids for correctness.
…ct store from the tips store

Now that every checkpoint-bearing tip in L2TipsStoreBase resolves its CheckpointId from a per-tip
stored id (getTipCheckpoint/setTipCheckpoint), the block->checkpoint number mapping and the
checkpoint-object store are redundant: they existed only to feed getCheckpointId. Remove both backing
maps from the KV and memory stores along with their abstract methods, and collapse getCheckpointId to
return the genesis id for block 0 and the stored per-tip id otherwise, throwing on a real block with no
recorded id.

The chain-pruned event now carries the source's confirmed checkpointed tip (checkpointed: L2TipId)
instead of a bare CheckpointId. The prune handler clamps any checkpoint-bearing cursor that leads that
tip down to it, always landing on a block with a valid recorded id; there is no longer a genesis-clamp
or mapping lookup. handleChainFinalized prunes only block hashes below the lowest live tip.

This is a breaking storage change (two PXE sub-stores removed); PXE_DATA_SCHEMA_VERSION is bumped so
DatabaseVersionManager wipes pre-existing DBs rather than reading stale layouts.
…ider

The local L2-block-stream tips providers (world-state, l2-tips-store) no longer
carry proposedCheckpoint. In those local stores it is degenerate (always equal to
checkpointed) and no consumer reads it: the only reader, in p2p_client's
maybeCallPrepareForSlot, was a dead branch that was always false because the local
store never leads the checkpointed tip.

L2TipsProvider.getL2Tips now returns LocalL2Tips (Omit<L2Tips, 'proposedCheckpoint'>),
the L2TipsStoreBase stops maintaining the cursor, and the dead p2p reader is removed.
L2BlockSource.getL2Tips is a separate interface and keeps the full L2Tips with
proposedCheckpoint, which the sequencer and node still read from the archiver.
… layout

The L2 tips store dropped the persisted block->checkpoint mapping and now
resolves checkpoint tips exclusively from per-tip checkpoint ids. An upgraded
store that kept its old tips with an empty l2_tip_checkpoints map would make
getL2Tips throw on every read with no way to self-heal, failing P2PClient
startup. Bumping the schema version resets the store on upgrade.
The chain-pruned event carried only the source checkpointed tip, so a prune
that rolled back the proven chain clamped the local proven cursor onto the
checkpointed tip, transiently reporting unproven blocks as proven until the
corrective chain-proven event landed at the end of the same sync iteration.
Carry the source proven tip in the event and clamp each cursor to its own tip.
…tate tips provider

The block stream's local data provider demanded full checkpoint-bearing tips,
forcing world-state to fabricate genesis checkpoint ids and a checkpointed tip
at block zero that violated the finalized <= checkpointed invariant. Narrow the
provider contract to the tips the stream actually reads (checkpointed is only
required when emitting checkpoint events) and have the stream fail loudly if
checkpoint emission is enabled without one. World-state now reports only the
proposed, proven, and finalized blocks it genuinely tracks; the full LocalL2Tips
shape remains on L2TipsProvider for the p2p and pxe tips stores.
…aring events

The CheckpointStore redesign landed on the merge train a prover-node consumer
of chain-pruned written against the old event shape while this branch reshaped
it: chain-pruned now carries checkpointed/proven tips instead of a bare
checkpoint id, and chain-proven/chain-finalized carry checkpoint ids. Read the
pruned checkpoint from event.checkpointed.checkpoint (same semantics as the old
field) and update the test event constructors, including a hash() on the fake
Checkpoint now that the tips store records checkpoint ids on checkpoint events.
@spalladino spalladino force-pushed the spl/fix-tips-checkpoint-ids-fail-loud branch from 32f2b37 to 15cc304 Compare June 10, 2026 17:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants