WIP:chore: bump morph + geth to derivation-batch-verify; align hoodi snap…#92

Open
curryxbo wants to merge 2 commits into main from feat/hoodi-binary-batch-verify

@curryxbo
Contributor

…shot

Bump morph submodule to feat/derivation-batch-verify tip (d27d088c) and go-ethereum submodule to morph-v2.2.2 (045be0fd) so the new self-verifying node implementation (SPEC-005) builds. The new morphnode no longer accepts the --validator flag; every node starts the derivation pipeline by default.

Also align morph-node/.env_hoodi to snapshot-20260509-1 (the freshest hoodi MPT snapshot listed in README.md) so 'make run-hoodi-node-binary' boots against fresh state.

This is a hand-off branch for QA testing of the hoodi binary path. Docker, mainnet env, README, and the legacy validator/zk Make targets are intentionally untouched in this round.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@curryxbo force-pushed the feat/hoodi-binary-batch-verify branch from d97ec2e to 23cbed0 on May 14, 2026 09:23
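Test the derivation and tag management features of the morph feat/derivation-batch-verify branch

Key findings:
1. Path A requires the beacon chain API, which the current L1 RPC does not support
2. Path B was configured successfully, but hit a mismatch between the snapshot data and the configuration
3. Path B lacks a wait mechanism: when a block is missing it stops silently instead of waiting for sync
4. Tag management is implemented correctly, but it can only advance once batch verification completes

Configuration changes:
- Added NODE_EXTRA_FLAGS="--derivation.verify-mode=pathB" to enable Path B mode

Detailed test report: docs/test-derivation-batch-verify-20260514.md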
curryxbo pushed a commit to morph-l2/morph that referenced this pull request May 14, 2026
QA hit a "Path B silently stops" symptom on hoodi against an old snapshot:
local L2 latest 4,470,254 lagged the L1-committed batch's required
range 5,279,569-5,279,890 by ~800k blocks, but the only visible log
was `path B fetched batch metadata`; no error surfaced for tens of
minutes.

Root cause: retryableError() classified every error except a literal
"discontinuous block number" string as retryable, so RetryableClient
kept exponentially backing off ethereum.NotFound for the full 30-minute
MaxElapsedTime budget. SPEC-005 Path B is the first caller to read L2
blocks the local node may not yet have sealed; older callers
(AssembleL2Block / NewSafeL2Block / sequencer paths) only ever read
known-existing blocks, masking the issue.

Treat ethereum.NotFound as permanent so it escapes the backoff loop on
the first attempt:

- retryableError() short-circuits on errors.Is(err, ethereum.NotFound)
  (handles fmt.Errorf wrapping too).
- HeaderByNumber / BlockByNumber log Info on retryable failures (still
  transient chatter) and Error on the non-retryable escape path so the
  signal is visible even when the caller layer's logging is filtered.

Net effect for the QA scenario: BlockByNumber returns NotFound to
verify_path_b on the first attempt; verify_path_b returns
"path B: read local block N failed: not found"; derivation.go logs
"path B content verification failed" Error and the next pollInterval
re-evaluates. The operator immediately sees the local-height gap
instead of staring at a silent log.
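
A minimal sketch of the classification described above, not the actual node/types code. It assumes a local `errNotFound` sentinel standing in for ethereum.NotFound so the example builds with the standard library only; `errDiscontinuous`, `errTransient`, `wrapRead`, and the nil handling are hypothetical illustrations.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNotFound is a local stand-in for ethereum.NotFound (assumption: the
// real code compares against the go-ethereum sentinel instead).
var errNotFound = errors.New("not found")

// errDiscontinuous carries the literal string that was already treated
// as non-retryable before this change.
var errDiscontinuous = errors.New("discontinuous block number")

// errTransient is a hypothetical transient failure.
var errTransient = errors.New("connection reset by peer")

// retryableError reports whether the backoff loop should try again.
func retryableError(err error) bool {
	if err == nil {
		return false // assumption: nothing to retry on success
	}
	// errors.Is walks the %w chain, so a wrapped NotFound is permanent too.
	if errors.Is(err, errNotFound) {
		return false
	}
	if strings.Contains(err.Error(), errDiscontinuous.Error()) {
		return false
	}
	return true
}

// wrapRead mimics a caller adding context with fmt.Errorf and %w.
func wrapRead(n uint64, err error) error {
	return fmt.Errorf("path B: read local block %d failed: %w", n, err)
}

func main() {
	cases := []error{errNotFound, wrapRead(42, errNotFound), errDiscontinuous, errTransient}
	for _, err := range cases {
		fmt.Printf("%v -> retryable=%v\n", err, retryableError(err))
	}
}
```

With this shape, a missing local block escapes the backoff loop on the first attempt, while errors outside the two permanent cases keep the existing retry behavior.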

Adds node/types/retryable_client_test.go covering NotFound (direct +
wrapped), DiscontinuousBlockError, and generic transient errors.

go build ./node/types/ ./node/derivation/ -- clean.
go test ./node/types/... ./node/derivation/... -count=1 -- PASS
(3 new in types, 22 in derivation).

Refs: morph-l2/run-morph-node#92 testing report (2026-05-14).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
curryxbo pushed a commit to morph-l2/morph that referenced this pull request May 14, 2026
When SPEC-005 Path B verification fails — particularly with a versioned
hash mismatch — operators previously had to grep across logs and hand-
fetch L1 calldata to reconstruct the batch shape that produced the
mismatch. Add a single structured Error log at every Path B fail point
so the relevant fields are present without re-running anything:

  kind: invalid_block_range / empty_blob_hashes / local_block_missing
        / local_block_read_error / parsing_txs_error / compress_error
        / sidecar_build_error / blob_count_mismatch / versioned_hash_mismatch
  always: batchIndex, version, firstBlock, lastBlock,
          parentTotalL1Popped, expectedBlobs
  per-site: blockNumber (block-level errors); encoding, payloadLen,
            compressedLen, rebuiltBlobs, rebuiltHashes,
            expectedHashes, mismatchIndex (encoding / hash errors)

The pathBFail helper centralises the log + metric increment + error
wrap so call sites stay one-liners, and the existing
"path B fetched batch metadata" entry log is enriched with batchIndex,
version, parentTotalL1Popped, expectedBlobs so an operator can spot
abnormal entry conditions without waiting for a failure.

New metric:
  derivation_path_b_failed_by_kind_total{kind="..."}
incremented alongside the unlabelled path_b_failed_total via
IncPathBFailedKind so dashboards can split failures by category.

Cost: zero on the success path; the diagnostic computation (slice
lengths, hex CSV of <= 6 hashes) only runs at fail points.
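
A rough sketch of how a pathBFail-style helper can bundle the log line, the per-kind counter, and the error wrap. It uses the standard `log` package and an in-memory counter as stand-ins for the node's logger and the Prometheus metric; `failCounters`, `batchFields`, the signature, and the sample values are assumptions, not the real implementation.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// failCounters stands in for path_b_failed_total and
// derivation_path_b_failed_by_kind_total{kind="..."} (hypothetical type).
type failCounters struct {
	mu     sync.Mutex
	total  int
	byKind map[string]int
}

func newFailCounters() *failCounters {
	return &failCounters{byKind: make(map[string]int)}
}

func (c *failCounters) inc(kind string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.total++
	c.byKind[kind]++
}

// batchFields holds the fields the commit lists under "always".
type batchFields struct {
	batchIndex          uint64
	version             uint8
	firstBlock          uint64
	lastBlock           uint64
	parentTotalL1Popped uint64
	expectedBlobs       int
}

// errHashMismatch is a hypothetical cause used for the demo.
var errHashMismatch = errors.New("versioned hash mismatch")

// pathBFail logs once, bumps both counters, and wraps the cause, so each
// call site stays a one-liner. perSite carries the optional per-site fields.
func pathBFail(c *failCounters, b batchFields, kind string, cause error, perSite ...any) error {
	c.inc(kind)
	log.Printf("ERROR path B verification failed kind=%s batchIndex=%d version=%d "+
		"firstBlock=%d lastBlock=%d parentTotalL1Popped=%d expectedBlobs=%d perSite=%v err=%v",
		kind, b.batchIndex, b.version, b.firstBlock, b.lastBlock,
		b.parentTotalL1Popped, b.expectedBlobs, perSite, cause)
	return fmt.Errorf("path B: %s: %w", kind, cause)
}

func main() {
	c := newFailCounters()
	b := batchFields{batchIndex: 1, version: 1, firstBlock: 100, lastBlock: 120, expectedBlobs: 1}
	err := pathBFail(c, b, "versioned_hash_mismatch", errHashMismatch, "mismatchIndex", 0)
	fmt.Println(err, c.byKind)
}
```

Because the diagnostic fields are only formatted inside the helper, the success path pays nothing for them.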

go test ./node/derivation/... -- 22 cases PASS.

Refs: morph-l2/run-morph-node#92 (operator request: don't make us
hand-roll a one-shot script every time Path B fails).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
curryxbo pushed a commit to morph-l2/morph that referenced this pull request May 15, 2026
…ant, not Version byte

QA hit a versioned hash mismatch on hoodi (commit txHash 0x763f5f76...,
batchIndex 17367) with rebuilt=0x015a6d... vs expected=0x018577... (both
valid EIP-4844 hashes, just different bytes). Tracing the sequencer
side end to end shows Path B's V1/V2 dispatch is keyed on the wrong
field:

Sequencer's actual encoding decision (chain of code):

  1. tx-submitter passes isBatchUpgraded=nil to NewBatchCache
     (tx-submitter/services/rollup.go:127-138).
  2. NewBatchCache defaults nil to `func(uint64) bool { return true }`
     (common/batch/batch_cache.go:102-104), so isBatchUpgraded is
     effectively always true for any live sequencer.
  3. handleBatchSealing always enters the V2 branch first and uses
     TxsPayloadV2 whenever the compressed result fits in sealBlobCap
     (batch_cache.go:787-829). The V1 fallback only triggers when V2
     overflows AND isBatchV2Upgraded(ts) is still false, which is rare
     for normal-sized batches.
  4. createBatchHeader stamps the version byte from isBatchV2Upgraded
     alone (batch_cache.go:918-934): before that governance flag flips,
     version=1 even when the payload is V2-encoded.
  5. The new commitBatch ABI (rollup.go:1128-1136) does not carry
     BlockContexts in calldata, so the blob payload MUST be V2-encoded
     for the chain history to be reconstructable.

Path A already keys off `batch.BlockContexts != nil` from calldata
(batch_info.go::ParseBatch), which is the correct discriminator. Path B
keyed off `batchInfo.version >= 2`, treating every version=1 batch as
V1-encoded — exactly the failure surfaced by QA on hoodi during the
V1->V2 transition window.

Fix:

- BatchInfo gains hasCalldataBlockContexts, set in
  ParseBatchMetadataOnly to len(batch.BlockContexts) > 0. (Field doc
  spells out why version byte is wrong here.)
- verifyPathBContent dispatches on hasCalldataBlockContexts:
    true  -> TxsPayload   (legacy ABI: blob = txs only)
    false -> TxsPayloadV2 (new ABI:    blob = blockContexts || txs)
  The previous `version >= 2` branch is gone.
- pathBFail structured log adds hasCalldataBlockContexts so future
  diagnoses see the dispatch input directly.
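
The dispatch change can be illustrated with a small sketch. The `batchInfo` struct, the `payloadEncoding` type, and both function names are hypothetical; only the decision rule (dispatch on hasCalldataBlockContexts instead of `version >= 2`) comes from the description above.

```go
package main

import "fmt"

// payloadEncoding names which blob payload builder Path B should use.
type payloadEncoding int

const (
	encodingLegacy payloadEncoding = iota // TxsPayload:   blob = txs only
	encodingV2                            // TxsPayloadV2: blob = blockContexts || txs
)

func (e payloadEncoding) String() string {
	if e == encodingV2 {
		return "TxsPayloadV2"
	}
	return "TxsPayload"
}

// batchInfo is a trimmed, hypothetical view of the fields that matter here.
type batchInfo struct {
	version                  uint8
	hasCalldataBlockContexts bool
}

// chooseEncoding follows the fix: calldata that carries block contexts
// means the legacy ABI, otherwise the blob must be V2-encoded.
func chooseEncoding(b batchInfo) payloadEncoding {
	if b.hasCalldataBlockContexts {
		return encodingLegacy
	}
	return encodingV2
}

// priorChooseEncoding reproduces the old version-byte dispatch for comparison.
func priorChooseEncoding(b batchInfo) payloadEncoding {
	if b.version >= 2 {
		return encodingV2
	}
	return encodingLegacy
}

func main() {
	// The QA case: version byte still 1, but the batch used the new ABI.
	qa := batchInfo{version: 1, hasCalldataBlockContexts: false}
	fmt.Println("fixed dispatch:", chooseEncoding(qa))
	fmt.Println("prior dispatch:", priorChooseEncoding(qa))
}
```

For the QA input the two functions disagree, which is the mismatch that surfaced as differing versioned hashes.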

Tests:

- Renamed RoundTripOK_V1/V2 to RoundTripOK_LegacyABI/NewABI and
  switched the oracle's parameter from `version` to `useV2Encoding`.
- Added TestPathB_VersionByte1_NewABI_UsesV2Encoding as a direct
  regression for the QA case (version=1 + new ABI -> blob=V2). This
  test fails on the prior dispatch and passes on the fix.

go build ./node/derivation/ -- clean.
go test ./node/derivation/... -count=1 -- 23 cases PASS (was 22; +1 regression).

Refs: morph-l2/run-morph-node#92 (hoodi hash mismatch report,
2026-05-15 09:38).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
curryxbo pushed a commit to morph-l2/morph that referenced this pull request May 15, 2026
…ap, not Rollup.BatchDataStore

QA observed finalized stuck on hoodi while safe kept advancing. Cast
queries against the rollup contract showed:

  lastCommittedBatchIndex@L1Finalized = 17797
  batchDataStore(17797).blockNumber  = 5418036
  batchDataStore(17796) = 0,0,0,0
  batchDataStore(17389) = 0,0,0,0

The contract intentionally clears BatchDataStore storage for older
batches as part of its on-chain GC; only the latest committed batch's
record stays populated. The finalizer's
`Rollup.BatchDataStore(candidate)` lookup therefore returns zero for
any candidate that isn't the very newest, the existing zero-guard
skips advancement, and finalized never moves while node logs
`finalizer: batch has zero lastL2Block; skipping`.

The discriminator-source fix: tagAdvancer is the right place to hold
the (batchIndex -> header) mapping because advanceSafe is already
called once per verified batch with the header in hand. Move the
lookup off-chain entirely:

- tagAdvancer gains verifiedBatches map[uint64]*eth.Header, populated
  inside advanceSafe alongside safeMaxBatchIndex.
- New LookupVerifiedBatchHeader(batchIndex) replaces
  finalizer.lookupBatchLastL2Block + the contract call.
- advanceFinalized evicts map entries <= the new finalized index,
  keeping the map bounded by the steady-state safe-vs-finalized lag.
- reset (L1 reorg) clears the map: pre-reorg entries aren't
  authoritative against the new L1 view; derivation refills naturally
  as it walks the rewound cursor.
- finalizer.tick: 4 RPC calls -> 2 (drop BatchDataStore +
  HeaderByNumber); the L2 client / lookupBatchLastL2Block helper /
  zero-BlockNumber defensive guard are gone since none are reachable
  anymore. newFinalizer no longer takes l2Client.
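
The map lifecycle in the list above (populate on advanceSafe, evict on advanceFinalized, clear on reset) might look roughly like the following. The `header` struct is a stand-in for *eth.Header, and the method signatures are simplified assumptions rather than the real tagAdvancer API.

```go
package main

import (
	"fmt"
	"sync"
)

// header is a minimal stand-in for *eth.Header (assumption).
type header struct {
	number uint64
	hash   string
}

type tagAdvancer struct {
	mu                sync.Mutex
	safeMaxBatchIndex uint64
	verifiedBatches   map[uint64]*header
}

func newTagAdvancer() *tagAdvancer {
	return &tagAdvancer{verifiedBatches: make(map[uint64]*header)}
}

// advanceSafe records the header of each verified batch.
func (t *tagAdvancer) advanceSafe(batchIndex uint64, h *header) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if batchIndex > t.safeMaxBatchIndex {
		t.safeMaxBatchIndex = batchIndex
	}
	t.verifiedBatches[batchIndex] = h
}

// lookupVerifiedBatchHeader replaces the on-chain lookup with a local read.
func (t *tagAdvancer) lookupVerifiedBatchHeader(batchIndex uint64) (*header, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	h, ok := t.verifiedBatches[batchIndex]
	return h, ok
}

// advanceFinalized evicts entries at or below the new finalized index,
// which keeps the map bounded by the safe-vs-finalized lag.
func (t *tagAdvancer) advanceFinalized(finalizedIndex uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for idx := range t.verifiedBatches {
		if idx <= finalizedIndex {
			delete(t.verifiedBatches, idx) // deleting during range is safe in Go
		}
	}
}

// reset drops everything after an L1 reorg; derivation refills the map.
func (t *tagAdvancer) reset() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.safeMaxBatchIndex = 0
	t.verifiedBatches = make(map[uint64]*header)
}

func main() {
	adv := newTagAdvancer()
	for i := uint64(1); i <= 3; i++ {
		adv.advanceSafe(i, &header{number: i * 10, hash: fmt.Sprintf("0x%02x", i)})
	}
	adv.advanceFinalized(2)
	_, ok := adv.lookupVerifiedBatchHeader(2)
	fmt.Println("batch 2 still present:", ok)
}
```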

Restart behavior is unchanged: map starts empty; first finalizer ticks
log
  finalizer: verified batch header not found in local map; will retry
until derivation has re-verified up to a candidate that intersects the
new map. Same outcome, clearer signal -- and it doesn't depend on
contract state retention.

Tests:
- TestTagAdvance_VerifiedBatchLookup: roundtrip after advanceSafe.
- TestTagAdvance_VerifiedBatchEvictedOnFinalize: entries <= finalized
  are dropped, entries > finalized retained.
- TestTagAdvance_VerifiedBatchClearedOnReset: L1 reorg wipes the map.

Spec impact: tech-design.md §4.7.4's finalizer description still names
"Rollup.BatchDataStore" as the lookup source. That sentence needs an
update in morph-specs to "tagAdvancer's local verifiedBatches map";
not blocking the implementation PR.

go build ./node/derivation/ -- clean.
go test ./node/derivation/... -count=1 -- 26 cases PASS (was 23; +3 lookup/eviction/reset).

Refs: morph-l2/run-morph-node#92 (hoodi: finalized stuck while safe
advances; finalizer: batch has zero lastL2Block; skipping batchIndex=17394).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
curryxbo pushed a commit to morph-l2/morph that referenced this pull request May 15, 2026
…k@finalizedBlock), not BatchDataStore reverse-lookup

QA's finalized-stuck observation on hoodi traced to two issues:

  1. Rollup.batchDataStore is on-chain GC'd (Rollup.sol:665 deletes
     batchDataStore[_batchIndex - 1] inside finalizeBatch), so the
     previous design's `min(committedAtFin, safeMaxBatchIndex)` candidate
     would frequently land on an older batchIndex whose storage slot
     had been cleared, returning zero and triggering the defensive skip.
  2. The previous fix (verifiedBatches map in tagAdvancer) compensated
     by holding a local batchIndex -> header map. That worked but was
     more machinery than needed: the contract retains
     batchDataStore[committedAtFin] at any given L1 block, because
     committedAtFin >= lastFinalizedBatchIndex@thatBlock, which is
     above the GC threshold at that block's state.

The cleaner fix is to pin both rollup queries to the L1 finalized block
and operate on L2 block NUMBERS, not batchIndex round-trips:

  L1FinalizedLastBlock = batchDataStore[committedAtFin]@finalizedBlock.blockNumber
  finalized.blockNumber = min(localSafe.number, L1FinalizedLastBlock)

For the common case (L1FinalizedLastBlock >= localSafe.number, default
Confirmations=finalized derivation steady-state) we anchor finalized to
local safe directly: hash + number are already in tagAdvancer memory.
For the other case (operator set Confirmations < finalized so derivation
ran ahead of L1 finalized) we anchor to L1FinalizedLastBlock and pull
the L2 hash via l2Client.HeaderByNumber (the block is local because
L1FinalizedLastBlock < localSafe.number and we verified up to localSafe).
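
As an illustration of the min() rule and its two branches, here is a small sketch. The `anchor` type and `chooseFinalizedAnchor` name are hypothetical, and the local safe numbers in `main` are made up; 5418036 is the batchDataStore(17797).blockNumber value quoted in this thread.

```go
package main

import "fmt"

// anchor describes where the finalized tag should point and whether the
// hash is already in memory (local safe) or must be fetched from the L2 client.
type anchor struct {
	number        uint64
	fromLocalSafe bool
}

// chooseFinalizedAnchor implements
//   finalized.blockNumber = min(localSafe.number, L1FinalizedLastBlock)
// and reports which side won.
func chooseFinalizedAnchor(localSafeNumber, l1FinalizedLastBlock uint64) anchor {
	if l1FinalizedLastBlock >= localSafeNumber {
		// Common case: hash and number are already held by tagAdvancer.
		return anchor{number: localSafeNumber, fromLocalSafe: true}
	}
	// Rare case: derivation ran ahead of L1 finalized, so the caller
	// needs a HeaderByNumber call to get the L2 hash.
	return anchor{number: l1FinalizedLastBlock, fromLocalSafe: false}
}

func main() {
	const l1FinalizedLastBlock = 5418036
	fmt.Printf("%+v\n", chooseFinalizedAnchor(5400000, l1FinalizedLastBlock))
	fmt.Printf("%+v\n", chooseFinalizedAnchor(5420000, l1FinalizedLastBlock))
}
```

Working on block numbers this way avoids any batchIndex lookup against storage slots the contract may already have cleared.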

Plus a defensive canonicality check: before advancing finalized, re-read
HeaderByNumber(safeNumber) against the L2 client and require the hash
to still equal tagAdvancer.safeL2Hash. On mismatch (L2 client state
divergence; or, future, an L1 reorg whose detection hasn't yet
re-synced the tag advancer) we skip the advance and reset tagAdvancer
to force re-verification rather than finalizing a stale safe.
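
A hedged sketch of the canonicality check follows. The `l2HashReader` interface, `fakeL2` client, and `safeTag` type are hypothetical stand-ins for the L2 client and tagAdvancer; how a read error is handled is also an assumption, since the description only covers the mismatch case.

```go
package main

import (
	"errors"
	"fmt"
)

// l2HashReader is a hypothetical slice of the L2 client: block number in,
// canonical block hash out.
type l2HashReader interface {
	HashByNumber(number uint64) (string, error)
}

// fakeL2 is an in-memory client used only for this example.
type fakeL2 map[uint64]string

func (f fakeL2) HashByNumber(n uint64) (string, error) {
	h, ok := f[n]
	if !ok {
		return "", errors.New("not found")
	}
	return h, nil
}

// safeTag mirrors the (safeL2Hash, safeL2Number) pair held by tagAdvancer.
type safeTag struct {
	hash   string
	number uint64
	valid  bool
}

// reset forces re-verification by dropping the current safe tag.
func (s *safeTag) reset() { *s = safeTag{} }

// safeStillCanonical re-reads the block at the safe number and compares
// hashes. On mismatch it resets the tag and reports false so the caller
// skips the finalized advance. On a read error it leaves the tag untouched
// and returns the error (assumption).
func safeStillCanonical(r l2HashReader, s *safeTag) (bool, error) {
	got, err := r.HashByNumber(s.number)
	if err != nil {
		return false, err
	}
	if got != s.hash {
		s.reset()
		return false, nil
	}
	return true, nil
}

func main() {
	l2 := fakeL2{100: "0xaa"}
	tag := &safeTag{hash: "0xaa", number: 100, valid: true}
	ok, err := safeStillCanonical(l2, tag)
	fmt.Println(ok, err)

	l2[100] = "0xbb" // simulate L2 client state divergence
	ok, err = safeStillCanonical(l2, tag)
	fmt.Println(ok, err, tag.valid)
}
```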

Reverts the verifiedBatches map / LookupVerifiedBatchHeader / per-finalize
eviction added in c20983d -- not needed once we stop reverse-looking
batchIndex against the contract.

Changes:

- tagAdvancer.advanceFinalized signature: (ctx, batchIndex, *eth.Header)
  -> (ctx, batchIndex, hash, number). The "anchor to local safe" branch
  has hash + number directly without fabricating a synthetic header.
- tagAdvancer.Safe() new getter returns (safeL2Hash, safeL2Number) under
  mutex for atomic read by finalizer.
- finalizer.tick rewritten: 1 L1 RPC + 2 L1 contract calls (both pinned
  to L1 finalized) + 1 L2 RPC for the canonicality check, plus a second
  L2 RPC only for the rare safeNum > L1FinalizedLastBlock branch.
- finalizer struct keeps l2Client (needed for canonicality check + the
  rare-branch header fetch); newFinalizer signature unchanged from
  pre-c20983d4 era (l2Client back in).
- BatchDataStore zero-blockNumber defensive guard remains as a sanity
  fallback even though it should never fire under the pinned-query
  design (committed at finalized always > GC threshold at that block).
- Drops the 3 verifiedBatches lookup/eviction/reset tests; replaces
  with a single TestTagAdvance_SafeGetter covering the new snapshot.

Spec impact: tech-design.md §4.7.4's lookup phrasing changes from "look
up batch lastL2Block via Rollup.BatchDataStore" to "compare local safe
number against L1FinalizedLastBlock derived from the latest committed
batch at L1 finalized; anchor finalized to whichever is smaller". I'll
update morph-specs in a follow-up doc PR (c20983d's commit message
already promised this update; the new phrasing replaces it).

go build ./node/derivation/ -- clean.
go test ./node/derivation/... -count=1 -- 23 cases PASS (was 26 with
verifiedBatches tests; -3 dropped, +0 net since we replaced all 3
with TestTagAdvance_SafeGetter and the old finalizer/lookup tests
covered the same code paths).

Refs: morph-l2/run-morph-node#92 (hoodi: finalized stuck while safe
advances; node.log "finalizer: batch has zero lastL2Block; skipping
batchIndex=17394"; cast batchDataStore(17389) = 0,0,0,0,
batchDataStore(17797).blockNumber = 5418036).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>