feat(check): add radius-decrease liveness check for puller manage() recovery #581

Open
misaakidis wants to merge 2 commits into master from ci/radius-decrease-check

@misaakidis
Member

Summary

  • Adds pkg/check/radiusdecrease — a three-phase beekeeper check that validates puller worker liveness after a storage-radius decrease
  • Registers the radius-decrease check type in pkg/config/check.go
  • Adds a ci-radius-decrease entry to config/local.yaml (timeout: 28m, recovery-timeout: 20m)

Background

When the reserve worker decreases the storage radius it calls manage(), which calls disconnectPeer() for every current peer. If disconnectPeer() blocks while holding syncPeersMtx, manage() freezes indefinitely. This was the root cause of the 14-hour liveness degradation on Gnosis Chain at block 43086913 (depth 10→9 transition).
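
For illustration only, the hazard above belongs to a general class: a blocking call made while a mutex is held stalls everything else that needs that mutex. The sketch below shows one common way to avoid it, by snapshotting the peers under the lock and disconnecting after the lock is released. It is not the change made in the bee fix branch; `peerSet`, `drain`, and the `disconnect` callback are hypothetical names, and only the idea of a mutex-guarded peer table is taken from the description.

```go
package main

import (
	"fmt"
	"sync"
)

// peerSet is a hypothetical stand-in for a mutex-guarded peer table.
type peerSet struct {
	mu    sync.Mutex
	peers map[string]struct{}
}

// drain removes every peer while holding the lock, releases the lock,
// and only then calls disconnect for each removed peer. A slow or
// blocking disconnect therefore cannot stall other users of the lock.
func (p *peerSet) drain(disconnect func(addr string)) int {
	p.mu.Lock()
	addrs := make([]string, 0, len(p.peers))
	for addr := range p.peers {
		addrs = append(addrs, addr)
		delete(p.peers, addr) // deleting during range is safe in Go
	}
	p.mu.Unlock()

	for _, addr := range addrs {
		disconnect(addr)
	}
	return len(addrs)
}

func main() {
	ps := &peerSet{peers: map[string]struct{}{"a": {}, "b": {}}}
	n := ps.drain(func(addr string) {
		// The lock is free while disconnect runs, so TryLock succeeds.
		if ps.mu.TryLock() {
			ps.mu.Unlock()
		}
	})
	fmt.Println("disconnected", n, "peers")
}
```

The trade-off in this pattern is that the peer table is already empty while the disconnects are still in flight, so any caller that needs "removed and disconnected" to be atomic would need a different design.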

The check triggers the cascade in CI using a patched bee binary (see bee .github/patches/radius_decrease_*.patch) and verifies that PullsyncRate > 0 returns within 20 minutes of the radius decrease.

Test plan

  • Beekeeper check compiles (go build ./...)
  • CI runs on the bee PR fix/puller-disconnect-backoff: the radius-decrease job passes on the fix branch
  • Once bee PR merges, revert BEEKEEPER_BRANCH to "master" in the bee workflow

Always passing a non-nil ProcMount pointer (even when the value is "")
causes k3s to reject the security context, resulting in container
termination shortly after start. Guard the field so it is only set
when a non-empty ProcMount type is configured.
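
A minimal sketch of the ProcMount guard, assuming the value arrives as a plain string option. `ProcMountType` and `SecurityContext` below are local stand-ins that mirror the pointer-typed ProcMount field of the Kubernetes core/v1 SecurityContext, so the example builds without external modules; `buildSecurityContext` is a hypothetical helper, not the function changed in this commit.

```go
package main

import "fmt"

// ProcMountType and SecurityContext are local stand-ins for the
// Kubernetes core/v1 types of the same names (assumption: only the
// ProcMount field matters for this example).
type ProcMountType string

type SecurityContext struct {
	ProcMount *ProcMountType
}

// buildSecurityContext sets ProcMount only when a non-empty value is
// configured, leaving the pointer nil otherwise.
func buildSecurityContext(procMount string) SecurityContext {
	sc := SecurityContext{}
	if procMount != "" {
		pm := ProcMountType(procMount)
		sc.ProcMount = &pm
	}
	return sc
}

func main() {
	unset := buildSecurityContext("")
	fmt.Println("nil when unset:", unset.ProcMount == nil)

	set := buildSecurityContext("Unmasked")
	fmt.Println("configured value:", *set.ProcMount)
}
```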

Also bypass the traefik.containo.us IngressRoute code path; clusters
that ship Traefik without the legacy CRDs would fail at object creation.