fix(db): correct stale catalog planner stats + debounce health-check alerts by EmmaLouise2018 · Pull Request #5317 · adcontextprotocol/adcp

EmmaLouise2018 · 2026-06-03T23:41:54Z

Summary

Production database queries on the catalog/registry tables were running for tens of seconds. Root cause: stale query-planner statistics, not capacity.

catalog_properties autoanalyze had never run, so pg_class.reltuples stayed frozen near zero (~185 rows) while the table actually holds 2.27M rows (registry_requests: planner thought 7.4k, actually 362k). With estimates that wrong, Postgres chose nested-loop / sequential-scan plans sized for a tiny table — fine on 185 rows, catastrophic on 2.3M. This made the heavy endpoints (brand enrichment, admin audit, registry reads) slow.

Changes

505_catalog_autovacuum_tuning.sql — per-table autovacuum/analyze tuning (autovacuum_analyze_scale_factor=0.02, threshold=5000, etc.) on catalog_properties, catalog_identifiers, and registry_requests so planner stats can never drift that far again, plus a one-time ANALYZE. Storage-param changes are online/non-blocking; ANALYZE (not VACUUM) keeps the migration transaction-safe.
http.ts — debounce the /health DB probe. A single transient connect timeout during a rolling deploy or Postgres failover no longer pages #admin-errors; alerting escalates only after consecutive failures. The 503 load-balancer response is unchanged.

Operational note

The corrective VACUUM (ANALYZE) and the per-table autovacuum settings were already applied live to the production Managed Postgres cluster while diagnosing (verified: planner now estimates 2,274,834 rows for catalog_properties). This migration persists those settings in code so they survive and apply to any rebuilt database — it is idempotent against the current live state.

Testing

npm run typecheck — passes
tsc --project server/tsconfig.json build — passes
Planner fix verified directly: EXPLAIN row estimate corrected from 185 → 2,274,834.
Public page latency measured healthy server-side (homepage p50 71ms / p95 108ms over 30 reqs); the slowness was isolated to the query-heavy paths the stale stats mis-planned.

…alerts catalog_properties autoanalyze had never run, leaving planner statistics frozen near zero (~185 rows) while the table actually holds 2.27M. The planner chose nested-loop/seq-scan plans sized for a tiny table, so queries that scan the catalog/registry tables (brand enrichment, audit, registry reads) ran for tens of seconds. - Add migration 505: aggressive per-table autovacuum/analyze tuning for catalog_properties, catalog_identifiers, registry_requests so stats can never drift that far again, plus a one-time ANALYZE. - Debounce the /health DB probe: a single transient connect timeout during a rolling deploy or Postgres failover no longer pages the error channel; escalate only after consecutive failures. The 503 response is unchanged.

mintlify · 2026-06-03T23:41:57Z

Preview deployment for your docs. Learn more about Mintlify Previews.

Project	Status	Preview	Updated (UTC)
adcp	🟢 Ready	View Preview	Jun 4, 2026, 12:01 AM

💡 Tip: Enable Workflows to automatically generate PRs for you.

aao-release-bot

Approving. Right diagnosis (stale planner stats, not capacity) and the fix matches the failure: pin a tight reloptions cadence so pg_class.reltuples can never drift four orders of magnitude again, and seed the corrected stats with a one-time ANALYZE.

Things I checked

Migration runs inside BEGIN/COMMIT per server/src/db/migrate.ts:130-142. ALTER TABLE ... SET (reloptions) and ANALYZE are both transaction-safe; VACUUM would not be. The PR description gets this right.
migrate.ts:131 sets SET LOCAL statement_timeout = 0, so the three serial ANALYZEs on the 2.27M-row catalog_properties won't trip pg code 57014 mid-deploy.
Migration 505 is the next sequential number — 504_admin_stripe_customer_update_previews.sql is the current head.
Re-application against the live state is idempotent — both ALTER TABLE ... SET for reloptions and ANALYZE are safe to re-run, so the migration converges with the values already applied to prod.
server/src/http.ts:2249 is correctly demoted from logger.error to logger.warn on the below-threshold branch. That matters: logger.error is what PostHog forwards to the Slack error channel, so the debounce is doing real work, not cosmetic gating around notifySystemError.
503 response is unchanged — every failed probe still pulls the machine out of the LB on probe 1. The debounce only affects when humans get paged, not when traffic gets drained.
Changeset present, patch bump, accurate prose. No wire change.

Follow-ups (non-blocking — file as issues)

Flap suppression. Hard-reset-on-success means a flapping DB (fail/ok/fail/ok) never reaches the threshold and never pages, even while users see intermittent 503s. Defensible given that the 503 already drains the machine and a real cluster-level failover converges to sustained failure quickly, but worth a sliding-window or decaying counter if #admin-errors later goes quiet during a real flap.
Counter races across concurrent probes. Module-level mutable state with two probes mid-flight can over/undercount by ±1. Worst case is paging one probe-interval early or late. Not worth a mutex; flagging only because the math isn't obvious from the diff.

Minor nits (non-blocking)

Split ALTERs from ANALYZEs. The reloptions changes land instantly; the ANALYZEs on three large tables each hold ShareUpdateExclusiveLock for the duration of the sample. Non-blocking for reads/writes, but if autovacuum is mid-cycle on catalog_properties the migration waits behind it. A two-migration split (505 = ALTER, 506 = ANALYZE) would let the durable settings land regardless of what autovacuum is doing. Not worth churning this PR; note for next time.

Notable choice using the migration as code-of-record for an ops change already applied live — that's the right shape for this, since the next rebuilt database otherwise inherits the cluster defaults that caused the incident.

Safe to merge.

EmmaLouise2018 added 2 commits June 3, 2026 19:38

chore: add changeset for catalog planner stats fix

5a90280

aao-release-bot Bot approved these changes Jun 3, 2026

View reviewed changes

mintlify Bot deployed to staging June 4, 2026 00:00 View deployment

EmmaLouise2018 merged commit a746adf into main Jun 4, 2026
16 checks passed

EmmaLouise2018 deleted the EmmaLouise2018/fix-fly-adcp-db branch June 4, 2026 12:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(db): correct stale catalog planner stats + debounce health-check alerts#5317

fix(db): correct stale catalog planner stats + debounce health-check alerts#5317
EmmaLouise2018 merged 2 commits into
mainfrom
EmmaLouise2018/fix-fly-adcp-db

EmmaLouise2018 commented Jun 3, 2026

Uh oh!

mintlify Bot commented Jun 3, 2026 •

edited

Loading

Uh oh!

aao-release-bot Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

EmmaLouise2018 commented Jun 3, 2026

Summary

Changes

Operational note

Testing

Uh oh!

mintlify Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aao-release-bot Bot left a comment

Choose a reason for hiding this comment

Things I checked

Follow-ups (non-blocking — file as issues)

Minor nits (non-blocking)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mintlify Bot commented Jun 3, 2026 •

edited

Loading