Skip to content

fix(db): correct stale catalog planner stats + debounce health-check alerts#5317

Merged
EmmaLouise2018 merged 2 commits into
mainfrom
EmmaLouise2018/fix-fly-adcp-db
Jun 4, 2026
Merged

fix(db): correct stale catalog planner stats + debounce health-check alerts#5317
EmmaLouise2018 merged 2 commits into
mainfrom
EmmaLouise2018/fix-fly-adcp-db

Conversation

@EmmaLouise2018
Copy link
Copy Markdown
Contributor

Summary

Production database queries on the catalog/registry tables were running for tens of seconds. Root cause: stale query-planner statistics, not capacity.

catalog_properties autoanalyze had never run, so pg_class.reltuples stayed frozen near zero (~185 rows) while the table actually holds 2.27M rows (registry_requests: planner thought 7.4k, actually 362k). With estimates that wrong, Postgres chose nested-loop / sequential-scan plans sized for a tiny table — fine on 185 rows, catastrophic on 2.3M. This made the heavy endpoints (brand enrichment, admin audit, registry reads) slow.

Changes

  • 505_catalog_autovacuum_tuning.sql — per-table autovacuum/analyze tuning (autovacuum_analyze_scale_factor=0.02, threshold=5000, etc.) on catalog_properties, catalog_identifiers, and registry_requests so planner stats can never drift that far again, plus a one-time ANALYZE. Storage-param changes are online/non-blocking; ANALYZE (not VACUUM) keeps the migration transaction-safe.
  • http.ts — debounce the /health DB probe. A single transient connect timeout during a rolling deploy or Postgres failover no longer pages #admin-errors; alerting escalates only after consecutive failures. The 503 load-balancer response is unchanged.

Operational note

The corrective VACUUM (ANALYZE) and the per-table autovacuum settings were already applied live to the production Managed Postgres cluster while diagnosing (verified: planner now estimates 2,274,834 rows for catalog_properties). This migration persists those settings in code so they survive and apply to any rebuilt database — it is idempotent against the current live state.

Testing

  • npm run typecheck — passes
  • tsc --project server/tsconfig.json build — passes
  • Planner fix verified directly: EXPLAIN row estimate corrected from 185 → 2,274,834.
  • Public page latency measured healthy server-side (homepage p50 71ms / p95 108ms over 30 reqs); the slowness was isolated to the query-heavy paths the stale stats mis-planned.

…alerts

catalog_properties autoanalyze had never run, leaving planner statistics
frozen near zero (~185 rows) while the table actually holds 2.27M. The
planner chose nested-loop/seq-scan plans sized for a tiny table, so queries
that scan the catalog/registry tables (brand enrichment, audit, registry
reads) ran for tens of seconds.

- Add migration 505: aggressive per-table autovacuum/analyze tuning for
  catalog_properties, catalog_identifiers, registry_requests so stats can
  never drift that far again, plus a one-time ANALYZE.
- Debounce the /health DB probe: a single transient connect timeout during a
  rolling deploy or Postgres failover no longer pages the error channel;
  escalate only after consecutive failures. The 503 response is unchanged.
@mintlify
Copy link
Copy Markdown

mintlify Bot commented Jun 3, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
adcp 🟢 Ready View Preview Jun 4, 2026, 12:01 AM

💡 Tip: Enable Workflows to automatically generate PRs for you.

Copy link
Copy Markdown
Contributor

@aao-release-bot aao-release-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. Right diagnosis (stale planner stats, not capacity) and the fix matches the failure: pin a tight reloptions cadence so pg_class.reltuples can never drift four orders of magnitude again, and seed the corrected stats with a one-time ANALYZE.

Things I checked

  • Migration runs inside BEGIN/COMMIT per server/src/db/migrate.ts:130-142. ALTER TABLE ... SET (reloptions) and ANALYZE are both transaction-safe; VACUUM would not be. The PR description gets this right.
  • migrate.ts:131 sets SET LOCAL statement_timeout = 0, so the three serial ANALYZEs on the 2.27M-row catalog_properties won't trip pg code 57014 mid-deploy.
  • Migration 505 is the next sequential number — 504_admin_stripe_customer_update_previews.sql is the current head.
  • Re-application against the live state is idempotent — both ALTER TABLE ... SET for reloptions and ANALYZE are safe to re-run, so the migration converges with the values already applied to prod.
  • server/src/http.ts:2249 is correctly demoted from logger.error to logger.warn on the below-threshold branch. That matters: logger.error is what PostHog forwards to the Slack error channel, so the debounce is doing real work, not cosmetic gating around notifySystemError.
  • 503 response is unchanged — every failed probe still pulls the machine out of the LB on probe 1. The debounce only affects when humans get paged, not when traffic gets drained.
  • Changeset present, patch bump, accurate prose. No wire change.

Follow-ups (non-blocking — file as issues)

  • Flap suppression. Hard-reset-on-success means a flapping DB (fail/ok/fail/ok) never reaches the threshold and never pages, even while users see intermittent 503s. Defensible given that the 503 already drains the machine and a real cluster-level failover converges to sustained failure quickly, but worth a sliding-window or decaying counter if #admin-errors later goes quiet during a real flap.
  • Counter races across concurrent probes. Module-level mutable state with two probes mid-flight can over/undercount by ±1. Worst case is paging one probe-interval early or late. Not worth a mutex; flagging only because the math isn't obvious from the diff.

Minor nits (non-blocking)

  1. Split ALTERs from ANALYZEs. The reloptions changes land instantly; the ANALYZEs on three large tables each hold ShareUpdateExclusiveLock for the duration of the sample. Non-blocking for reads/writes, but if autovacuum is mid-cycle on catalog_properties the migration waits behind it. A two-migration split (505 = ALTER, 506 = ANALYZE) would let the durable settings land regardless of what autovacuum is doing. Not worth churning this PR; note for next time.

Notable choice using the migration as code-of-record for an ops change already applied live — that's the right shape for this, since the next rebuilt database otherwise inherits the cluster defaults that caused the incident.

Safe to merge.

@EmmaLouise2018 EmmaLouise2018 merged commit a746adf into main Jun 4, 2026
16 checks passed
@EmmaLouise2018 EmmaLouise2018 deleted the EmmaLouise2018/fix-fly-adcp-db branch June 4, 2026 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant