fix(db): correct stale catalog planner stats + debounce health-check alerts#5317
Merged
Conversation
…alerts catalog_properties autoanalyze had never run, leaving planner statistics frozen near zero (~185 rows) while the table actually holds 2.27M. The planner chose nested-loop/seq-scan plans sized for a tiny table, so queries that scan the catalog/registry tables (brand enrichment, audit, registry reads) ran for tens of seconds. - Add migration 505: aggressive per-table autovacuum/analyze tuning for catalog_properties, catalog_identifiers, registry_requests so stats can never drift that far again, plus a one-time ANALYZE. - Debounce the /health DB probe: a single transient connect timeout during a rolling deploy or Postgres failover no longer pages the error channel; escalate only after consecutive failures. The 503 response is unchanged.
|
Preview deployment for your docs. Learn more about Mintlify Previews.
💡 Tip: Enable Workflows to automatically generate PRs for you. |
Contributor
There was a problem hiding this comment.
Approving. Right diagnosis (stale planner stats, not capacity) and the fix matches the failure: pin a tight reloptions cadence so pg_class.reltuples can never drift four orders of magnitude again, and seed the corrected stats with a one-time ANALYZE.
Things I checked
- Migration runs inside
BEGIN/COMMITperserver/src/db/migrate.ts:130-142.ALTER TABLE ... SET (reloptions)andANALYZEare both transaction-safe;VACUUMwould not be. The PR description gets this right. migrate.ts:131setsSET LOCAL statement_timeout = 0, so the three serialANALYZEs on the 2.27M-rowcatalog_propertieswon't trippg code 57014mid-deploy.- Migration 505 is the next sequential number —
504_admin_stripe_customer_update_previews.sqlis the current head. - Re-application against the live state is idempotent — both
ALTER TABLE ... SETfor reloptions andANALYZEare safe to re-run, so the migration converges with the values already applied to prod. server/src/http.ts:2249is correctly demoted fromlogger.errortologger.warnon the below-threshold branch. That matters:logger.erroris what PostHog forwards to the Slack error channel, so the debounce is doing real work, not cosmetic gating aroundnotifySystemError.- 503 response is unchanged — every failed probe still pulls the machine out of the LB on probe 1. The debounce only affects when humans get paged, not when traffic gets drained.
- Changeset present,
patchbump, accurate prose. No wire change.
Follow-ups (non-blocking — file as issues)
- Flap suppression. Hard-reset-on-success means a flapping DB (fail/ok/fail/ok) never reaches the threshold and never pages, even while users see intermittent 503s. Defensible given that the 503 already drains the machine and a real cluster-level failover converges to sustained failure quickly, but worth a sliding-window or decaying counter if
#admin-errorslater goes quiet during a real flap. - Counter races across concurrent probes. Module-level mutable state with two probes mid-flight can over/undercount by ±1. Worst case is paging one probe-interval early or late. Not worth a mutex; flagging only because the math isn't obvious from the diff.
Minor nits (non-blocking)
- Split
ALTERs fromANALYZEs. The reloptions changes land instantly; theANALYZEs on three large tables each holdShareUpdateExclusiveLockfor the duration of the sample. Non-blocking for reads/writes, but if autovacuum is mid-cycle oncatalog_propertiesthe migration waits behind it. A two-migration split (505 =ALTER, 506 =ANALYZE) would let the durable settings land regardless of what autovacuum is doing. Not worth churning this PR; note for next time.
Notable choice using the migration as code-of-record for an ops change already applied live — that's the right shape for this, since the next rebuilt database otherwise inherits the cluster defaults that caused the incident.
Safe to merge.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Production database queries on the catalog/registry tables were running for tens of seconds. Root cause: stale query-planner statistics, not capacity.
catalog_propertiesautoanalyze had never run, sopg_class.reltuplesstayed frozen near zero (~185 rows) while the table actually holds 2.27M rows (registry_requests: planner thought 7.4k, actually 362k). With estimates that wrong, Postgres chose nested-loop / sequential-scan plans sized for a tiny table — fine on 185 rows, catastrophic on 2.3M. This made the heavy endpoints (brand enrichment, admin audit, registry reads) slow.Changes
505_catalog_autovacuum_tuning.sql— per-table autovacuum/analyze tuning (autovacuum_analyze_scale_factor=0.02,threshold=5000, etc.) oncatalog_properties,catalog_identifiers, andregistry_requestsso planner stats can never drift that far again, plus a one-timeANALYZE. Storage-param changes are online/non-blocking;ANALYZE(notVACUUM) keeps the migration transaction-safe.http.ts— debounce the/healthDB probe. A single transient connect timeout during a rolling deploy or Postgres failover no longer pages#admin-errors; alerting escalates only after consecutive failures. The 503 load-balancer response is unchanged.Operational note
The corrective
VACUUM (ANALYZE)and the per-table autovacuum settings were already applied live to the production Managed Postgres cluster while diagnosing (verified: planner now estimates 2,274,834 rows forcatalog_properties). This migration persists those settings in code so they survive and apply to any rebuilt database — it is idempotent against the current live state.Testing
npm run typecheck— passestsc --project server/tsconfig.jsonbuild — passesEXPLAINrow estimate corrected from 185 → 2,274,834.