
feat: add retry backoff for connection errors in retryOperation #775

Draft
leshniak wants to merge 4 commits into Expensify:main from callstack-internal:feat/retry-backoff-connection-errors

Conversation

@leshniak leshniak commented Apr 17, 2026

Details

Add exponential backoff with jitter to OnyxUtils.retryOperation for non-capacity storage errors, framed as an instrumented experiment to determine the right mitigation strategy.

Context: Connection-class errors — Chromium backing store failures (26.3%), WebKit connection drops (19.0%), and closing-database errors (6.4%) — account for 51.7% of all storage failures (investigation). Analysis of 7-day production logs via VictoriaLogs shows:

  • 0% recovery rate across 5 immediate retries — all attempts complete within the same millisecond
  • 100% retry exhaustion — 67,404 out of ~67,400 initial failures exhaust all retries
  • Users continue writing successfully to other keys in the same session (e.g. one user: 8,920 exhaustions alongside 28,206 successful Onyx ops)
  • IDB enters a degraded state rather than failing completely — successful writes interleaved with failures in the same request

We have no data on whether introducing a delay (100ms-1600ms) would allow recovery. This PR adds backoff as a low-risk experiment to collect that data.

Changes:

  • lib/OnyxUtils.ts: Added CONNECTION_ERRORS constants (IDB + SQLite), backoff config (RETRY_BASE_DELAY_MS=100, RETRY_JITTER_FACTOR=0.25), wait()/getRetryDelay() helpers, wired backoff into non-capacity error branch of retryOperation
  • Backoff schedule: 100ms → 200ms → 400ms → 800ms → 1600ms (±25% jitter; ~3.1s nominal total, up to ~3.9s if every delay draws the maximum jitter)
  • Connection errors get a specific log message with delay duration for production observability
  • Capacity errors (QuotaExceeded, disk full) keep immediate retry with eviction — unchanged
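
As a rough sketch of the helpers listed above: the constant names and values come from this PR, but the function signatures, the rounding, and the injectable `random` parameter are assumptions made for illustration, not the actual code in lib/OnyxUtils.ts.

```typescript
// Constant names and values as stated in this PR description.
const RETRY_BASE_DELAY_MS = 100;
const RETRY_JITTER_FACTOR = 0.25;

/** Resolves after `ms` milliseconds. */
function wait(ms: number): Promise<void> {
    return new Promise<void>((resolve) => {
        setTimeout(resolve, ms);
    });
}

/**
 * Delay before retry number `attempt` (0-based): base * 2^attempt, scaled by a
 * random factor in [1 - jitter, 1 + jitter).
 * ASSUMPTION: `random` is injectable here only so the sketch is deterministic in tests.
 */
function getRetryDelay(attempt: number, random: () => number = Math.random): number {
    const baseDelay = RETRY_BASE_DELAY_MS * 2 ** attempt;
    // Map random() from [0, 1) onto [-RETRY_JITTER_FACTOR, +RETRY_JITTER_FACTOR).
    const jitter = (random() * 2 - 1) * RETRY_JITTER_FACTOR;
    return Math.round(baseDelay * (1 + jitter));
}

/** Sums the delays for `retries` consecutive retries under a fixed random source. */
function totalDelay(retries: number, random: () => number): number {
    let total = 0;
    for (let attempt = 0; attempt < retries; attempt++) {
        total += getRetryDelay(attempt, random);
    }
    return total;
}
```

Under these assumptions, a `random` fixed at 0.5 reproduces the nominal schedule (100, 200, 400, 800, 1600 ms, 3100 ms in total), while the jitter bounds put the total between 2325 ms and just under 3875 ms.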

Next steps based on production data:

  • If recovery rate at 100-1600ms delays improves → tune constants
  • If recovery rate remains ~0% → pivot to fail-fast (fewer retries + error propagation) or reconnection strategy (close/reopen IDB before retry)

Related Issues

Expensify/App#87782

Automated Tests

Updated 2 existing retry tests to use fake timers (backoff delays require timer advancement). Added 3 new tests:

  • should apply exponential backoff delay for non-capacity errors — verifies delay count and exponential growth pattern
  • should log connection error with backoff delay info — verifies connection-specific log message
  • should NOT apply backoff delay for capacity errors (immediate retry with eviction) — verifies capacity errors remain immediate

All 437 tests pass.
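
The behaviour these tests check can be approximated without a test framework. The sketch below swaps fake timers for an injected `sleep` function that records each requested delay; `retryWithBackoff`, `RetryOptions`, `recordDelays`, and the `QuotaExceededError` name check are hypothetical stand-ins, not the PR's `retryOperation` or its test code, and jitter and eviction are left out for brevity.

```typescript
type Sleep = (ms: number) => Promise<void>;

interface RetryOptions {
    maxRetries: number;
    baseDelayMs: number;
    isCapacityError: (error: unknown) => boolean;
    sleep: Sleep;
}

/** Retries `operation`, backing off exponentially only for non-capacity errors. */
async function retryWithBackoff<T>(operation: () => Promise<T>, options: RetryOptions): Promise<T> {
    const {maxRetries, baseDelayMs, isCapacityError, sleep} = options;
    for (let attempt = 0; ; attempt++) {
        try {
            return await operation();
        } catch (error) {
            if (attempt >= maxRetries) {
                throw error; // retries exhausted
            }
            // Capacity errors retry immediately (the real code would evict data first).
            if (!isCapacityError(error)) {
                await sleep(baseDelayMs * 2 ** attempt);
            }
        }
    }
}

/** Runs an always-failing operation and returns the delays that were requested. */
async function recordDelays(error: Error): Promise<number[]> {
    const delays: number[] = [];
    const sleep: Sleep = async (ms) => {
        delays.push(ms); // record instead of actually waiting
    };
    const alwaysFail = (): Promise<void> => Promise.reject(error);
    await retryWithBackoff(alwaysFail, {
        maxRetries: 5,
        baseDelayMs: 100,
        // ASSUMPTION: capacity errors are identified by name in this sketch.
        isCapacityError: (e) => e instanceof Error && e.name === 'QuotaExceededError',
        sleep,
    }).catch(() => undefined);
    return delays;
}
```

With a generic `Error`, `recordDelays` resolves to `[100, 200, 400, 800, 1600]`; with an error whose `name` is `QuotaExceededError`, it resolves to an empty array because no backoff is applied.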

Manual Tests

  1. Verify npm run typecheck passes
  2. Verify npm run lint passes
  3. Verify npm test passes (437/437)
  4. Integrate with Expensify/App and verify storage operations still work correctly on all platforms

Author Checklist

  • I linked the correct issue in the ### Related Issues section above
  • I wrote clear testing steps that cover the changes made in this PR
    • I added steps for local testing in the Tests section
    • I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability).
  • I included screenshots or videos for tests on all platforms
  • I ran the tests on all platforms & verified they passed on:
    • Android / native
    • Android / Chrome
    • iOS / native
    • iOS / Safari
    • MacOS / Chrome / Safari
  • I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
  • I followed proper code patterns (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
    • I verified that the left part of a conditional rendering a React component is a boolean and NOT a string, e.g. myBool && <MyComponent />.
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained why the code was doing something instead of only explaining what the code was doing.
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named index.js. All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I followed the guidelines as stated in the Review Guidelines
  • I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.js or at the top of the file that uses the constant) are defined as such
  • I verified that if a function's arguments changed that all usages have also been updated correctly
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.
  • I have checked off every checkbox in the PR author checklist, including those that don't apply to this PR.

Screenshots/Videos

Android: Native

N/A — library-level change, no UI

Android: mWeb Chrome

N/A — library-level change, no UI

iOS: Native

N/A — library-level change, no UI

iOS: mWeb Safari

N/A — library-level change, no UI

MacOS: Chrome / Safari

N/A — library-level change, no UI

leshniak and others added 4 commits April 16, 2026 22:34
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Errors like lost IDB connections, closing databases, and backing store
failures now wait with exponential backoff (100ms * 2^attempt +/- 25%
jitter) before retrying, giving the DB connection time to recover.

Capacity errors (QuotaExceeded, disk full) keep immediate retry with
eviction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Existing retry tests now use fake timers to handle backoff delays.
New tests verify: exponential delay progression, connection error
logging, and capacity errors remaining immediate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>