fix(snapshot): close TOCTOU race on save/restore/delete#1003
fix(snapshot): close TOCTOU race on save/restore/delete#1003carlos-alm wants to merge 5 commits intomainfrom
Conversation
Use temp-file + atomic rename for `snapshotSave --force` and `snapshotRestore` so two concurrent saves or a restore racing with a reader can't observe a missing or partially-written database file. `snapshotDelete` now unlinks directly and maps ENOENT to the existing "not found" error instead of a separate existsSync check. Adds two regression tests: temp-file cleanup on success, and two concurrent --force saves producing a valid SQLite destination. Closes #995 Impact: 3 functions changed, 1 affected
|
Claude finished @carlos-alm's task —— View job PR Review: fix(snapshot): close TOCTOU race on save/restore/deleteTodo List:
Root Cause AnalysisProblem Being Solved: TOCTOU (Time-Of-Check-Time-Of-Use) race conditions in snapshot operations where concurrent save operations could corrupt files, and restore operations could leave readers with partially-written databases. Why This Approach: The fix implements atomic operations using temp files + atomic Risk Assessment: LOW RISK - The changes preserve existing public APIs, use well-established atomic file patterns, and include comprehensive regression tests. Backlog Compliance
Technical ReviewImplementation Quality ⭐⭐⭐⭐☆snapshotSave changes (lines 46-74):
snapshotRestore changes (lines 95-123):
snapshotDelete changes (lines 163-170):
Test Coverage ⭐⭐⭐⭐⭐New regression tests:
Test quality observations:
Code Quality DetailsMinor observations:
Architecture alignment:
Critical ConcernsNone. This is a solid defensive fix that eliminates race conditions without breaking changes. Final Recommendation
Minor suggestions (non-blocking):
This PR should be merged without concerns. |
Greptile SummaryThis PR closes the TOCTOU races in Confidence Score: 5/5Safe to merge — all three TOCTOU surfaces are correctly closed and prior review concerns have been fully addressed. No P0 or P1 bugs found. The temp-file lifecycle (pre-unlink before VACUUM, outer-catch cleanup on failure, best-effort unlink after linkSync) is correct in every code path. The non-force linkSync EEXIST check is POSIX-correct and NTFS-compatible. Both race tests use real worker threads with proper message-passing, and the assertions validate the contract rather than implementation details. Previous reviewer concerns (pseudo-concurrency, non-force TOCTOU window) are both resolved. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([snapshotSave called]) --> B{force?}
B -- "false" --> C{existsSync dest?}
C -- yes --> D([throw ConfigError fast-path])
C -- no --> E[mkdirSync snapshots dir]
B -- "true" --> E
E --> F["generate unique tmp path\n(pid + Date.now() + randomBytes)"]
F --> G[VACUUM INTO tmp]
G --> H{force?}
H -- "true" --> I["renameSync(tmp → dest)\natomic overwrite"]
H -- "false" --> J["linkSync(tmp, dest)\natomic guard"]
J -- "EEXIST" --> K[outer catch: unlink tmp] --> L([throw ConfigError])
J -- "other error" --> K
J -- "success" --> M["unlinkSync(tmp)\nbest-effort cleanup"]
I -- "success" --> N[statSync dest → return]
M --> N
I -- "error" --> O[outer catch: unlink tmp] --> P([rethrow])
N --> Q([return name/path/size])
Reviews (5): Last reviewed commit: "Merge branch 'main' into fix/snapshot-to..." | Re-trigger Greptile |
| it('does not corrupt output when two --force saves race on the same name', async () => { | ||
| // Prime the target so both calls take the --force overwrite path. | ||
| snapshotSave('race', { dbPath }); | ||
|
|
||
| // Issue both concurrently. Pre-fix, the existsSync/unlinkSync/VACUUM INTO | ||
| // sequence could interleave and produce a corrupt DB; post-fix, rename | ||
| // is atomic so the winner's file is intact and the loser either succeeds | ||
| // (overwriting) or fails cleanly. | ||
| const results = await Promise.allSettled([ | ||
| Promise.resolve().then(() => snapshotSave('race', { dbPath, force: true })), | ||
| Promise.resolve().then(() => snapshotSave('race', { dbPath, force: true })), | ||
| ]); | ||
|
|
||
| // At least one save must have succeeded. | ||
| const succeeded = results.filter((r) => r.status === 'fulfilled'); | ||
| expect(succeeded.length).toBeGreaterThanOrEqual(1); | ||
|
|
||
| // The final file must be a valid SQLite DB with the expected contents. | ||
| const finalPath = path.join(snapshotsDir(dbPath), 'race.db'); | ||
| const db = new Database(finalPath, { readonly: true }); | ||
| const rows = db.prepare('SELECT name FROM nodes').all(); | ||
| db.close(); | ||
| expect(rows).toEqual([{ name: 'hello' }]); | ||
| }); |
There was a problem hiding this comment.
Race test doesn't exercise actual concurrency
snapshotSave relies entirely on better-sqlite3, a synchronous native module. Wrapping two synchronous calls in Promise.resolve().then(...) produces microtasks that execute sequentially — the first .then() callback runs to completion before the second one is scheduled. Promise.allSettled here is equivalent to two back-to-back snapshotSave calls: the second never starts until the first's renameSync has already finished.
This test would pass identically against the pre-fix code, because there is no opportunity for interleaving on the Node.js microtask queue. The test validates the success path (which is already covered by the "overwrites with force" test) rather than the TOCTOU concern it describes.
Genuine concurrency in this module requires worker_threads. A worker-based harness that spawns two threads and synchronises them to fire their snapshotSave calls simultaneously would be the right approach, though admittedly heavyweight for a unit test. If that's out of scope, the test comment should be softened to avoid implying concurrent execution is happening.
There was a problem hiding this comment.
Addressed — replaced the Promise-based pseudo-concurrency with a worker_threads harness (tests/unit/snapshot-race-worker.mjs) that spawns two real threads racing on snapshotSave. The new test actually exercises the TOCTOU path and would fail against the pre-fix code. Also added a second test for the non-force concurrent-save case.
| if (!options.force && fs.existsSync(dest)) { | ||
| throw new ConfigError(`Snapshot "${name}" already exists. Use --force to overwrite.`); | ||
| } |
There was a problem hiding this comment.
Non-force path still has a TOCTOU window
With force: false, two concurrent callers can both pass existsSync(dest) → false, both proceed through VACUUM INTO tmp, and both successfully renameSync(tmp, dest) — silently overwriting each other with no error. The user expecting a "snapshot already exists" guard gets no protection in this race.
This is a narrower window than the --force case (the PR's stated target), so treating it as a known limitation is reasonable, but worth noting here since the fix otherwise closes most of the race surface.
There was a problem hiding this comment.
Addressed — closed the non-force TOCTOU by replacing renameSync with linkSync(tmp, dest) on the non-force path. linkSync fails atomically with EEXIST if dest already exists, making the existence check authoritative even under concurrent writers. The cheap existsSync above is kept as a fail-fast for the common case. Added a regression test (atomically rejects a concurrent non-force save when one already won) that spawns two worker threads and asserts exactly one wins with the other receiving the "already exists" error.
Codegraph Impact Analysis6 functions changed → 6 callers affected across 3 files
|
Address Greptile review feedback: 1. Non-force path still had a TOCTOU window: two concurrent callers could both pass existsSync(dest) and overwrite each other silently. Replace the final rename with linkSync(tmp, dest), which fails atomically with EEXIST if another writer won the race, then unlinkSync(tmp). The fail-fast existsSync above is retained as a cheap guard; linkSync is the authoritative check. 2. The original race regression test used Promise.resolve().then(...) on synchronous better-sqlite3 calls, so both microtasks executed sequentially and the test could not reproduce the TOCTOU it claimed to guard. Replace it with a worker_threads-based harness that spawns two workers racing on snapshotSave, plus a new non-force race test that verifies exactly one caller wins and the other sees a clear "already exists" error. 3. process.pid is shared across worker_threads in the same process, so add randomBytes(6) to the temp filename to keep concurrent callers in any thread from colliding on the same tmp path. 4. Log best-effort cleanup failures at debug so repeated tmp leaks surface in troubleshooting without spamming normal runs (per Claude review's minor suggestion). Impact: 2 functions changed, 1 affected
|
Addressed review feedback: Greptile P2 — non-force TOCTOU window: Replaced the final Greptile P2 — race test didn't exercise concurrency: Replaced the Claude minor suggestion — debug-log cleanup failures: Best-effort Incidental fix — worker_thread pid collisions: All 29 tests in |
|
Claude finished @carlos-alm's task —— View job Review of TOCTOU Race Fixes
Assessment: Excellent Fixes ⭐⭐⭐⭐⭐The changes effectively address both P2 issues from Greptile's review with well-engineered solutions: 🔧 Fixed Issue #1: Race Test Concurrencysrc/features/snapshot.ts: Added tests/unit/snapshot-race-worker.mjs: Clean worker_threads harness that:
Updated test: Now uses 🔧 Fixed Issue #2: Non-Force TOCTOUKey change (lines 84-92): Replaced
New regression test: 🎯 Code Quality Improvements
🧪 Test Coverage
🚀 Implementation DetailsThe
Final RecommendationLGTM - The fixes are well-engineered, thoroughly tested, and address the exact race conditions identified. The worker_threads test harness is a nice addition that could be reused for other concurrency testing. Ready to merge. |
Summary
snapshotSave --forcenow writesVACUUM INTOto a per-pid temp file and atomicallyfs.renameSyncs over the destination. Two concurrent saves can no longer see a missing file or interleave their VACUUM writes.snapshotRestorecopies the snapshot to a temp path next to the live DB and atomically renames, so readers never observe a partially-written database file. WAL/SHM sidecar removal switched tounlink+ ENOENT catch.snapshotDeletecollapses theexistsSync/unlinkSyncpair into a singleunlinkSync, mapping ENOENT to the existing "not found" error.Public signatures of
snapshotSave/snapshotRestore/snapshotDeleteare unchanged; CLI callers insrc/cli/commands/snapshot.tsneed no updates.Closes #995
Test plan
leaves no temp files in snapshots dir after success— ensures the temp artifact is renamed (not leaked) on the success path.does not corrupt output when two --force saves race on the same name— issues two concurrentsnapshotSave(..., { force: true })calls viaPromise.allSettledand asserts the surviving file is still a valid SQLite DB with the expected contents. Pre-fix this could produce a truncated/interleaved file.biome checkclean on both changed files.codegraph snapshot save checkpoint --forceagainst the same.codegraph/no longer produce a truncated destination.