fix(journal): serialize concurrent appends via lockfile #1002
carlos-alm wants to merge 4 commits into main from
Conversation
`appendJournalEntries` used `fs.appendFileSync` with no cross-process coordination, so a watcher session and a manual codegraph build in a second shell could interleave lines and corrupt `.codegraph/changes.journal` (truncated DELETED prefixes, partial entries, newline-less tails).

Wrap both `appendJournalEntries` and `writeJournalHeader` in a `withJournalLock` helper that:

- acquires `.codegraph/changes.journal.lock` via `fs.openSync(path, 'wx')` (atomic exclusive-create, cross-platform)
- stamps the holder PID for stale-lock detection
- retries every 25ms up to a 5s timeout
- steals the lock if the holder PID is dead or the lock file is older than 30s (covers crash-mid-write)
- always releases in `finally`

Zero new dependencies — uses only `node:fs` and `Atomics.wait` on a `SharedArrayBuffer` for synchronous retry sleep, preserving the existing sync API.

Fixes #996

Impact: 7 functions changed, 12 affected
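A minimal sketch of the acquire/retry shape the bullets above describe, assuming the names from the PR text (`withJournalLock`, the 25ms/5s intervals); the stale-steal and nonce details discussed later in the review are elided here:

```typescript
import * as fs from 'node:fs';

function sleepSync(ms: number): void {
  // Synchronous wait so the existing sync API is preserved.
  const end = process.hrtime.bigint() + BigInt(ms) * 1_000_000n;
  while (process.hrtime.bigint() < end) { /* spin */ }
}

function withJournalLock<T>(lockPath: string, fn: () => T): T {
  const deadline = Date.now() + 5_000; // 5s timeout from the PR description
  let fd: number | undefined;
  while (fd === undefined) {
    try {
      fd = fs.openSync(lockPath, 'wx'); // atomic exclusive-create
    } catch (e) {
      if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
      if (Date.now() > deadline) throw new Error('timed out waiting for journal lock');
      sleepSync(25); // retry interval from the PR description
    }
  }
  fs.writeSync(fd, `${process.pid}\n`); // PID stamp for stale-lock detection
  try {
    return fn();
  } finally {
    fs.closeSync(fd);
    try {
      fs.unlinkSync(lockPath);
    } catch {
      /* already released or stolen by another writer */
    }
  }
}
```

The shape matters more than the numbers: exclusive-create gives mutual exclusion, and the `finally` guarantees release even if `fn()` throws.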
Claude finished @carlos-alm's task — View job

PR Review: fix(journal): serialize concurrent appends via lockfile
Root Cause Analysis

Problem Being Solved: Race condition where concurrent journal writers (watch session + manual codegraph build) can interleave appends and corrupt `.codegraph/changes.journal`.

Why This Approach: Uses atomic file locking via `fs.openSync(path, 'wx')` exclusive-create, preserving the existing sync API.

Risk Assessment: LOW — well-implemented defensive approach with proper timeout, dead-PID detection, and stale-lock cleanup.

Backlog Compliance
Technical Review

Implementation Quality ⭐⭐⭐⭐☆ — Strong Points:
Implementation Details:
Test Coverage ⭐⭐⭐⭐☆ — New Tests Added:
Test Quality:
Potential Issues (Minor)

Critical Concerns

None. This is a solid fix for a real concurrency issue.

Final Recommendation

The only reason this isn't 5 stars is that it's solving an existing issue rather than adding exceptional new value, but the execution is excellent.
Greptile Summary

This PR wraps `appendJournalEntries` and `writeJournalHeader` in a `withJournalLock` helper.

Confidence Score: 5/5 — Safe to merge; all remaining findings are P2 style/quality suggestions that do not block correctness. Both prior P1 issues (TOCTOU steal race, event-loop blocking) are addressed in this revision. New findings are limited to orphaned crash-time tmp files (minor filesystem hygiene) and a test that validates format integrity but not inter-process concurrency — neither affects the lock's correctness on the happy path. No files require special attention beyond the P2 notes on `trySteal` cleanup and the concurrent-append test scope.

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant W as Watcher Process
    participant B as Build Process
    participant FS as Filesystem (.codegraph/)
    W->>FS: openSync('changes.journal.lock', 'wx') → fd_W
    W->>FS: writeSync(fd_W, PID_W + nonce_W)
    note over W: holds lock
    B->>FS: openSync('changes.journal.lock', 'wx') → EEXIST
    B->>FS: readFileSync(lock) → PID_W (alive)
    B->>FS: statSync(lock) → mtime recent
    note over B: spins sleepSync(25ms) × N
    W->>FS: appendFileSync('changes.journal', lines)
    W->>FS: readFileSync(lock) → nonce matches → unlinkSync(lock)
    note over W: releases lock
    B->>FS: openSync('changes.journal.lock', 'wx') → fd_B
    B->>FS: writeSync(fd_B, PID_B + nonce_B)
    B->>FS: appendFileSync('changes.journal', lines)
    B->>FS: readFileSync(lock) → nonce matches → unlinkSync(lock)
    note over B: releases lock
```
Reviews (4): Last reviewed commit: "Merge branch 'main' into fix/journal-ato..."
```typescript
let holderAlive = true;
try {
  const pidContent = fs.readFileSync(lockPath, 'utf-8').trim();
  holderAlive = isPidAlive(Number(pidContent));
} catch {
  /* unreadable — fall through to age check */
}

if (!holderAlive) {
  try {
    fs.unlinkSync(lockPath);
  } catch {
    /* another writer stole it first */
  }
  continue;
```
TOCTOU race allows two processes to concurrently enter the critical section
When multiple processes are waiting on the same stale (dead-PID) lockfile, they can each independently steal the lock from one another's fresh acquisition:

- Process A dies — lockfile contains dead PID A.
- Processes B and C both call `openSync('wx')`, both get `EEXIST`, and both read dead PID A.
- B unlinks and immediately re-acquires (new inode I1, `fn()` starts).
- C now executes `unlinkSync(lockPath)` — it unlinks B's live lockfile (I1), not the stale one.
- C loops, calls `openSync('wx')`, succeeds (new inode I2), and its `fn()` begins concurrently with B's.

Both B and C are now inside the critical section simultaneously — exactly the write-serialisation invariant this PR is meant to guarantee.

The standard mitigation is to verify ownership after the unlink/re-create cycle: write a nonce (or use the inode number) inside the lockfile, then re-read it after `openSync` succeeds and bail if it doesn't match yours. Alternatively, use `fs.renameSync(tmpfile, lockPath)` to perform the steal atomically (create a temp file, rename onto the lockfile — POSIX rename is atomic).
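A sketch of the rename-based steal suggested above; the helper name and surrounding retry loop are assumptions, not the PR's actual code:

```typescript
import * as fs from 'node:fs';
import * as crypto from 'node:crypto';

// Hypothetical helper: atomically steal a stale lockfile via tmp + rename,
// then verify ownership by re-reading our nonce. Returns the nonce on
// success, or null if another stealer's rename landed after ours.
function tryStealStaleLock(lockPath: string): string | null {
  const nonce = crypto.randomBytes(16).toString('hex');
  const tmpPath = `${lockPath}.${process.pid}.${nonce}.tmp`;
  fs.writeFileSync(tmpPath, `${process.pid}\n${nonce}\n`);
  fs.renameSync(tmpPath, lockPath); // POSIX rename is atomic
  // Ownership check: if the lockfile no longer holds our nonce, a
  // concurrent stealer won the race and we must back off, NOT unlink.
  const content = fs.readFileSync(lockPath, 'utf-8');
  return content.includes(nonce) ? nonce : null;
}
```

The key property is that a loser of the race sees a foreign nonce and retries instead of unlinking a live lock — the failure mode in the TOCTOU scenario above.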
Fixed in f5c737c. Replaced the unlink + openSync('wx') steal pattern with an atomic write-tmp + fs.renameSync steal. Each writer generates a random nonce, writes it into a temp file, and atomically renames onto the lockfile. After rename we re-read the lockfile and only proceed if our nonce is still there — if another stealer's rename landed after ours, we bail and retry instead of unlinking their live lock. Release also nonce-verifies before unlinking. Added a regression test that stages a lockfile with a different-writer nonce after a real steal cycle and asserts we do not retroactively unlink it.
```typescript
function sleepSync(ms: number): void {
  const buf = new Int32Array(new SharedArrayBuffer(4));
  Atomics.wait(buf, 0, 0, ms);
}
```
Atomics.wait freezes the Node.js event loop during lock contention
Atomics.wait is a synchronous, blocking call — it stops the entire V8 event loop for the full ms duration. In a watcher process, every filesystem notification, timer, and pending I/O callback is silenced for each 25 ms retry. In the worst case (5 000 ms timeout, 200 retries), the watcher becomes completely unresponsive for up to 5 seconds before ever throwing.
A lighter alternative that avoids a hard block on the event loop is a simple busy-spin with `process.hrtime.bigint()`:

```typescript
function sleepSync(ms: number): void {
  const end = process.hrtime.bigint() + BigInt(ms) * 1_000_000n;
  while (process.hrtime.bigint() < end) { /* spin */ }
}
```

This keeps each retry short and doesn't starve unrelated callbacks (though it does keep the CPU busy, which is acceptable for the brief per-retry duration).
Fixed in f5c737c. Replaced Atomics.wait with a short process.hrtime.bigint busy-spin per your suggestion. The 25ms retry interval keeps CPU burn negligible while letting pending FS events, timers, and I/O callbacks in watcher processes keep firing during contention.
Codegraph Impact Analysis

10 functions changed → 13 callers affected across 5 files
Address two Greptile review issues on the journal lockfile:
- P1 TOCTOU: when two stealers observed the same stale (dead-PID)
holder, one's unlink could cross the other's fresh openSync('wx')
acquisition, admitting both writers into the critical section. Replace
the unlink + openSync('wx') pattern with an atomic write-tmp + rename
steal, then verify via a random nonce. If another stealer's rename
landed after ours, we bail and retry instead of unlinking their live
lockfile. Release now also nonce-verifies before unlinking.
- P2 event-loop blockage: replace Atomics.wait with a short hrtime
busy-spin so pending FS events and timer callbacks in the watcher
keep firing during the 25ms retry window.
Add a regression test for the stale-lock steal race that asserts we
never unlink a lockfile whose nonce does not match our own.
Impact: 7 functions changed, 10 affected
```typescript
try {
  const fd = fs.openSync(lockPath, 'wx');
  try {
    fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
  } catch {
    /* PID stamp is advisory; fd is still exclusive */
  }
  return { fd, nonce };
} catch (e) {
  if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
}
```
Silent `writeSync` failure voids the nonce and breaks mutual exclusion

If `fs.writeSync` throws (e.g. ENOSPC, I/O error), the lockfile is created by `openSync('wx')` but remains empty — the nonce is never written. Two consequences compound:

- `releaseJournalLock` reads the empty file, `content.includes(lock.nonce)` is `false`, and the lockfile is orphaned rather than unlinked.
- Any concurrent waiter reads the empty file, computes `Number('') → 0`, then calls `isPidAlive(0)` which returns `false` (the `pid <= 0` guard). With `holderAlive = false`, it immediately calls `trySteal` — stealing a live, active lock and breaking mutual exclusion.
The comment "PID stamp is advisory; fd is still exclusive" only holds when there are no concurrent waiters, but a journal with a watcher + build process is exactly the concurrent scenario this lock is meant to protect.
Fix: when the write fails, release the fd and unlink synchronously (we still hold the exclusive fd at that point) and retry, rather than proceeding into fn() with an un-stamped lockfile.
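A sketch of that suggested fix; the helper name `stampOrAbort` and the boolean retry signal are illustrative assumptions, not the PR's code:

```typescript
import * as fs from 'node:fs';

// Hypothetical helper: stamp the freshly created lockfile, or tear it
// down while we still hold the exclusive fd so waiters never observe an
// empty (un-stamped) lock. Returns false to tell the caller to retry.
function stampOrAbort(lockPath: string, fd: number, nonce: string): boolean {
  try {
    fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
    return true; // stamped — safe to enter the critical section
  } catch {
    try { fs.closeSync(fd); } catch { /* best effort */ }
    try { fs.unlinkSync(lockPath); } catch { /* best effort */ }
    return false; // caller loops back to openSync('wx')
  }
}
```

On failure the lockfile is removed before anyone can misread it, so `Number('') → 0` never happens and no waiter is tempted to steal a live lock.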
Summary
Fixes #996.
`appendJournalEntries()` relied on `fs.appendFileSync` alone — that's atomic within a single call but not across concurrent writers. When a watch session and a manual `codegraph build` in another shell both write to `.codegraph/changes.journal`, their lines can interleave, producing truncated `DELETED` prefixes, partial paths, or a `\n`-less tail. A corrupted header then makes `readJournal` return `valid: false` and silently fall through to a hash scan — a silent perf regression, not a crash.

Fix
Wrap both `appendJournalEntries` and `writeJournalHeader` in `withJournalLock`:

- Acquire `.codegraph/changes.journal.lock` with `fs.openSync(path, 'wx')` — atomic exclusive-create on ext4/NTFS/APFS.
- Retry every 25ms up to a 5s timeout.
- Steal the lock if the holder PID is dead (`process.kill(pid, 0)`) or the file's mtime is older than 30s — covers crash-mid-write.
- Always release in `finally`.

Zero new dependencies. `Atomics.wait` on a `SharedArrayBuffer` provides synchronous retry sleep so the existing sync API is preserved (watcher and build stages call these functions synchronously).

Test plan
- `npx vitest run tests/unit/journal.test.ts` — 20/20 pass (17 existing + 3 new: lockfile cleanup, dead-PID stale-lock stealing, no-interleave across 200 mixed appends)
- `npx vitest run tests/builder/detect-changes.test.ts tests/integration/build.test.ts tests/integration/watcher-rebuild.test.ts` — 27/27 pass
- `npm run typecheck` — clean
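The no-interleave check from the test plan could be shaped roughly like the sketch below. The journal line format (`ADDED`/`DELETED` prefix followed by a path) is an assumption extrapolated from the "truncated DELETED prefixes" failure mode described in this PR, not the project's actual schema:

```typescript
import * as fs from 'node:fs';

// Hypothetical integrity check: every line must be a complete entry and
// the file must end with a newline — interleaved writers violate both.
function assertJournalIntact(journalPath: string): void {
  const text = fs.readFileSync(journalPath, 'utf-8');
  if (!text.endsWith('\n')) throw new Error('newline-less tail');
  for (const line of text.split('\n').slice(0, -1)) {
    // Assumed entry shape: "<VERB> <path>"; adjust to the real format.
    if (!/^(ADDED|DELETED)\s\S+$/.test(line)) {
      throw new Error(`corrupt entry: ${JSON.stringify(line)}`);
    }
  }
}
```

A concurrency test would run the 200 mixed appends first and then call a check like this on the resulting journal.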