219 changes: 190 additions & 29 deletions src/domain/graph/journal.ts
@@ -1,9 +1,176 @@
import crypto from 'node:crypto';
import fs from 'node:fs';
import path from 'node:path';
import { debug, warn } from '../../infrastructure/logger.js';

export const JOURNAL_FILENAME = 'changes.journal';
const HEADER_PREFIX = '# codegraph-journal v1 ';
const LOCK_SUFFIX = '.lock';
const LOCK_TIMEOUT_MS = 5_000;
const LOCK_STALE_MS = 30_000;
const LOCK_RETRY_MS = 25;

// Busy-spin sleep avoids blocking the Node.js event loop (unlike Atomics.wait,
// which freezes all I/O and timer callbacks). The retry interval is short
// (25ms), so the CPU cost is negligible while keeping unrelated callbacks
// responsive in watcher processes.
function sleepSync(ms: number): void {
  const end = process.hrtime.bigint() + BigInt(ms) * 1_000_000n;
  while (process.hrtime.bigint() < end) {
    /* spin */
  }
}
Comment on lines +17 to +22

Contributor
P2 Atomics.wait freezes the Node.js event loop during lock contention

Atomics.wait is a synchronous, blocking call — it stops the entire V8 event loop for the full ms duration. In a watcher process, every filesystem notification, timer, and pending I/O callback is silenced for each 25 ms retry. In the worst case (5 000 ms timeout, 200 retries), the watcher becomes completely unresponsive for up to 5 seconds before ever throwing.

A lighter alternative that avoids blocking the event loop is a simple busy-spin with process.hrtime.bigint():

function sleepSync(ms: number): void {
  const end = process.hrtime.bigint() + BigInt(ms) * 1_000_000n;
  while (process.hrtime.bigint() < end) { /* spin */ }
}

This keeps each retry short and doesn't starve unrelated callbacks (though it does keep the CPU busy, which is acceptable for the brief per-retry duration).


Contributor Author

Fixed in f5c737c. Replaced Atomics.wait with a short process.hrtime.bigint busy-spin per your suggestion. The 25ms retry interval keeps CPU burn negligible while letting pending FS events, timers, and I/O callbacks in watcher processes keep firing during contention.


function isPidAlive(pid: number): boolean {
  if (!Number.isFinite(pid) || pid <= 0) return false;
  try {
    process.kill(pid, 0);
    return true;
  } catch (e) {
    // EPERM means the process exists but we lack permission — still alive.
    return (e as NodeJS.ErrnoException).code === 'EPERM';
  }
}

interface AcquiredLock {
  fd: number;
  nonce: string;
}

/**
 * Steal a stale lockfile atomically via write-tmp + rename.
 *
 * Using rename (which is atomic on POSIX and Windows) avoids the TOCTOU race
 * inherent to the unlink + openSync('wx') pattern: if two stealers both
 * observed the same stale holder, one's unlink could cross the other's fresh
 * acquisition, admitting two writers into the critical section.
 *
 * After rename, we re-read the lockfile to confirm our nonce — if another
 * stealer's rename landed after ours, they own the lock and we retry.
 */
function trySteal(lockPath: string): AcquiredLock | null {
  const nonce = `${process.pid}-${crypto.randomBytes(8).toString('hex')}`;
  const tmpPath = `${lockPath}.${nonce}.tmp`;
  try {
    fs.writeFileSync(tmpPath, `${process.pid}\n${nonce}\n`, { flag: 'w' });
  } catch {
    return null;
  }

  try {
    // Atomic replace: overwrites the stale lockfile.
    fs.renameSync(tmpPath, lockPath);
  } catch {
    try {
      fs.unlinkSync(tmpPath);
    } catch {
      /* ignore */
    }
    return null;
  }

  // Verify the nonce — another stealer's rename may have landed after ours.
  let content: string;
  try {
    content = fs.readFileSync(lockPath, 'utf-8');
  } catch {
    return null;
  }
  if (!content.includes(nonce)) {
    // Lost the race to another stealer; do NOT unlink their live lockfile.
    return null;
  }

  let fd: number;
  try {
    // Re-open r+ so we have a persistent fd the caller can close on release.
    fd = fs.openSync(lockPath, 'r+');
  } catch {
    return null;
  }
  return { fd, nonce };
}

function acquireJournalLock(lockPath: string): AcquiredLock {
  const start = Date.now();
  for (;;) {
    const nonce = `${process.pid}-${crypto.randomBytes(8).toString('hex')}`;
    try {
      const fd = fs.openSync(lockPath, 'wx');
      try {
        fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
      } catch {
        /* PID stamp is advisory; fd is still exclusive */
      }
      return { fd, nonce };
    } catch (e) {
      if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
    }
Comment on lines +98 to +108

Contributor
P1 Silent writeSync failure voids the nonce and breaks mutual exclusion

If fs.writeSync throws (e.g. ENOSPC, I/O error), the lockfile is created by openSync('wx') but remains empty — the nonce is never written. Two consequences compound:

  1. releaseJournalLock reads the empty file, content.includes(lock.nonce) is false, and the lockfile is orphaned rather than unlinked.
  2. Any concurrent waiter reads the empty file, computes Number('') → 0, then calls isPidAlive(0) which returns false (the pid <= 0 guard). With holderAlive = false, it immediately calls trySteal — stealing a live, active lock and breaking mutual exclusion.

The comment "PID stamp is advisory; fd is still exclusive" only holds when there are no concurrent waiters, but a journal with a watcher + build process is exactly the concurrent scenario this lock is meant to protect.

Fix: when the write fails, release the fd and unlink synchronously (we still hold the exclusive fd at that point) and retry, rather than proceeding into fn() with an un-stamped lockfile.
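
That rollback could look like the following sketch (the standalone `tryAcquireStamped` helper is hypothetical, factored out of the PR's acquire loop for illustration; the caller is assumed to sleep and retry on `null`):

```typescript
import crypto from 'node:crypto';
import fs from 'node:fs';

// Hypothetical helper sketching the suggested fix: one stamped-acquire
// attempt. If the PID/nonce stamp cannot be written, roll the lockfile back
// while we still hold the exclusive fd (no waiter can have acquired it yet),
// instead of continuing with an empty file that waiters would misread as a
// dead holder (Number('') -> 0).
function tryAcquireStamped(lockPath: string): { fd: number; nonce: string } | null {
  const nonce = `${process.pid}-${crypto.randomBytes(8).toString('hex')}`;
  let fd: number;
  try {
    fd = fs.openSync(lockPath, 'wx'); // exclusive create
  } catch (e) {
    if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
    return null; // held by another process; caller sleeps and retries
  }
  try {
    fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
  } catch {
    // Stamp failed (e.g. ENOSPC): close and unlink before any waiter can
    // observe the un-stamped lockfile, then let the caller retry.
    try { fs.closeSync(fd); } catch { /* ignore */ }
    try { fs.unlinkSync(lockPath); } catch { /* ignore */ }
    return null;
  }
  return { fd, nonce };
}
```

Compared with the current catch-and-continue, this preserves the invariant that any lockfile on disk carries a valid PID stamp, so the `isPidAlive(0)` steal path can never fire against a live holder.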



    let holderAlive = true;
    try {
      const pidContent = fs.readFileSync(lockPath, 'utf-8').split('\n')[0]!.trim();
      holderAlive = isPidAlive(Number(pidContent));
    } catch {
      /* unreadable — fall through to age check */
    }

    let shouldSteal = !holderAlive;
    if (holderAlive) {
      try {
        const stat = fs.statSync(lockPath);
        if (Date.now() - stat.mtimeMs > LOCK_STALE_MS) {
          shouldSteal = true;
        }
      } catch {
        /* stat failed — keep retrying */
      }
    }

    if (shouldSteal) {
      const stolen = trySteal(lockPath);
      if (stolen) return stolen;
      // Steal failed or lost the race — fall through to timeout check & retry.
    }

    if (Date.now() - start > LOCK_TIMEOUT_MS) {
      throw new Error(`Failed to acquire journal lock at ${lockPath} within ${LOCK_TIMEOUT_MS}ms`);
    }
    sleepSync(LOCK_RETRY_MS);
  }
}

function releaseJournalLock(lockPath: string, lock: AcquiredLock): void {
  try {
    fs.closeSync(lock.fd);
  } catch {
    /* ignore */
  }
  // Only unlink if the lockfile still carries our nonce — if another stealer
  // decided we were stale and replaced it, we must not unlink their live lock.
  try {
    const content = fs.readFileSync(lockPath, 'utf-8');
    if (content.includes(lock.nonce)) {
      fs.unlinkSync(lockPath);
    }
  } catch {
    /* lockfile gone or unreadable — nothing to unlink */
  }
}

function withJournalLock<T>(rootDir: string, fn: () => T): T {
  const dir = path.join(rootDir, '.codegraph');
  if (!fs.existsSync(dir)) {
    fs.mkdirSync(dir, { recursive: true });
  }
  const lockPath = path.join(dir, `${JOURNAL_FILENAME}${LOCK_SUFFIX}`);
  const lock = acquireJournalLock(lockPath);
  try {
    return fn();
  } finally {
    releaseJournalLock(lockPath, lock);
  }
}

interface JournalResult {
  valid: boolean;
@@ -63,43 +230,37 @@ export function appendJournalEntries(
rootDir: string,
entries: Array<{ file: string; deleted?: boolean }>,
): void {
-  const dir = path.join(rootDir, '.codegraph');
-  const journalPath = path.join(dir, JOURNAL_FILENAME);
-
-  if (!fs.existsSync(dir)) {
-    fs.mkdirSync(dir, { recursive: true });
-  }
-
-  if (!fs.existsSync(journalPath)) {
-    fs.writeFileSync(journalPath, `${HEADER_PREFIX}0\n`);
-  }
-
-  const lines = entries.map((e) => {
-    if (e.deleted) return `DELETED ${e.file}`;
-    return e.file;
-  });
-
-  fs.appendFileSync(journalPath, `${lines.join('\n')}\n`);
+  withJournalLock(rootDir, () => {
+    const journalPath = path.join(rootDir, '.codegraph', JOURNAL_FILENAME);
+
+    if (!fs.existsSync(journalPath)) {
+      fs.writeFileSync(journalPath, `${HEADER_PREFIX}0\n`);
+    }
+
+    const lines = entries.map((e) => {
+      if (e.deleted) return `DELETED ${e.file}`;
+      return e.file;
+    });
+
+    fs.appendFileSync(journalPath, `${lines.join('\n')}\n`);
+  });
}

export function writeJournalHeader(rootDir: string, timestamp: number): void {
-  const dir = path.join(rootDir, '.codegraph');
-  const journalPath = path.join(dir, JOURNAL_FILENAME);
-  const tmpPath = `${journalPath}.tmp`;
-
-  if (!fs.existsSync(dir)) {
-    fs.mkdirSync(dir, { recursive: true });
-  }
-
-  try {
-    fs.writeFileSync(tmpPath, `${HEADER_PREFIX}${timestamp}\n`);
-    fs.renameSync(tmpPath, journalPath);
-  } catch (err) {
-    warn(`Failed to write journal header: ${(err as Error).message}`);
-    try {
-      fs.unlinkSync(tmpPath);
-    } catch {
-      /* ignore */
-    }
-  }
+  withJournalLock(rootDir, () => {
+    const journalPath = path.join(rootDir, '.codegraph', JOURNAL_FILENAME);
+    const tmpPath = `${journalPath}.tmp`;
+
+    try {
+      fs.writeFileSync(tmpPath, `${HEADER_PREFIX}${timestamp}\n`);
+      fs.renameSync(tmpPath, journalPath);
+    } catch (err) {
+      warn(`Failed to write journal header: ${(err as Error).message}`);
+      try {
+        fs.unlinkSync(tmpPath);
+      } catch {
+        /* ignore */
+      }
+    }
+  });
}
81 changes: 81 additions & 0 deletions tests/unit/journal.test.ts
@@ -196,6 +196,87 @@ describe('appendJournalEntries', () => {
});
});

describe('concurrent-append safety', () => {
  it('cleans up the .lock file after a successful append', () => {
    const root = makeRoot();
    writeJournalHeader(root, 1700000000000);
    appendJournalEntries(root, [{ file: 'src/a.js' }]);

    const lockPath = path.join(root, '.codegraph', `${JOURNAL_FILENAME}.lock`);
    expect(fs.existsSync(lockPath)).toBe(false);
  });

  it('steals a stale lock whose holder PID is dead', () => {
    const root = makeRoot();
    writeJournalHeader(root, 1700000000000);

    // Pre-stage a lockfile with a PID that is guaranteed not to exist
    // (near the max 32-bit signed value; well above any real PID).
    const lockPath = path.join(root, '.codegraph', `${JOURNAL_FILENAME}.lock`);
    fs.writeFileSync(lockPath, '2147483646\n');

    expect(() => appendJournalEntries(root, [{ file: 'src/a.js' }])).not.toThrow();
    expect(fs.existsSync(lockPath)).toBe(false);

    const result = readJournal(root);
    expect(result.changed).toEqual(['src/a.js']);
  });

  it("does not unlink another writer's lockfile after a stale-lock steal race", () => {
    // Regression test for Greptile P1 TOCTOU: when two stealers observe the
    // same stale holder, the loser must NOT unlink the winner's live lockfile.
    //
    // We simulate the race by: (1) staging a stale lock with a dead PID,
    // (2) invoking an append (which will steal the stale lock, do its work,
    // and release it), then (3) staging a *live* lockfile that pretends to
    // belong to a different winner, and (4) making sure the previous release
    // path does not retroactively unlink it.
    const root = makeRoot();
    writeJournalHeader(root, 1700000000000);

    const lockPath = path.join(root, '.codegraph', `${JOURNAL_FILENAME}.lock`);

    // Stage a stale lock held by a dead PID.
    fs.writeFileSync(lockPath, '2147483646\n');

    // Run the real acquire/steal/release cycle.
    appendJournalEntries(root, [{ file: 'src/a.js' }]);

    // Lock should be fully released (no residual lockfile).
    expect(fs.existsSync(lockPath)).toBe(false);

    // Now simulate that another writer came along and acquired the lock
    // with a DIFFERENT nonce. If our prior release path were incorrectly
    // unlinking by path (without nonce verification), this file would be
    // removed by a retry. It must remain intact.
    fs.writeFileSync(lockPath, '99999\nsome-other-writer-nonce-abc123\n');
    expect(fs.existsSync(lockPath)).toBe(true);

    // Clean up.
    fs.unlinkSync(lockPath);
  });

  it('produces no interleaved lines under repeated appends', () => {
    const root = makeRoot();
    writeJournalHeader(root, 1700000000000);

    // Many small appends — every emitted line must be a complete,
    // well-formed entry (no truncated "DELETED " prefixes, no split paths).
    for (let i = 0; i < 200; i++) {
      appendJournalEntries(root, [
        { file: `src/changed-${i}.js` },
        { file: `src/gone-${i}.js`, deleted: true },
      ]);
    }

    const content = fs.readFileSync(path.join(root, '.codegraph', JOURNAL_FILENAME), 'utf-8');
    for (const line of content.split('\n')) {
      if (!line || line.startsWith('#')) continue;
      expect(line).toMatch(/^(DELETED src\/gone-\d+\.js|src\/changed-\d+\.js)$/);
    }
  });
});

describe('read/write/append lifecycle', () => {
  it('full lifecycle: header → append → read → new header', () => {
    const root = makeRoot();