Fix bugs from code audit (server respawn, leaks, hardening)#4427
Fix bugs from code audit (server respawn, leaks, hardening)#4427evanpelle wants to merge 1 commit into
Conversation
Verified findings from a code audit and fixed the legitimate, well-scoped ones. Notable: crashed cluster workers were never respawned because worker.process.env is undefined in the exit handler, so WORKER_ID was always undefined (confirmed empirically). Fixes: - Master: respawn crashed workers via a worker.id -> WORKER_ID registry instead of the always-undefined worker.process.env.WORKER_ID. - GameServer: report Finished when _hasEnded so abandoned private lobbies are cleaned up instead of lingering until max duration; remove sockets from the websockets Set on close/rejoin; reset hasReachedMaxPlayerCount when a player leaves before start (avoids premature under-capacity starts, e.g. ranked 1v1). - Worker: reap WebSocket connections that upgrade but never authenticate (30s join timeout) to prevent Slowloris-style FD exhaustion. - ServerEnv: add a 1h TTL to the cached JWKS public key so IdP key rotation doesn't break all logins until a restart. - SpawnExecution: ignore a client-supplied tile in random-spawn mode so a forged spawn intent can't pick its own coordinates (+ test). - GameRenderer / MultiTabDetector: remove window listeners on teardown (bind once so add/removeEventListener references match). - LocalPersistantStats: guard localStorage access with try/catch so sandboxed iframes / disabled-storage contexts don't crash game start. - StructurePass: reuse per-frame uniform arrays instead of allocating each draw. Also documents a known, intentionally-unfixed auth gap on /api/archive_singleplayer_game (pending a product decision on logged-out users). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
WalkthroughMultiple focused bug fixes and hardening patches: client renderer teardown now calls ChangesClient Resource Cleanup & Resilience
Server Security & Lifecycle Hardening
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/server/ServerEnv.ts (1)
96-119: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winKeep using the old key when refetch fails, so an IdP outage does not break all logins.
Before this change, once the key was cached it worked forever. Now, after the 1 hour TTL passes, every call goes back to
fetch. If the JWKS endpoint is briefly down at that moment, this method throws,verifyClientTokenreturns an error, and every join/rejoin getsws.close(1002)— even for players whose tokens the old cached key could still verify.The TTL goal (pick up key rotation) and resilience are both possible: try to refresh, but fall back to the still-cached key if the refresh fails.
🛡️ Fall back to the cached key on refresh failure
static async jwkPublicKey(): Promise<JWK> { if ( ServerEnv.publicKey && Date.now() - ServerEnv.publicKeyFetchedAt < ServerEnv.PUBLIC_KEY_TTL_MS ) { return ServerEnv.publicKey; } - const jwksUrl = ServerEnv.jwtIssuer() + "/.well-known/jwks.json"; - console.log(`Fetching JWKS from ${jwksUrl}`); - const response = await fetch(jwksUrl); - if (!response.ok) { - const body = await response.text(); - throw new Error(`JWKS fetch failed: ${response.status} ${body}`); - } - const result = JwksSchema.safeParse(await response.json()); - if (!result.success) { - const error = z.prettifyError(result.error); - console.error("Error parsing JWKS", error); - throw new Error("Invalid JWKS"); - } - ServerEnv.publicKey = result.data.keys[0]; - ServerEnv.publicKeyFetchedAt = Date.now(); - return ServerEnv.publicKey; + const jwksUrl = ServerEnv.jwtIssuer() + "/.well-known/jwks.json"; + console.log(`Fetching JWKS from ${jwksUrl}`); + try { + const response = await fetch(jwksUrl); + if (!response.ok) { + const body = await response.text(); + throw new Error(`JWKS fetch failed: ${response.status} ${body}`); + } + const result = JwksSchema.safeParse(await response.json()); + if (!result.success) { + const error = z.prettifyError(result.error); + console.error("Error parsing JWKS", error); + throw new Error("Invalid JWKS"); + } + ServerEnv.publicKey = result.data.keys[0]; + ServerEnv.publicKeyFetchedAt = Date.now(); + return ServerEnv.publicKey; + } catch (e) { + // Refresh failed. Keep serving the stale key if we have one so a + // transient IdP outage does not break all token verification. + if (ServerEnv.publicKey) { + console.error("JWKS refresh failed, using stale cached key", e); + return ServerEnv.publicKey; + } + throw e; + } }One small follow-on: when the TTL has passed and many sockets connect at once, each call runs its own
fetch(a small thundering-herd). If that ever matters, store the in-flightPromise<JWK>and share it. Optional, not blocking.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/server/ServerEnv.ts` around lines 96 - 119, The jwkPublicKey() refresh path currently throws on JWKS fetch/parsing failures even when a valid ServerEnv.publicKey is still cached. Update ServerEnv.jwkPublicKey() to try refreshing after the TTL, but if fetch or parsing fails, fall back to the existing cached key instead of failing token verification. Keep the existing cache update behavior when refresh succeeds, and use the cached key only when it is present so verifyClientToken can continue working during IdP outages.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/server/GameServer.ts`:
- Around line 729-733: The race-path cleanup in GameServer leaves a disconnected
client’s socket in this.websockets even after removing it from activeClients,
which can leave dead entries behind. Update the same cleanup branch in
GameServer’s end/addListeners flow to also delete client.ws from
this.websockets, mirroring the normal "close" handler so abandoned sockets are
fully removed.
---
Outside diff comments:
In `@src/server/ServerEnv.ts`:
- Around line 96-119: The jwkPublicKey() refresh path currently throws on JWKS
fetch/parsing failures even when a valid ServerEnv.publicKey is still cached.
Update ServerEnv.jwkPublicKey() to try refreshing after the TTL, but if fetch or
parsing fails, fall back to the existing cached key instead of failing token
verification. Keep the existing cache update behavior when refresh succeeds, and
use the cached key only when it is present so verifyClientToken can continue
working during IdP outages.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 596d489d-e5db-4cdd-8b5d-acd2a5575e31
📒 Files selected for processing (11)
src/client/ClientGameRunner.tssrc/client/LocalPersistantStats.tssrc/client/MultiTabDetector.tssrc/client/hud/GameRenderer.tssrc/client/render/gl/passes/StructurePass.tssrc/core/execution/SpawnExecution.tssrc/server/GameServer.tssrc/server/Master.tssrc/server/ServerEnv.tssrc/server/Worker.tstests/RandomSpawnNoOverride.test.ts
| if ( | ||
| this.activeClients.length < (this.gameConfig.maxPlayers ?? Infinity) | ||
| ) { | ||
| this.hasReachedMaxPlayerCount = false; | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Mirror the websocket-set cleanup in this race path.
Line 723 removes the client from activeClients, but this branch still leaves client.ws inside this.websockets. If the socket was already closed before addListeners() ran, end() keeps iterating a dead entry and the set can grow with abandoned sockets. Add the same delete used in the normal "close" handler.
Proposed fix
if (client.ws.readyState >= 2) {
this.log.info("client WebSocket already closing/closed, removing", {
clientID: client.clientID,
readyState: client.ws.readyState,
});
+ this.websockets.delete(client.ws);
this.activeClients = this.activeClients.filter(
(c) => c.clientID !== client.clientID,
);
// Remove persistentId if the game has not started to prevent going over max players
if (!this._hasStarted) {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if ( | |
| this.activeClients.length < (this.gameConfig.maxPlayers ?? Infinity) | |
| ) { | |
| this.hasReachedMaxPlayerCount = false; | |
| } | |
| if (client.ws.readyState >= 2) { | |
| this.log.info("client WebSocket already closing/closed, removing", { | |
| clientID: client.clientID, | |
| readyState: client.ws.readyState, | |
| }); | |
| this.websockets.delete(client.ws); | |
| this.activeClients = this.activeClients.filter( | |
| (c) => c.clientID !== client.clientID, | |
| ); | |
| // Remove persistentId if the game has not started to prevent going over max players | |
| if (!this._hasStarted) { |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/server/GameServer.ts` around lines 729 - 733, The race-path cleanup in
GameServer leaves a disconnected client’s socket in this.websockets even after
removing it from activeClients, which can leave dead entries behind. Update the
same cleanup branch in GameServer’s end/addListeners flow to also delete
client.ws from this.websockets, mirroring the normal "close" handler so
abandoned sockets are fully removed.
Verified the findings from a code audit against the real code and fixed the legitimate, well-scoped ones. Each fix below was confirmed by reading the actual code (and, for the worker-respawn bug, by running an actual Node cluster experiment).
Fixes
Server
Master.ts) —worker.process.envisundefinedin the clusterexithandler, soWORKER_IDwas always undefined → every crash logged "could not find id" and returned without respawning. Now tracked via aworker.id → WORKER_IDregistry.GameServer.ts) —phase()ignored_hasEnded, so a host-left lobby stayedLobbyand wasn't GC'd until max duration. ReturnsFinishednow.websocketsSet kept dead sockets (GameServer.ts) — never removed on close/rejoin; added deletes.hasReachedMaxPlayerCountlatched forever (GameServer.ts) — a leave-after-full committed a degenerate under-capacity start (notably ranked 1v1). Reset on leave when back under capacity.Worker.ts) — pre-join sockets are untracked by any GameServer → Slowloris-style FD exhaustion. Added a 30s join-reaper.ServerEnv.ts) — IdP key rotation would break all logins until restart. Added a 1h TTL.Core
SpawnExecution.ts) — a forgedspawnintent setthis.tile, whichgetSpawn(center)honored. Now ignored in random-spawn mode. Addedtests/RandomSpawnNoOverride.test.ts.Client
addEventListener(..., x.bind(this))thenremoveEventListener(..., x.bind(this))use different refs; bind once.destroy()wired intodisposeRenderer.localStoragecrash in sandboxed iframes (LocalPersistantStats.ts) — accessinglocalStoragethrows in sandboxed portals (e.g. CrazyGames); the old=== undefinedguard threw too. Wrapped in try/catch.Float32Arrayalloc (StructurePass.draw()) — hoisted to reused fields (matches existingghostBuf).Known gap, intentionally not fixed here
/api/archive_singleplayer_gameis unauthenticated and trusts the client-suppliedpersistentID. Documented inline with the recommended fix (require a Bearer JWT, stamppersistentIDfrom the verified token). Left for a follow-up pending a product decision on logged-out singleplayer users.Deliberately rejected as false positives
The audit's entire "floating-point divergence" determinism section (attack math, nuke
atan2, bezier, A* tie-breakers, RNG seeds, Set/Map iteration) — JS IEEE-754 math is deterministic, the sim already relies onMath.pow/Math.expevery tick, and several proposed fixes would have changed deterministic outputs and broken replays. Also rejected: desync-relay (misreads the intent architecture), WebGL VAO "leak" (documented intentional — context is torn down), and a handful of not-reachable / already-bounded items.Verification
tsc --noEmit✓eslint✓vitest— 1634/1634 tests pass (147 files), including the new test