fix: return ok=false from cache lookup when storage is missing (follow-up to #223)#230

Open
mpecan wants to merge 1 commit into falcondev-oss:dev from mpecan:dev

Conversation

Contributor

@mpecan mpecan commented Apr 18, 2026

Follow-up to #223, which returned 404 from /download/:cacheEntryId when the backing storage object was missing. That fix resolved the 500/hang for clients that retry on 5xx and fall back on 4xx — notably the @actions/cache library used by GitHub Actions workflows.

Problem

Container-build tooling (BuildKit's GHA cache-import, at minimum) treats any non-200 response on the download URL as a hard build failure, not a cache miss. Concretely:

```
COPY api/ api/
ERROR: failed to copy: GET http://gha-cache-server/download/<uuid>
RESPONSE 404: 404 Server Error
```

The workflow aborts instead of rebuilding from source. After #223 landed, workflows that previously hung now fail fast with the same net effect: broken builds whenever an entry's storage is out-of-sync with the DB (bucket wipe, lifecycle-deleted object, unfinalized upload).

The cache-miss signal that BuildKit (and any spec-conforming client) actually observes is ok: false from the twirp GetCacheEntryDownloadURL lookup — not a 404 on the download URL it was handed. As long as the lookup returns a download URL, the client has committed to downloading, and any error at that stage is a failure.
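The contract above can be made concrete with a small sketch. This is illustrative client-side logic, not BuildKit's actual code: the `restore` function and its parameters are hypothetical, and only the decision structure (miss at lookup time vs. hard failure at download time) mirrors the behavior described.

```typescript
// The only cache-miss signal the client observes is ok=false from the
// lookup. Once a download URL has been granted, any error while fetching
// it is a hard failure, not a miss.
type LookupResult = { ok: true; downloadUrl: string } | { ok: false };

function restore(
  lookup: () => LookupResult,
  download: (url: string) => Uint8Array | null, // null models a 404
): "hit" | "miss" {
  const res = lookup();
  if (!res.ok) return "miss"; // clean cache miss: client builds from source
  const blob = download(res.downloadUrl);
  if (blob === null) {
    // URL was granted but the object is gone: the client has already
    // committed to downloading, so this aborts the build.
    throw new Error(`failed to copy: GET ${res.downloadUrl} RESPONSE 404`);
  }
  return "hit";
}

// With this PR, a missing object surfaces as { ok: false } at lookup time:
const outcome = restore(() => ({ ok: false }), () => null);

// Whereas a URL that later 404s (the pre-PR behavior) is fatal:
let downloadFailed = false;
try {
  restore(() => ({ ok: true, downloadUrl: "/download/x" }), () => null);
} catch {
  downloadFailed = true;
}
```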

Solution

Validate storage availability in Storage.getCacheEntryWithDownloadUrl before handing back a download URL. If the matched entry's storage is missing, purge the stale cache_entries row and re-run matchCacheEntry so restore keys can still fall through to a valid candidate. If no valid match remains, return undefined → the route returns { ok: false } → the client treats it as a clean cache miss.
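A minimal sketch of that loop, with the real server's types and helpers replaced by stand-in callbacks (`matchCacheEntry`, `storageHasData`, `deleteEntry` here are parameters, not the actual signatures); only the control flow reflects the change:

```typescript
interface Entry { id: string; key: string; location: string }

const MAX_ATTEMPTS = 10; // bounded retry, matching the PR

function getCacheEntryWithDownloadUrl(
  matchCacheEntry: () => Entry | undefined, // best match across key + restore keys
  storageHasData: (location: string) => boolean,
  deleteEntry: (id: string) => void, // purges the stale cache_entries row
): Entry | undefined {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const entry = matchCacheEntry();
    if (!entry) return undefined; // no candidate left -> route returns { ok: false }
    if (storageHasData(entry.location)) return entry; // first valid match wins
    deleteEntry(entry.id); // self-heal, then re-match so restore keys fall through
  }
  return undefined;
}

// Simulate a stale best match followed by a valid restore-key candidate:
const entries: Entry[] = [
  { id: "stale", key: "exact", location: "missing" },
  { id: "valid", key: "restore", location: "present" },
];
const result = getCacheEntryWithDownloadUrl(
  () => entries[0],
  (loc) => loc === "present",
  (id) => { entries.splice(entries.findIndex((e) => e.id === id), 1); },
);
```

The stale entry is purged on the first iteration, so the second iteration's re-match lands on the valid restore-key candidate.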

This also self-heals the DB proactively rather than leaving cleanup to the 30-day cleanup:cache-entries task, so repeated workflow runs don't keep rediscovering the same stale row.

Changes

  • New StorageAdapter.objectExists(objectName) on S3 (HeadObjectCommand), filesystem (fs.access), and GCS (file.exists()). Used for merged-blob existence checks — cheaper than countFilesInFolder on a single object.
  • Storage.storageHasData(location) private helper: returns true iff mergedAt is set and the merged blob exists, or no merge has happened and partCount matches what's on disk. partsDeletedAt with no mergedAt implies data loss → returns false.
  • Storage.getCacheEntryWithDownloadUrl now loops (bounded at 10 attempts): on each iteration it matches, validates the match's storage, and — if invalid — deletes the cache_entries row and retries. First valid match wins; no valid match → undefined.
  • Test: falls back to a valid restore key when the best match has missing storage — saves a stale entry, wipes storage, saves a valid entry, then asserts that a restore with the stale key listed first still returns the valid one. Exercises the retry loop and the cross-key fallback.
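The two new checks can be sketched as follows. The filesystem `objectExists` uses the sync `fs` variant so the example is self-contained (the PR uses async `fs.access`; S3 and GCS use `HeadObjectCommand` and `file.exists()` respectively), and `storageHasData` is reduced to its decision table over field names assumed from the description, not the actual schema:

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Filesystem flavor of StorageAdapter.objectExists (sketch).
function objectExists(objectName: string): boolean {
  try {
    fs.accessSync(objectName, fs.constants.R_OK);
    return true;
  } catch {
    return false;
  }
}

// Assumed shape of a storage_locations row, for illustration only.
interface LocationState {
  mergedAt: Date | null;
  partsDeletedAt: Date | null;
  partCount: number;
}

// Decision table of Storage.storageHasData as described above.
function storageHasData(
  loc: LocationState,
  mergedBlobExists: boolean,
  partsOnDisk: number,
): boolean {
  if (loc.mergedAt !== null) return mergedBlobExists; // merged: blob must exist
  if (loc.partsDeletedAt !== null) return false;      // parts gone, never merged: data loss
  return partsOnDisk === loc.partCount;               // unmerged: all parts present
}

// Exercise objectExists against a real temp file:
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "cache-"));
fs.writeFileSync(path.join(dir, "merged.bin"), "data");
const present = objectExists(path.join(dir, "merged.bin"));
const absent = objectExists(path.join(dir, "gone.bin"));

// And the data-loss branch of storageHasData:
const lost = storageHasData(
  { mergedAt: null, partsDeletedAt: new Date(), partCount: 4 },
  false,
  0,
);
```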

Downsides / trade-offs

  • One extra storage call per lookup (HEAD for merged, LIST for parts). Negligible on S3/GCS; slightly more noticeable on filesystem but still sub-millisecond.
  • Self-healing on the read path (DELETE FROM cache_entries) was deliberately avoided in #223 ("fix: return 404 for stale cache entries with missing storage objects"). Reintroducing it here is the minimum needed to keep a single lookup well-formed for clients that can't retry at a higher level — BuildKit chief among them. The cleanup task remains the backstop for orphaned storage_locations.
  • Bounded retry (10) guards against a pathological scan if a flood of stale rows exists for the same prefix. In practice, one or two iterations suffice.

Testing

Representative server log confirming the fix on a real BuildKit request (Garage-backed S3, post-bucket-wipe):

```
[warn] Cache entry 7cd91bc9-... (setup-rust-linux-x86_64-cargo) points at missing storage, purging.
[debug] Response: POST /twirp/github.actions.results.api.v1.CacheService/GetCacheEntryDownloadURL > 200 {"ok":false}
```

The workflow then builds from source as expected instead of aborting on 404.

Notes

BuildKit treats a 404 on /download/<id> as a hard build failure, not a
cache miss. Returning the 404 at download time (previous fix) still
breaks builds because the twirp lookup hands BuildKit a download URL
that later 404s.

Validate storage has the data in getCacheEntryWithDownloadUrl: if the
matched entry's merged blob is missing (or parts folder is incomplete),
purge the stale cache_entries row and retry matching. BuildKit then
sees ok=false and treats it as a cache miss, building normally.

Adds objectExists to StorageAdapter (S3 HeadObject, fs.access, GCS
exists) and a restore-key fallback test covering the retry loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
