fix: return ok=false from cache lookup when storage is missing (follow-up to #223)#230

Open
mpecan wants to merge 1 commit into falcondev-oss:dev from mpecan:dev

Conversation

Contributor

@mpecan mpecan commented Apr 18, 2026

Follow-up to #223, which returned 404 from /download/:cacheEntryId when the backing storage object was missing. That fix resolved the 500/hang for clients that retry on 5xx and fall back on 4xx — notably the @actions/cache library used by GitHub Actions workflows.

Problem

Container-build tooling (BuildKit's GHA cache-import, at minimum) treats any non-200 response on the download URL as a hard build failure, not a cache miss. Concretely:

```
COPY api/ api/
ERROR: failed to copy: GET http://gha-cache-server/download/<uuid>
RESPONSE 404: 404 Server Error
```

The workflow aborts instead of rebuilding from source. After #223 landed, workflows that previously hung now fail fast with the same net effect: broken builds whenever an entry's storage is out-of-sync with the DB (bucket wipe, lifecycle-deleted object, unfinalized upload).

The cache-miss signal that BuildKit (and any spec-conforming client) actually observes is ok: false from the twirp GetCacheEntryDownloadURL lookup — not a 404 on the download URL it was handed. As long as the lookup returns a download URL, the client has committed to downloading, and any error at that stage is a failure.
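The contract above can be made concrete with a small sketch. This is illustrative client-side logic, not BuildKit's actual code: the `restore` function and its parameters are hypothetical, and only the decision structure (miss at lookup time vs. hard failure at download time) mirrors the behavior described.

```typescript
// The only cache-miss signal the client observes is ok=false from the
// lookup. Once a download URL has been granted, any error while fetching
// it is a hard failure, not a miss.
type LookupResult = { ok: true; downloadUrl: string } | { ok: false };

function restore(
  lookup: () => LookupResult,
  download: (url: string) => Uint8Array | null, // null models a 404
): "hit" | "miss" {
  const res = lookup();
  if (!res.ok) return "miss"; // clean cache miss: client builds from source
  const blob = download(res.downloadUrl);
  if (blob === null) {
    // URL was granted but the object is gone: the client has already
    // committed to downloading, so this aborts the build.
    throw new Error(`failed to copy: GET ${res.downloadUrl} RESPONSE 404`);
  }
  return "hit";
}

// With this PR, a missing object surfaces as { ok: false } at lookup time:
const outcome = restore(() => ({ ok: false }), () => null);

// Whereas a URL that later 404s (the pre-PR behavior) is fatal:
let downloadFailed = false;
try {
  restore(() => ({ ok: true, downloadUrl: "/download/x" }), () => null);
} catch {
  downloadFailed = true;
}
```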

Solution

Validate storage availability in Storage.getCacheEntryWithDownloadUrl before handing back a download URL. If the matched entry's storage is missing, purge the stale cache_entries row and re-run matchCacheEntry so restore keys can still fall through to a valid candidate. If no valid match remains, return undefined → the route returns { ok: false } → the client treats it as a clean cache miss.
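A minimal sketch of that loop, with the real server's types and helpers replaced by stand-in callbacks (`matchCacheEntry`, `storageHasData`, `deleteEntry` here are parameters, not the actual signatures); only the control flow reflects the change:

```typescript
interface Entry { id: string; key: string; location: string }

const MAX_ATTEMPTS = 10; // bounded retry, matching the PR

function getCacheEntryWithDownloadUrl(
  matchCacheEntry: () => Entry | undefined, // best match across key + restore keys
  storageHasData: (location: string) => boolean,
  deleteEntry: (id: string) => void, // purges the stale cache_entries row
): Entry | undefined {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const entry = matchCacheEntry();
    if (!entry) return undefined; // no candidate left -> route returns { ok: false }
    if (storageHasData(entry.location)) return entry; // first valid match wins
    deleteEntry(entry.id); // self-heal, then re-match so restore keys fall through
  }
  return undefined;
}

// Simulate a stale best match followed by a valid restore-key candidate:
const entries: Entry[] = [
  { id: "stale", key: "exact", location: "missing" },
  { id: "valid", key: "restore", location: "present" },
];
const result = getCacheEntryWithDownloadUrl(
  () => entries[0],
  (loc) => loc === "present",
  (id) => { entries.splice(entries.findIndex((e) => e.id === id), 1); },
);
```

The stale entry is purged on the first iteration, so the second iteration's re-match lands on the valid restore-key candidate.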

This also self-heals the DB proactively rather than leaving cleanup to the 30-day cleanup:cache-entries task, so repeated workflow runs don't keep rediscovering the same stale row.

Changes

  • New StorageAdapter.objectExists(objectName) on S3 (HeadObjectCommand), filesystem (fs.access), and GCS (file.exists()). Used for merged-blob existence checks — cheaper than countFilesInFolder on a single object.
  • Storage.storageHasData(location) private helper: returns true iff mergedAt is set and the merged blob exists, or no merge has happened and partCount matches what's on disk. partsDeletedAt with no mergedAt implies data loss → returns false.
  • Storage.getCacheEntryWithDownloadUrl now loops (bounded at 10 attempts): on each iteration it matches, validates the match's storage, and — if invalid — deletes the cache_entries row and retries. First valid match wins; no valid match → undefined.
  • Test: falls back to a valid restore key when the best match has missing storage — saves a stale entry, wipes storage, saves a valid entry, then asserts that a restore with the stale key listed first still returns the valid one. Exercises the retry loop and the cross-key fallback.
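The two new checks can be sketched as follows. The filesystem `objectExists` uses the sync `fs` variant so the example is self-contained (the PR uses async `fs.access`; S3 and GCS use `HeadObjectCommand` and `file.exists()` respectively), and `storageHasData` is reduced to its decision table over field names assumed from the description, not the actual schema:

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Filesystem flavor of StorageAdapter.objectExists (sketch).
function objectExists(objectName: string): boolean {
  try {
    fs.accessSync(objectName, fs.constants.R_OK);
    return true;
  } catch {
    return false;
  }
}

// Assumed shape of a storage_locations row, for illustration only.
interface LocationState {
  mergedAt: Date | null;
  partsDeletedAt: Date | null;
  partCount: number;
}

// Decision table of Storage.storageHasData as described above.
function storageHasData(
  loc: LocationState,
  mergedBlobExists: boolean,
  partsOnDisk: number,
): boolean {
  if (loc.mergedAt !== null) return mergedBlobExists; // merged: blob must exist
  if (loc.partsDeletedAt !== null) return false;      // parts gone, never merged: data loss
  return partsOnDisk === loc.partCount;               // unmerged: all parts present
}

// Exercise objectExists against a real temp file:
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "cache-"));
fs.writeFileSync(path.join(dir, "merged.bin"), "data");
const present = objectExists(path.join(dir, "merged.bin"));
const absent = objectExists(path.join(dir, "gone.bin"));

// And the data-loss branch of storageHasData:
const lost = storageHasData(
  { mergedAt: null, partsDeletedAt: new Date(), partCount: 4 },
  false,
  0,
);
```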

Downsides / trade-offs

  • One extra storage call per lookup (HEAD for merged, LIST for parts). Negligible on S3/GCS; slightly more noticeable on filesystem but still sub-millisecond.
  • Self-healing on the read path (DELETE FROM cache_entries) was deliberately avoided in #223 ("fix: return 404 for stale cache entries with missing storage objects"). Reintroducing it here is the minimum needed to keep a single lookup well-formed for clients that can't retry at a higher level — BuildKit chief among them. The cleanup task remains the backstop for orphaned storage_locations.
  • Bounded retry (10) guards against a pathological scan if a flood of stale rows exists for the same prefix. In practice, one or two iterations suffice.

Testing

Representative server log confirming the fix on a real BuildKit request (Garage-backed S3, post-bucket-wipe):

```
[warn] Cache entry 7cd91bc9-... (setup-rust-linux-x86_64-cargo) points at missing storage, purging.
[debug] Response: POST /twirp/github.actions.results.api.v1.CacheService/GetCacheEntryDownloadURL > 200 {"ok":false}
```

The workflow then builds from source as expected instead of aborting on 404.

Notes

BuildKit treats a 404 on /download/<id> as a hard build failure, not a
cache miss. Returning the 404 at download time (previous fix) still
breaks builds because the twirp lookup hands BuildKit a download URL
that later 404s.

Validate storage has the data in getCacheEntryWithDownloadUrl: if the
matched entry's merged blob is missing (or parts folder is incomplete),
purge the stale cache_entries row and retry matching. BuildKit then
sees ok=false and treats it as a cache miss, building normally.

Adds objectExists to StorageAdapter (S3 HeadObject, fs.access, GCS
exists) and a restore-key fallback test covering the retry loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
