fix: return ok=false from cache lookup when storage is missing (follow-up to #223)#230
Open
mpecan wants to merge 1 commit intofalcondev-oss:devfrom
Open
fix: return ok=false from cache lookup when storage is missing (follow-up to #223)#230mpecan wants to merge 1 commit intofalcondev-oss:devfrom
mpecan wants to merge 1 commit intofalcondev-oss:devfrom
Conversation
BuildKit treats a 404 on /download/<id> as a hard build failure, not a cache miss. Returning the 404 at download time (previous fix) still breaks builds because the twirp lookup hands BuildKit a download URL that later 404s. Validate storage has the data in getCacheEntryWithDownloadUrl: if the matched entry's merged blob is missing (or parts folder is incomplete), purge the stale cache_entries row and retry matching. BuildKit then sees ok=false and treats it as a cache miss, building normally. Adds objectExists to StorageAdapter (S3 HeadObject, fs.access, GCS exists) and a restore-key fallback test covering the retry loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to #223, which returned 404 from
/download/:cacheEntryIdwhen the backing storage object was missing. That fix resolved the 500/hang for clients that retry on 5xx and fall back on 4xx — notably the@actions/cachelibrary used by GitHub Actions workflows.Problem
Container-build tooling (BuildKit's GHA cache-import, at minimum) treats any non-200 response on the download URL as a hard build failure, not a cache miss. Concretely:
The workflow aborts instead of rebuilding from source. After #223 landed, workflows that previously hung now fail fast with the same net effect: broken builds whenever an entry's storage is out-of-sync with the DB (bucket wipe, lifecycle-deleted object, unfinalized upload).
The cache-miss signal that BuildKit (and any spec-conforming client) actually observes is
ok: falsefrom the twirpGetCacheEntryDownloadURLlookup — not a 404 on the download URL it was handed. As long as the lookup returns a download URL, the client has committed to downloading, and any error at that stage is a failure.Solution
Validate storage availability in
Storage.getCacheEntryWithDownloadUrlbefore handing back a download URL. If the matched entry's storage is missing, purge the stalecache_entriesrow and re-runmatchCacheEntryso restore keys can still fall through to a valid candidate. If no valid match remains, returnundefined→ the route returns{ ok: false }→ the client treats it as a clean cache miss.This also self-heals the DB proactively rather than leaving cleanup to the 30-day
cleanup:cache-entriestask, so repeated workflow runs don't keep rediscovering the same stale row.Changes
StorageAdapter.objectExists(objectName)on S3 (HeadObjectCommand), filesystem (fs.access), and GCS (file.exists()). Used for merged-blob existence checks — cheaper thancountFilesInFolderon a single object.Storage.storageHasData(location)private helper: returnstrueiffmergedAtis set and the merged blob exists, or no merge has happened andpartCountmatches what's on disk.partsDeletedAtwith nomergedAtimplies data loss → returnsfalse.Storage.getCacheEntryWithDownloadUrlnow loops (bounded at 10 attempts): on each iteration it matches, validates the match's storage, and — if invalid — deletes thecache_entriesrow and retries. First valid match wins; no valid match →undefined.falls back to a valid restore key when the best match has missing storage— saves a stale entry, wipes storage, saves a valid entry, then asserts that a restore with the stale key listed first still returns the valid one. Exercises the retry loop and the cross-key fallback.Downsides / trade-offs
DELETE FROM cache_entries) was deliberately avoided in fix: return 404 for stale cache entries with missing storage objects #223. Reintroducing it here is the minimum needed to keep a single lookup well-formed for clients that can't retry at a higher level — BuildKit chief among them. The cleanup task remains the backstop for orphanedstorage_locations.Testing
pnpm run type-check✓pnpm run lint✓pnpm run test:run✓ — 8/8 pass including the new case and the two from fix: return 404 for stale cache entries with missing storage objects #223.Representative server log confirming the fix on a real BuildKit request (Garage-backed S3, post-bucket-wipe):
The workflow then builds from source as expected instead of aborting on 404.
Notes
{ ok: false }where (after fix: return 404 for stale cache entries with missing storage objects #223) the lookup would have succeeded and then the download would have 404'd.