Fix golden cache codepath bugs and raise BuildKit GC limit #1788

Merged
lukemarsden merged 1 commit into main from fix/golden-cache-codepath-bugs on Mar 2, 2026
@lukemarsden
Collaborator

Summary

  • Root cause of golden cache misses identified: BuildKit doesn't refresh LastUsedAt on cache hits, so golden build entries look "stale" to GC and get evicted when session builds temporarily push the cache over the default 93 GiB limit. Raised the GC limit to 300 GiB so GC never triggers.
  • Fixed race condition: PromoteSessionToGolden could rename the golden directory while SetupGoldenCopy was reading from it. Added per-project RWMutex (read lock for copies, write lock for promotion).
  • Added golden cache versioning: golden-version.json is written on each promotion with the generation number, timestamp, and session ID. The version is logged during copy so containers can identify which golden cache they're running from.
  • Removed dead code: GoldenBuildRunning/SetGoldenBuildRunning file-based lock was written but never read (golden build service uses its own in-memory map).
  • Fixed silent error handling: .golden-build-result removal now checks and logs errors instead of ignoring them (failed removal causes premature promotion).
  • Added early termination: parallelCopyDir stops launching new cp jobs once one fails.
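
For readers unfamiliar with the locking pattern in the second bullet, here is a minimal sketch of a per-project RWMutex registry. It is not the code from this PR: the `projectLocks` type and the `withCopyLock`/`withPromoteLock` helpers are hypothetical names chosen for illustration. The idea is that copies (as in SetupGoldenCopy) share a read lock, while promotion (as in PromoteSessionToGolden) takes the write lock and so cannot rename the directory mid-copy.

```go
package main

import "sync"

// projectLocks hands out one RWMutex per project ID.
// Hypothetical helper; the PR's actual type and field names are not shown.
type projectLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.RWMutex
}

func newProjectLocks() *projectLocks {
	return &projectLocks{locks: make(map[string]*sync.RWMutex)}
}

// get returns the lock for projectID, creating it on first use.
func (p *projectLocks) get(projectID string) *sync.RWMutex {
	p.mu.Lock()
	defer p.mu.Unlock()
	l, ok := p.locks[projectID]
	if !ok {
		l = &sync.RWMutex{}
		p.locks[projectID] = l
	}
	return l
}

// withCopyLock runs fn under the shared read lock, so several
// copies of the same golden directory may proceed concurrently.
func (p *projectLocks) withCopyLock(projectID string, fn func() error) error {
	l := p.get(projectID)
	l.RLock()
	defer l.RUnlock()
	return fn()
}

// withPromoteLock runs fn under the exclusive write lock, so a
// promotion waits for in-flight copies and blocks new ones.
func (p *projectLocks) withPromoteLock(projectID string, fn func() error) error {
	l := p.get(projectID)
	l.Lock()
	defer l.Unlock()
	return fn()
}
```

Keying the lock by project means a promotion in one project does not stall copies in another.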

Test plan

  • Deploy to sandbox, trigger a golden build, verify golden-version.json is written
  • Start a session, verify logs show golden generation/session ID during copy
  • Verify BuildKit GC config is written to buildkitd.toml with 300 GiB limit on API restart
  • Verify subsequent sessions get improved cache hits with the higher GC limit
  • go build ./api/pkg/hydra/ ./api/pkg/services/ ./api/pkg/external-agent/ ./api/pkg/server/ passes

🤖 Generated with Claude Code

Root cause of cache misses: BuildKit doesn't refresh LastUsedAt on cache
hits, so golden build entries look "stale" to GC and get evicted when
later session builds temporarily push the cache over the default 93 GiB
limit. Raise to 300 GiB so GC never runs.

Golden cache codepath fixes:
- Remove dead GoldenBuildRunning/SetGoldenBuildRunning file lock (written
  but never read; golden_build_service uses in-memory map instead)
- Add per-project RWMutex to prevent race between PromoteSessionToGolden
  and SetupGoldenCopy (promotion could rename dir mid-copy)
- Add golden-version.json with generation number, timestamp, session ID
  so containers can identify which golden cache they're running from
- Check .golden-build-result removal errors instead of silently ignoring
  (failed removal causes premature promotion)
- Short-circuit parallelCopyDir on first error instead of continuing to
  launch doomed cp jobs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lukemarsden lukemarsden merged commit 9bdf57d into main Mar 2, 2026
3 checks passed
@lukemarsden lukemarsden deleted the fix/golden-cache-codepath-bugs branch March 2, 2026 16:07