ci: defensive hardening (timeouts + bounded log-dump + dispatch macOS-only)#8
Merged
Conversation
…cOS-only Tier-0 fixes for the self-hosted-runner stalls hit this session, all config, no external watchdog needed: - timeout-minutes on every build job (linux 60, windows 75, macos 70, benchmarks 60; release.yml mirrors). A hung step can no longer occupy a self-hosted runner indefinitely — GitHub kills the job and frees the runner. Directly prevents the 2h+ wedge a hung vcpkg log-dump caused. - The "Dump vcpkg logs on failure" steps get timeout-minutes: 3 + continue-on-error: true. That diagnostic step is exactly what hung for 2h; a failure handler must never outlive its purpose or block the job. - workflow_dispatch now runs the macOS leg ONLY: linux + windows gain `if: github.event_name != 'workflow_dispatch'`. Dispatch exists solely to validate the billed macOS leg on a branch; previously it also ran redundant Linux/Windows legs that competed for the runners (a cause of today's queue starvation). macos keeps its push|dispatch guard; benchmarks is push+main. Workflows validated (yaml parse + per-job timeout/guard check). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Tier-0 defensive CI from the runner-stall incidents this session. All config, no external watchdog.
Changes
timeout-minuteson every build job (linux 60, windows 75, macos 70, benchmarks 60; release.yml mirrors). A hung step can't occupy a self-hosted runner forever — GitHub kills it, runner frees. Prevents the 2h+ wedge.Dump vcpkg logs on failuregetstimeout-minutes: 3+continue-on-error: true. That step is exactly what hung 2h; a failure handler must never block the job.linux/windowsgainif: github.event_name != 'workflow_dispatch'. Dispatch exists only to validate the billed macOS leg; it previously also ran redundant Linux/Windows legs that starved the runners (a cause of today's queueing).Why config over a watchdog service
These prevent every stall hit this session with ~20 lines and zero ongoing cost / no host access. A host-side systemd watchdog is only needed for the rare true process-wedge (deferred).
Validated: both workflows parse; per-job timeout + guard confirmed.
🤖 Generated with Claude Code