fix(ci): watchdog polls + exits early instead of sleeping the full timeout#50
Merged
Merged
Conversation
…timeout The Queue Watchdog `sleep $((QUEUE_TIMEOUT_MINUTES*60))`d unconditionally on every _ci-gate.yml run — a flat 10 paid runner-minutes per run on hosted runners, for a stuck-`queued` scenario that only occurs on self-hosted runners that never pick up a job. On GitHub-hosted runners siblings schedule in seconds. Now it polls every 15s and exits the instant no sibling is `queued`; it only cancels jobs still stuck in `queued` after the timeout. Same protection for self-hosted runners, but hosted-runner consumers exit in seconds instead of burning the full timeout every run. The watchdog job checks this script out from `main` at runtime (ref: main), so every consumer's next run picks up the fix on merge — no per-repo change needed. Assisted-by: Claude:claude-opus-4-8 Claude-Session: https://claude.ai/code/session_01WfUaGNSoryQJduufUVWdnt
There was a problem hiding this comment.
Code Review
This pull request optimizes the CI gate watchdog script by replacing a fixed sleep with a polling loop that exits as soon as no sibling jobs are queued, preventing unnecessary runner time usage. Feedback on the changes points out a potential crash if QUEUE_TIMEOUT_MINUTES is a float, as bash arithmetic does not support floats, and suggests optimizing the loop by using the bash built-in $SECONDS variable instead of spawning a subshell with date +%s on every iteration.
The poll loop replaced main's awk-based parse with bash $(( )) arithmetic, which crashes on a non-integer QUEUE_TIMEOUT_MINUTES (the workflow input is type: number and accepts floats, e.g. 0.5). Restore the awk parse so floats truncate to an integer instead of erroring. Also use the bash builtin $SECONDS for the deadline check instead of calling $(date +%s) once per poll iteration, dropping a subshell per loop. Assisted-by: Claude:claude-opus-4-8[1m] Claude-Session: https://claude.ai/code/session_014Rk2eavCBLAp3HReZD37R4
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The
_ci-gate.ymlQueue Watchdog ransleep $((QUEUE_TIMEOUT_MINUTES*60))unconditionally — a flat 10 paid runner-minutes per run onubuntu-latest, on every push of every consumer. The stuck-queuedscenario it guards against only happens on self-hosted runners that never pick up a job; on GitHub-hosted runners siblings schedule within seconds, so the sleep was pure waste (and it held each run "in progress" ~10 min).Spotted on a private consumer (
VisiCore/vct-splunk-cli) where those minutes are billed.Fix
Poll the run's sibling jobs every 15s and exit the instant none is
queued(everything got scheduled). Only cancel jobs still stuck inqueuedonceQUEUE_TIMEOUT_MINUTESelapses. Self-hosted protection is unchanged; hosted-runner consumers now exit in seconds.Because the watchdog job sparse-checks this script out from
mainat runtime (ref: main), every consumer's next run picks up the fix on merge — no per-repo change required.Validated:
bash -n+shellcheckclean.🤖 Generated with Claude Code