fix(ci): watchdog polls + exits early instead of sleeping the full timeout #50

Merged
JacobPEvans-personal merged 2 commits into main from fix/ci-gate-watchdog-early-exit on Jun 27, 2026

@JacobPEvans-personal

Problem

The `_ci-gate.yml` Queue Watchdog ran `sleep $((QUEUE_TIMEOUT_MINUTES*60))` unconditionally: a flat 10 paid runner-minutes per run on `ubuntu-latest`, on every push of every consumer. The stuck-`queued` scenario it guards against only happens on self-hosted runners that never pick up a job. On GitHub-hosted runners, sibling jobs are scheduled within seconds, so the sleep was pure waste (and it held each run "in progress" for about 10 minutes).

Spotted on a private consumer (VisiCore/vct-splunk-cli) where those minutes are billed.

Fix

Poll the run's sibling jobs every 15s and exit as soon as none is `queued` (everything has been scheduled). Only cancel jobs still stuck in `queued` once `QUEUE_TIMEOUT_MINUTES` elapses. Self-hosted protection is unchanged; hosted-runner consumers now exit in seconds.
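A minimal sketch of the poll-and-exit-early loop described above, assuming a hypothetical `count_queued` helper that reports how many sibling jobs are still `queued`. It is stubbed here with a `FAKE_QUEUED` variable so the loop can be exercised offline. The function names, the stub, and the return-code convention are illustrative assumptions, not the contents of the merged script.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the poll-and-exit-early pattern; not the merged script.

# count_queued: prints the number of sibling jobs still in "queued".
# Stubbed via FAKE_QUEUED (an assumption for illustration); a real version
# would query the workflow run's jobs instead.
count_queued() {
  echo "${FAKE_QUEUED:-0}"
}

# watchdog_poll TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Returns 0 as soon as no job is queued.
# Returns 1 if jobs are still queued once the deadline has passed.
watchdog_poll() {
  local timeout=$1
  local interval=${2:-15}
  local deadline=$(( SECONDS + timeout ))

  while (( $(count_queued) > 0 )); do
    if (( SECONDS >= deadline )); then
      return 1   # the caller would cancel the stuck jobs at this point
    fi
    sleep "$interval"
  done
  return 0
}
```

With `FAKE_QUEUED=0` the function returns immediately, which mirrors the hosted-runner case. A nonzero value keeps it polling until the deadline, which mirrors a self-hosted runner that never picks up the job.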

Because the watchdog job sparse-checks this script out from `main` at runtime (`ref: main`), every consumer's next run picks up the fix on merge, with no per-repo change required.

Validated: `bash -n` + `shellcheck` clean.

🤖 Generated with Claude Code

…timeout

The Queue Watchdog ran `sleep $((QUEUE_TIMEOUT_MINUTES*60))` unconditionally on
every _ci-gate.yml run: a flat 10 paid runner-minutes per run on hosted
runners, for a stuck-`queued` scenario that only occurs on self-hosted runners
that never pick up a job. On GitHub-hosted runners siblings schedule in seconds.

Now it polls every 15s and exits the instant no sibling is `queued`; it only
cancels jobs still stuck in `queued` after the timeout. Same protection for
self-hosted runners, but hosted-runner consumers exit in seconds instead of
burning the full timeout every run.

The watchdog job checks this script out from `main` at runtime (ref: main), so
every consumer's next run picks up the fix on merge — no per-repo change needed.

Assisted-by: Claude:claude-opus-4-8
Claude-Session: https://claude.ai/code/session_01WfUaGNSoryQJduufUVWdnt

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes the CI gate watchdog script by replacing a fixed sleep with a polling loop that exits as soon as no sibling jobs are queued, preventing unnecessary runner time usage. Feedback on the changes points out a potential crash if QUEUE_TIMEOUT_MINUTES is a float, as bash arithmetic does not support floats, and suggests optimizing the loop by using the bash built-in $SECONDS variable instead of spawning a subshell with date +%s on every iteration.

Comment thread scripts/ci-gate-watchdog.sh Outdated
The poll loop replaced main's awk-based parse with bash $(( )) arithmetic,
which crashes on a non-integer QUEUE_TIMEOUT_MINUTES (the workflow input is
type: number and accepts floats, e.g. 0.5). Restore the awk parse so floats
truncate to an integer instead of erroring.

Also use the bash builtin $SECONDS for the deadline check instead of calling
$(date +%s) once per poll iteration, dropping a subshell per loop.

Assisted-by: Claude:claude-opus-4-8[1m]
Claude-Session: https://claude.ai/code/session_014Rk2eavCBLAp3HReZD37R4
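
The two review fixes above can be sketched as follows. This is an illustration under stated assumptions: the helper name `minutes_to_seconds` is hypothetical, and truncating after multiplying by 60 is a guess at the rounding rule, since the page does not show the actual awk expression used on `main`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; the function name and rounding rule are assumptions.

# minutes_to_seconds MINUTES
# Converts a possibly fractional minute count to whole seconds with awk,
# because bash $(( )) arithmetic only accepts integers and errors on "0.5".
minutes_to_seconds() {
  awk -v m="$1" 'BEGIN { print int(m * 60) }'
}

timeout_seconds=$(minutes_to_seconds "${QUEUE_TIMEOUT_MINUTES:-10}")

# SECONDS is a bash builtin that counts seconds since the shell started,
# so the deadline check needs no $(date +%s) subshell per iteration.
deadline=$(( SECONDS + timeout_seconds ))
```

Inside a poll loop, the check then becomes `(( SECONDS >= deadline ))`, which is evaluated by the shell itself without spawning a process.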
@JacobPEvans-personal JacobPEvans-personal merged commit feebd66 into main Jun 27, 2026
1 check passed