Summary
Reviewer agent timeout failures are classified as retryable_transient, but affected loops still end in failed and stop retrying after about three failed attempts, even though the queue item has max_attempts = 5 and the configured reviewer agent runtime is much larger.
Evidence
From recent failed reviewer loops:
Local config has:
"agent": {
  "timeouts": {
    "reviewerSeconds": 21600
  }
}
So these are not reaching the configured 6h reviewer runtime. They appear to be agent-runtime/turn timeouts around 12–15 minutes.
Loop metadata for several cases shows the reviewer loop budget terminating the loop after three consecutive failures:
loop.status = failed
loop.terminationReason = max_consecutive_failures
loop.consecutiveFailures = 3
Root cause
The queue-level retry budget and reviewer loop failure budget are inconsistent:
- failQueueItem retries retryable failures when attempts < max_attempts.
- But reviewer loop metadata increments consecutiveFailures on every failed run.
- After MaxConsecutiveFailures (default/current effective value appears to be 3), the loop becomes terminal and any queued retry can be failed/cancelled before exhausting queue_items.max_attempts.
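To make the interaction between the two budgets concrete, here is a minimal Go sketch. It is not the code in internal/reviewer/runner.go: the `budgets` type and `runsBeforeStop` function are hypothetical names, and the sketch assumes every run fails with a retryable failure and that the loop budget is evaluated before the queue retry check.

```go
package main

import "fmt"

// budgets is a hypothetical stand-in for the two independent limits
// described above; it is not a type from the real codebase.
type budgets struct {
	MaxAttempts            int // mirrors queue_items.max_attempts
	MaxConsecutiveFailures int // mirrors the reviewer loop failure budget
}

// runsBeforeStop simulates runs that always fail with a retryable failure.
// It returns how many runs happen before retrying stops, and which budget
// stopped it. The loop budget is checked first (an assumption).
func runsBeforeStop(b budgets) (int, string) {
	attempts, consecutive := 0, 0
	for {
		attempts++
		consecutive++ // incremented on every failed run
		if consecutive >= b.MaxConsecutiveFailures {
			return attempts, "max_consecutive_failures"
		}
		if attempts >= b.MaxAttempts {
			return attempts, "max_attempts"
		}
	}
}

func main() {
	n, reason := runsBeforeStop(budgets{MaxAttempts: 5, MaxConsecutiveFailures: 3})
	fmt.Printf("stopped after %d runs: %s\n", n, reason)
}
```

With the values from this report (max_attempts = 5, loop budget 3), the loop budget always binds first, which matches the observed loop.terminationReason.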
Important code locations:
internal/reviewer/runner.go:1048-1119 — failure path records loop failure metadata, updates loop state, and can force terminal failure
internal/reviewer/runner.go:3180-3197 — queue retry checks isRetryableFailure(kind) && nextAttempts < queueItem.MaxAttempts
internal/reviewer/runner.go:3703-3715 — increments failureCount / consecutiveFailures
internal/reviewer/runner.go:3681-3690 — terminal loop status prevents further processing
internal/runtime/runtime.go:1572-1619 — auto recovery whitelist does not recover generic retryable_transient timeouts
Expected behavior
Retryable transient timeouts should not be converted into terminal reviewer loops before the configured retry budget is exhausted.
At minimum, the effective stopping condition should be explainable and consistent in looper ps / logs.
Proposed fix
- Align MaxConsecutiveFailures with scheduler.retryMaxAttempts, or do not count retryable_transient agent timeouts against terminal consecutive-failure budget.
- Alternatively, allow failed reviewer loop auto-recovery for retryable_transient timeout failures while attempts remain.
- Preserve and show the real timeout type/config (max_runtime vs idle, configured seconds, elapsed seconds) in the queue error summary.
- Investigate why the effective agent timeout is ~12–15 minutes despite reviewerSeconds = 21600.