reviewer: retryable agent timeouts stop after three failures despite higher queue retry budget #221

@mrcfps

Summary

Reviewer agent timeout failures are classified as retryable_transient, but affected loops still end in failed and stop retrying after three failed attempts, even though the queue item has max_attempts = 5. The runs themselves time out after roughly 12–15 minutes, far below the configured reviewer agent runtime.

Evidence

From recent failed reviewer loops:

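| loop | PR | queue status | attempts | max attempts | error kind | actual run duration |
|------|----|--------------|----------|--------------|------------|---------------------|
| 292 | nexu-io/open-design#381 | failed | 3 | 5 | retryable_transient | 15m22s |
| 317 | nexu-io/open-design#418 | failed | 3 | 5 | retryable_transient | 15m21s |
| 335 | nexu-io/open-design#439 | failed | 3 | 5 | retryable_transient | 12m44s |
| 343 | nexu-io/open-design#424 | failed | 3 | 5 | retryable_transient | 12m43s |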

Local config has:

"agent": {
  "timeouts": {
    "reviewerSeconds": 21600
  }
}

So these are not reaching the configured 6h reviewer runtime. They appear to be agent-runtime/turn timeouts around 12–15 minutes.
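
To put the gap in numbers, here is a small standalone Go sketch that compares the observed run durations from the table with the configured reviewerSeconds. The helper name fractionOfBudget is made up for illustration and is not part of the codebase.

```go
package main

import (
	"fmt"
	"time"
)

// fractionOfBudget reports what share of the configured timeout a run used.
// Hypothetical helper, for illustration only.
func fractionOfBudget(elapsed, configured time.Duration) float64 {
	return float64(elapsed) / float64(configured)
}

func main() {
	configured := 21600 * time.Second // reviewerSeconds from the config above

	// Observed durations copied from the evidence table.
	for _, s := range []string{"15m22s", "15m21s", "12m44s", "12m43s"} {
		elapsed, err := time.ParseDuration(s)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s of %s = %.1f%%\n",
			elapsed, configured, 100*fractionOfBudget(elapsed, configured))
	}
}
```

Every observed run ends after less than 5% of the configured budget, which supports the reading that a different, shorter timeout is in effect.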

Loop metadata for several cases shows the reviewer loop budget terminating the loop after three consecutive failures:

loop.status = failed
loop.terminationReason = max_consecutive_failures
loop.consecutiveFailures = 3

Root cause

The queue-level retry budget and reviewer loop failure budget are inconsistent:

  • failQueueItem retries retryable failures when attempts < max_attempts.
  • But reviewer loop metadata increments consecutiveFailures on every failed run.
  • After MaxConsecutiveFailures (default/current effective value appears to be 3), the loop becomes terminal and any queued retry can be failed/cancelled before exhausting queue_items.max_attempts.
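
The interaction of the two budgets can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not the actual runner.go logic: the budgets type and simulateTimeouts function are hypothetical, and it assumes every run fails with a retryable timeout and that the loop budget is checked independently of the queue budget.

```go
package main

import "fmt"

// budgets is a hypothetical stand-in for the two limits described above;
// the real fields and defaults in internal/reviewer/runner.go may differ.
type budgets struct {
	queueMaxAttempts       int // queue_items.max_attempts
	maxConsecutiveFailures int // reviewer loop failure budget
}

// simulateTimeouts models a loop whose every run fails with a retryable
// timeout. It returns how many attempts ran and which budget stopped it.
func simulateTimeouts(b budgets) (attempts int, stoppedBy string) {
	consecutiveFailures := 0
	for {
		attempts++
		consecutiveFailures++ // every failed run counts, retryable or not

		// Loop budget: applied regardless of remaining queue attempts.
		if consecutiveFailures >= b.maxConsecutiveFailures {
			return attempts, "max_consecutive_failures"
		}
		// Queue budget: retry only while attempts < max_attempts.
		if attempts >= b.queueMaxAttempts {
			return attempts, "max_attempts"
		}
	}
}

func main() {
	attempts, stoppedBy := simulateTimeouts(budgets{
		queueMaxAttempts:       5,
		maxConsecutiveFailures: 3,
	})
	fmt.Printf("attempts=%d stoppedBy=%s\n", attempts, stoppedBy)
	// prints: attempts=3 stoppedBy=max_consecutive_failures
}
```

With the values from the evidence (5 and 3), the stricter loop budget always wins, so the last two queue attempts are never used.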

Important code locations:

  • internal/reviewer/runner.go:1048-1119 — failure path records loop failure metadata, updates loop state, and can force terminal failure
  • internal/reviewer/runner.go:3180-3197 — queue retry checks isRetryableFailure(kind) && nextAttempts < queueItem.MaxAttempts
  • internal/reviewer/runner.go:3703-3715 — increments failureCount / consecutiveFailures
  • internal/reviewer/runner.go:3681-3690 — terminal loop status prevents further processing
  • internal/runtime/runtime.go:1572-1619 — auto recovery whitelist does not recover generic retryable_transient timeouts

Expected behavior

Retryable transient timeouts should not be converted into terminal reviewer loops before the configured retry budget is exhausted.

At minimum, the effective stopping condition should be explainable and consistent in looper ps / logs.
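
As one possible shape for an explainable stopping condition, the sketch below renders which budget ended the loop and how much of the queue budget was left. The stopInfo type, the explain function, and the message wording are all assumptions made for illustration; they do not reflect existing looper ps output.

```go
package main

import "fmt"

// stopInfo holds the counters shown in the evidence above. The struct is
// hypothetical and not part of looper.
type stopInfo struct {
	attempts               int
	maxAttempts            int
	consecutiveFailures    int
	maxConsecutiveFailures int
}

// explain describes which budget ended the loop and how many queue
// attempts were left unused at that point.
func explain(s stopInfo) string {
	unused := s.maxAttempts - s.attempts
	if unused < 0 {
		unused = 0
	}
	switch {
	case s.consecutiveFailures >= s.maxConsecutiveFailures:
		return fmt.Sprintf(
			"stopped by max_consecutive_failures (%d/%d); queue attempts used %d/%d, %d unused",
			s.consecutiveFailures, s.maxConsecutiveFailures,
			s.attempts, s.maxAttempts, unused)
	case s.attempts >= s.maxAttempts:
		return fmt.Sprintf("stopped by max_attempts (%d/%d)", s.attempts, s.maxAttempts)
	default:
		return fmt.Sprintf("retryable: %d queue attempts remain", unused)
	}
}

func main() {
	// Counters from loop 292 in the evidence table.
	fmt.Println(explain(stopInfo{
		attempts: 3, maxAttempts: 5,
		consecutiveFailures: 3, maxConsecutiveFailures: 3,
	}))
}
```

A line like this in the queue error summary would make it visible that the loop budget, not the queue budget, ended the retries.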

Proposed fix

  • Align MaxConsecutiveFailures with scheduler.retryMaxAttempts, or do not count retryable_transient agent timeouts against the terminal consecutive-failure budget.
  • Alternatively, allow failed reviewer loop auto-recovery for retryable_transient timeout failures while attempts remain.
  • Preserve and show the real timeout type/config (max_runtime vs idle, configured seconds, elapsed seconds) in the queue error summary.
  • Investigate why the effective agent timeout is ~12–15 minutes despite reviewerSeconds = 21600.
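
A minimal sketch of the first option (not counting retryable_transient timeouts against the consecutive-failure budget while queue attempts remain) might look like the following. The loopState type and recordFailure signature are hypothetical and heavily simplified; the real failure path in runner.go carries more state and may need the change in a different place.

```go
package main

import "fmt"

const retryableTransient = "retryable_transient"

// loopState is a hypothetical, trimmed-down view of the reviewer loop
// metadata (failureCount / consecutiveFailures / terminal status).
type loopState struct {
	failureCount        int
	consecutiveFailures int
	terminal            bool
}

// recordFailure still counts every failure in failureCount for visibility,
// but a retryable_transient failure does not advance the consecutive-failure
// budget while the queue item has attempts left.
func recordFailure(st loopState, kind string, nextAttempts, maxAttempts, maxConsecutive int) loopState {
	st.failureCount++
	retryLeft := kind == retryableTransient && nextAttempts < maxAttempts
	if !retryLeft {
		st.consecutiveFailures++
	}
	if st.consecutiveFailures >= maxConsecutive {
		st.terminal = true
	}
	return st
}

func main() {
	// Five consecutive timeouts with max_attempts = 5 and a loop budget of 3.
	st := loopState{}
	for next := 1; next <= 5; next++ {
		st = recordFailure(st, retryableTransient, next, 5, 3)
	}
	fmt.Printf("%+v\n", st)
}
```

In this sketch the loop stays non-terminal through all five timeouts, so the queue budget is the one that ends the retries. Non-retryable failures still trip the loop budget after three in a row.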
