Summary
Some reviewer failures caused by transient GitHub API/CLI errors are persisted as non_retryable instead of being retried.
Evidence
Recent failed reviewer loops:
| loop |
step |
error kind |
error |
| 211 |
discover |
non_retryable |
Post "https://api.github.com/graphql": unexpected EOF |
| 219 |
discover |
non_retryable |
Post "https://api.github.com/graphql": net/http: TLS handshake timeout |
| 357 |
snapshot |
non_retryable |
HTTP 504: We couldn't respond to your request in time |
| 241 |
discover |
non_retryable |
GraphQL: Could not resolve to a PullRequest with the number of 345 |
The first three are clear transient infrastructure failures and should be retried. The fourth is a stale/nonexistent PR case and should be skipped/cancelled, not failed as a reviewer bug.
Root cause
The codebase has a GitHub transient classifier:
internal/infra/github/errors.go:23-44 — IsTransientError
internal/infra/github/errors.go:62-90 — detects tls handshake timeout, unexpected eof, http 504, etc.
internal/reviewer/runner.go:3226-3228 — maps githubinfra.IsTransientError(err) to FailureRetryableTransient
But production DB records show these errors are still ending as non_retryable. This suggests at least one of:
- the gateway/adapter path is wrapping GitHub CLI errors into plain strings that no longer satisfy the transient classifier,
- discover-step failures are not going through the same transient retry/classification path as snapshot/review,
- stale PR-not-found errors are not distinguished from infrastructure failures.
Also, reviewerStepSupportsTransientExternalRetry only includes:
stepSnapshot, stepThreadResolution, stepReview
and does not include stepDiscover, even though discover performs GitHub GraphQL calls.
Expected behavior
- GitHub network/service failures should be retried with bounded backoff and recorded as
retryable_transient if the retry budget is exhausted.
Could not resolve to a PullRequest should become skipped/cancelled if the PR no longer exists, not a failed reviewer run.
- Discover should participate in the same transient external retry policy as snapshot/review.
Proposed fix
- Include
stepDiscover in reviewerStepSupportsTransientExternalRetry.
- Add tests covering real gateway/CLI error values for discover and snapshot paths, not just synthetic
TransientError wrappers.
- Classify PR-not-found / deleted PR as a terminal skip/cancel reason.
- Ensure queue/loop error summaries preserve the original GitHub error while using the correct retry kind.
Summary
Some reviewer failures caused by transient GitHub API/CLI errors are persisted as
non_retryableinstead of being retried.Evidence
Recent failed reviewer loops:
Post "https://api.github.com/graphql": unexpected EOFPost "https://api.github.com/graphql": net/http: TLS handshake timeoutHTTP 504: We couldn't respond to your request in timeGraphQL: Could not resolve to a PullRequest with the number of 345The first three are clear transient infrastructure failures and should be retried. The fourth is a stale/nonexistent PR case and should be skipped/cancelled, not failed as a reviewer bug.
Root cause
The codebase has a GitHub transient classifier:
internal/infra/github/errors.go:23-44—IsTransientErrorinternal/infra/github/errors.go:62-90— detectstls handshake timeout,unexpected eof,http 504, etc.internal/reviewer/runner.go:3226-3228— mapsgithubinfra.IsTransientError(err)toFailureRetryableTransientBut production DB records show these errors are still ending as
non_retryable. This suggests at least one of:Also,
reviewerStepSupportsTransientExternalRetryonly includes:and does not include
stepDiscover, even though discover performs GitHub GraphQL calls.Expected behavior
retryable_transientif the retry budget is exhausted.Could not resolve to a PullRequestshould become skipped/cancelled if the PR no longer exists, not a failed reviewer run.Proposed fix
stepDiscoverinreviewerStepSupportsTransientExternalRetry.TransientErrorwrappers.