Skip to content

reviewer: GitHub CLI/API transient failures in discover/snapshot are persisted as non-retryable #223

@mrcfps

Description

@mrcfps

Summary

Some reviewer failures caused by transient GitHub API/CLI errors are persisted as non_retryable instead of being retried.

Evidence

Recent failed reviewer loops:

loop step error kind error
211 discover non_retryable Post "https://api.github.com/graphql": unexpected EOF
219 discover non_retryable Post "https://api.github.com/graphql": net/http: TLS handshake timeout
357 snapshot non_retryable HTTP 504: We couldn't respond to your request in time
241 discover non_retryable GraphQL: Could not resolve to a PullRequest with the number of 345

The first three are clear transient infrastructure failures and should be retried. The fourth is a stale/nonexistent PR case and should be skipped/cancelled, not failed as a reviewer bug.

Root cause

The codebase has a GitHub transient classifier:

  • internal/infra/github/errors.go:23-44IsTransientError
  • internal/infra/github/errors.go:62-90 — detects tls handshake timeout, unexpected eof, http 504, etc.
  • internal/reviewer/runner.go:3226-3228 — maps githubinfra.IsTransientError(err) to FailureRetryableTransient

But production DB records show these errors are still ending as non_retryable. This suggests at least one of:

  1. the gateway/adapter path is wrapping GitHub CLI errors into plain strings that no longer satisfy the transient classifier,
  2. discover-step failures are not going through the same transient retry/classification path as snapshot/review,
  3. stale PR-not-found errors are not distinguished from infrastructure failures.

Also, reviewerStepSupportsTransientExternalRetry only includes:

stepSnapshot, stepThreadResolution, stepReview

and does not include stepDiscover, even though discover performs GitHub GraphQL calls.

Expected behavior

  • GitHub network/service failures should be retried with bounded backoff and recorded as retryable_transient if the retry budget is exhausted.
  • Could not resolve to a PullRequest should become skipped/cancelled if the PR no longer exists, not a failed reviewer run.
  • Discover should participate in the same transient external retry policy as snapshot/review.

Proposed fix

  • Include stepDiscover in reviewerStepSupportsTransientExternalRetry.
  • Add tests covering real gateway/CLI error values for discover and snapshot paths, not just synthetic TransientError wrappers.
  • Classify PR-not-found / deleted PR as a terminal skip/cancel reason.
  • Ensure queue/loop error summaries preserve the original GitHub error while using the correct retry kind.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions