reviewer: GitHub CLI/API transient failures in discover/snapshot are persisted as non-retryable

## Summary

Some reviewer failures caused by transient GitHub API/CLI errors are persisted as `non_retryable` instead of being retried.

## Evidence

Recent failed reviewer loops:

| loop | step | error kind | error |
| --- | --- | --- | --- |
| 211 | discover | non_retryable | `Post "https://api.github.com/graphql": unexpected EOF` |
| 219 | discover | non_retryable | `Post "https://api.github.com/graphql": net/http: TLS handshake timeout` |
| 357 | snapshot | non_retryable | `HTTP 504: We couldn't respond to your request in time` |
| 241 | discover | non_retryable | `GraphQL: Could not resolve to a PullRequest with the number of 345` |

The first three are clear transient infrastructure failures and should be retried. The fourth is a stale/nonexistent PR case and should be skipped/cancelled, not failed as a reviewer bug.

## Root cause

The codebase has a GitHub transient classifier:

- `internal/infra/github/errors.go:23-44` — `IsTransientError`
- `internal/infra/github/errors.go:62-90` — detects `tls handshake timeout`, `unexpected eof`, `http 504`, etc.
- `internal/reviewer/runner.go:3226-3228` — maps `githubinfra.IsTransientError(err)` to `FailureRetryableTransient`

But production DB records show these errors are still ending as `non_retryable`. This suggests at least one of:

1. the gateway/adapter path is wrapping GitHub CLI errors into plain strings that no longer satisfy the transient classifier,
2. discover-step failures are not going through the same transient retry/classification path as snapshot/review,
3. stale PR-not-found errors are not distinguished from infrastructure failures.

Also, `reviewerStepSupportsTransientExternalRetry` only includes:

```go
stepSnapshot, stepThreadResolution, stepReview
```

and does not include `stepDiscover`, even though discover performs GitHub GraphQL calls.

## Expected behavior

- GitHub network/service failures should be retried with bounded backoff and recorded as `retryable_transient` if the retry budget is exhausted.
- `Could not resolve to a PullRequest` should become skipped/cancelled if the PR no longer exists, not a failed reviewer run.
- Discover should participate in the same transient external retry policy as snapshot/review.

## Proposed fix

- Include `stepDiscover` in `reviewerStepSupportsTransientExternalRetry`.
- Add tests covering real gateway/CLI error values for discover and snapshot paths, not just synthetic `TransientError` wrappers.
- Classify PR-not-found / deleted PR as a terminal skip/cancel reason.
- Ensure queue/loop error summaries preserve the original GitHub error while using the correct retry kind.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reviewer: GitHub CLI/API transient failures in discover/snapshot are persisted as non-retryable #223

Summary

Evidence

Root cause

Expected behavior

Proposed fix

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

loop	step	error kind	error
211	discover	non_retryable	`Post "https://api.github.com/graphql": unexpected EOF`
219	discover	non_retryable	`Post "https://api.github.com/graphql": net/http: TLS handshake timeout`
357	snapshot	non_retryable	`HTTP 504: We couldn't respond to your request in time`
241	discover	non_retryable	`GraphQL: Could not resolve to a PullRequest with the number of 345`

reviewer: GitHub CLI/API transient failures in discover/snapshot are persisted as non-retryable #223

Description

Summary

Evidence

Root cause

Expected behavior

Proposed fix

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions