fix: return null for unknown stop reason in anthropic handler#1995
fix: return null for unknown stop reason in anthropic handler#1995
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (10)
WalkthroughThis PR introduces credential-aware retry logic by adding Changes
Sequence DiagramsequenceDiagram
participant Client
participant ChatHandler as Chat Handler<br/>(chat.ts)
participant ErrorClass as Error Classification<br/>(get-finish-reason-from-error.ts)
participant RetryLogic as Retry Decision<br/>(retry-with-fallback.ts)
participant KeyHealth as Key Health Tracker<br/>(api-key-health.ts)
participant CredChecker as Credential Checker<br/>(provider-auth-errors.ts)
participant NextProvider as Alternate<br/>Provider
Client->>ChatHandler: Request with API Key
alt HTTP Error Response
ChatHandler->>ErrorClass: Classify error (status, body)
ErrorClass->>CredChecker: hasInvalidProviderCredentialError(errorText)?
CredChecker-->>ErrorClass: true/false
ErrorClass-->>ChatHandler: gateway_error or other
else Streaming Error
ChatHandler->>ErrorClass: Get finish reason (errorType, errorText)
ErrorClass->>CredChecker: Check credential error
CredChecker-->>ErrorClass: Result
ErrorClass-->>ChatHandler: finish reason
end
ChatHandler->>RetryLogic: shouldRetryAlternateKey(finishReason, status, errorText)?
RetryLogic->>CredChecker: Evaluate credential error condition
CredChecker-->>RetryLogic: Matches invalid key pattern?
RetryLogic-->>ChatHandler: true/false (retry alternate?)
alt Should Retry Alternate Key
ChatHandler->>KeyHealth: rememberFailedKey(currentKey, errorText)
KeyHealth->>CredChecker: Check if permanent credential error
CredChecker-->>KeyHealth: Result
KeyHealth-->>ChatHandler: Mark key unhealthy/blacklisted
ChatHandler->>NextProvider: Retry with alternate key
NextProvider-->>Client: Response (success/failure)
else Don't Retry
ChatHandler->>Client: Return error response
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR expands the gateway’s handling of upstream provider credential failures by detecting invalid-provider-key error payloads (even when they come back as 400s), using that signal to (a) permanently blacklist bad keys in key health tracking and (b) retry the same provider with an alternate configured key.
Changes:
- Added
hasInvalidProviderCredentialError()helper and used it to classify/handle invalid-key payloads consistently. - Updated API key health tracking to permanently blacklist keys based on invalid-credential error text (not only 401/403).
- Updated retry logic to rotate to an alternate key on same-provider auth failures, and added test coverage + mock server triggers.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| apps/gateway/src/test-utils/mock-openai-server.ts | Adds a new fail-once invalid-key trigger to simulate 400 auth-like payloads for retry tests. |
| apps/gateway/src/lib/provider-auth-errors.ts | Introduces shared invalid-credential error-text detection helper. |
| apps/gateway/src/lib/api-key-health.ts | Uses new helper to permanently blacklist keys based on invalid credential payload text. |
| apps/gateway/src/lib/api-key-health.spec.ts | Adds coverage for permanent blacklisting on 400 invalid-key text. |
| apps/gateway/src/fallback.spec.ts | Resets key health between tests and adds scenarios validating same-provider key rotation for auth-like failures. |
| apps/gateway/src/chat/tools/retry-with-fallback.ts | Adds shouldRetryAlternateKey() to decide when to rotate keys within the same provider. |
| apps/gateway/src/chat/tools/retry-with-fallback.spec.ts | Adds unit tests for shouldRetryAlternateKey(). |
| apps/gateway/src/chat/tools/get-finish-reason-from-error.ts | Classifies known invalid-credential payload text as gateway_error (even for 400s). |
| apps/gateway/src/chat/tools/get-finish-reason-from-error.spec.ts | Adds test asserting 400 invalid-key payload becomes gateway_error. |
| apps/gateway/src/chat/chat.ts | Switches same-provider retry decision to shouldRetryAlternateKey() to include auth/invalid-key cases. |
Comments suppressed due to low confidence (1)
apps/gateway/src/chat/tools/get-finish-reason-from-error.ts:14
- The header comment for
getFinishReasonFromErrorstill describes auth classification only in terms of 401/403 status codes, but the implementation now also treats certain 400 responses (detected viahasInvalidProviderCredentialError(errorText)) asgateway_error. Consider updating the doc comment to reflect this additional classification rule so future readers don’t assume status-code-only behavior.
/**
* Determines the appropriate finish reason based on HTTP status code and error message
* 5xx status codes indicate upstream provider errors
* 429 status codes indicate upstream rate limiting (treated as upstream error)
* 404 status codes indicate model/endpoint not found at provider (treated as upstream error)
* 401/403 status codes indicate authentication/authorization issues (gateway configuration errors)
* Other 4xx status codes indicate client errors
* Special client errors (like JSON format validation) are classified as client_error
*
* Note: Error classification is separate from health tracking. The health tracking system
* (api-key-health.ts) independently handles 401/403 errors for uptime routing purposes
* by permanently blacklisting keys with these status codes.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const INVALID_PROVIDER_CREDENTIAL_PATTERNS = [ | ||
| /api key not valid/i, | ||
| /api key not found/i, | ||
| /please pass a valid api key/i, | ||
| ]; |
There was a problem hiding this comment.
PR title/description describe changing Anthropic determineStopReason to return null for unknown finish reasons, but this PR's diff does not include any changes to apps/gateway/src/anthropic/anthropic.ts (where determineStopReason currently defaults to returning "end_turn"). Please either include the missing Anthropic handler change or update the PR title/description to reflect the actual changes (provider credential detection + alternate-key retry/health behavior).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1772157b56
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| (errorType === "gateway_error" && | ||
| ((statusCode !== undefined && | ||
| (statusCode === 401 || statusCode === 403)) || | ||
| hasInvalidProviderCredentialError(errorText))) |
There was a problem hiding this comment.
Preserve gateway_error in alternate-key retry metadata
Enabling shouldRetryAlternateKey for gateway_error causes auth failures (e.g., 401/403 or 400 invalid-key payloads) to enter the retry path, but those retry attempts are still recorded via getErrorType(statusCode) in chat.ts, which maps these statuses to upstream_error. In scenarios with multiple keys on the same provider, metadata.routing[].error_type will now misreport credential failures as transient upstream failures, which can mislead routing diagnostics and dashboard analysis.
Useful? React with 👍 / 👎.
Problem
The
defaultcase indetermineStopReasonwas returning"end_turn"for any unrecognizedfinishReason, includingundefined— which occurs in mid-stream chunks where the model hasn't finished yet. This incorrectly signalled end-of-turn on partial responses.Fix
Return
nullinstead, matching the Anthropic API spec for unknown / not-yet-finished states.Change
Summary by CodeRabbit
Bug Fixes
Tests