fix: isolate key and provider health by steebchen · Pull Request #1977 · theopenco/llmgateway

steebchen · 2026-04-06T18:08:29Z

What changed

This PR now covers two related retry-health fixes in the gateway and worker:

Treat provider error payloads like API key not valid as invalid-credential failures even when the upstream status is 400, so same-provider fallback can rotate to the next key and mark the bad key unhealthy.
Isolate API key health by model, instead of sharing one health bucket across every model on the same provider key or env key.
Stop degrading derived provider routing health when one key fails but another key for the same provider succeeds on retry. Failed attempt logs are still stored for observability, but same-provider recovered retries no longer count against provider uptime.

Why

Two different skew problems showed up in routing:

Some providers surface invalid credentials via error text on 400 responses, which meant the gateway would not recognize the failure as credential-related and would not retry the next key.
Key and provider health were being penalized too broadly. A bad key on one model could skew selection for other models, and an intermediate same-provider retry failure could make the provider itself look less healthy even when the request ultimately succeeded on another key for that provider.

Impact

Requests can recover from invalid-key payloads by retrying another configured key for the same provider.
Key failover decisions are now model-scoped, so one model's failures do not poison the same key for unrelated models.
Provider routing health only drops when that provider actually failed end-to-end for the request, not when a same-provider retry recovered successfully.

Root cause

Retry classification only recognized credential failures by status code, and the in-memory health tracking plus derived provider routing stats were both aggregating failures too broadly.

Validation

pnpm format
pnpm vitest run apps/gateway/src/lib/api-key-health.spec.ts apps/gateway/src/lib/cached-queries.spec.ts apps/gateway/src/lib/round-robin-env.spec.ts apps/gateway/src/chat/tools/get-provider-env.spec.ts apps/worker/src/services/stats-calculator.spec.ts --no-file-parallelism
pnpm vitest run apps/gateway/src/chat/tools/get-finish-reason-from-error.spec.ts apps/gateway/src/chat/tools/retry-with-fallback.spec.ts apps/gateway/src/lib/api-key-health.spec.ts apps/gateway/src/fallback.spec.ts --no-file-parallelism

Summary by CodeRabbit

New Features
- API key health tracking is now scoped to specific models, enabling smarter key rotation decisions
- Enhanced detection of invalid API key errors for improved retry handling
- Alternate API key retry logic expanded for specific failure scenarios
Bug Fixes
- Fixed statistics calculation to properly exclude same-provider recovered retries from health metrics
- Improved server timing reliability in tests
Tests
- Added comprehensive coverage for model-scoped key selection and alternate key retry behavior

coderabbitai · 2026-04-06T18:08:37Z

Walkthrough

Introduced model-scoped API key health tracking and selection across the provider environment system. Added selectionScope parameter to health metrics, round-robin key selection, and provider key lookup functions. Implemented shouldRetryAlternateKey helper for intelligent alternate-key retry decisions. Added credential error detection via hasInvalidProviderCredentialError. Migrated from requestId to baseModelName for health tracking and key selection scoping. Updated error classification to treat invalid credential errors as gateway_error.

Changes

Cohort / File(s)	Summary
Credential Error Detection `apps/gateway/src/lib/provider-auth-errors.ts`	New exported helper `hasInvalidProviderCredentialError()` with regex patterns for detecting invalid API key/credential errors.
Error Classification Updates `apps/gateway/src/chat/tools/get-finish-reason-from-error.ts`, `apps/gateway/src/chat/tools/get-finish-reason-from-error.spec.ts`	Extended `getFinishReasonFromError` to classify HTTP 400 with invalid credential error payloads as `gateway_error`; added test coverage.
Health Tracking Scoping `apps/gateway/src/lib/api-key-health.ts`, `apps/gateway/src/lib/api-key-health.spec.ts`	Added optional `selectionScope` parameter to all health/metrics functions; replaced error-text matching with `hasInvalidProviderCredentialError`; added test for credential-based permanent blacklisting.
Round-Robin Selection Scoping `apps/gateway/src/lib/round-robin-env.ts`, `apps/gateway/src/lib/round-robin-env.spec.ts`	Added optional `selectionScope` parameter to `getRoundRobinValue` and `peekRoundRobinValue`; forwarded scope to health/metrics lookups; added test validating model-scoped isolation.
Provider Key Selection & Environment `apps/gateway/src/lib/cached-queries.ts`, `apps/gateway/src/lib/cached-queries.spec.ts`, `apps/gateway/src/chat/tools/get-provider-env.ts`, `apps/gateway/src/chat/tools/get-provider-env.spec.ts`	Renamed parameter from `_selectionKey` to `selectionScope` in `findProviderKey`/`findCustomProviderKey`; added `selectionScope` to `GetProviderEnvOptions`; threaded scope through round-robin calls; added test validating scope-based key selection differences.
Alternate-Key Retry Logic `apps/gateway/src/chat/tools/retry-with-fallback.ts`, `apps/gateway/src/chat/tools/retry-with-fallback.spec.ts`	Added exported `shouldRetryAlternateKey()` helper determining when to rotate API keys; checks for retryable errors, auth failures (401/403), and invalid credentials; comprehensive test coverage.
Provider Context & Chat Flow `apps/gateway/src/chat/tools/resolve-provider-context.ts`, `apps/gateway/src/chat/chat.ts`	Replaced `requestId` with `baseModelName` in provider key lookups; added `selectionScope: baseModelName` to `getProviderEnv` calls; integrated `shouldRetryAlternateKey` into error handling; extended key-health reporting calls with `baseModelName`.
Integration & End-to-End Tests `apps/gateway/src/api.spec.ts`, `apps/gateway/src/fallback.spec.ts`, `apps/gateway/src/graceful-shutdown.spec.ts`, `apps/gateway/src/test-utils/mock-openai-server.ts`	Updated SSE test to validate scoped metrics; added retry test cases for auth failures and invalid keys; added `waitForServerListening` helper; mock server now handles `TRIGGER_FAIL_ONCE_INVALID_KEY` trigger.
Worker Stats Aggregation `apps/worker/src/services/stats-calculator.ts`, `apps/worker/src/services/stats-calculator.spec.ts`	Added `excludeRecoveredSameProviderRetry()` predicate to filter same-provider recovered retries from health/error stats; comprehensive test validating provider-scoped recovery isolation.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ChatHandler as Chat Handler
    participant KeySelector as Provider<br/>Key Selector
    participant HealthTracker as Health<br/>Tracker
    participant Provider as Upstream<br/>Provider

    Client->>ChatHandler: Request with model (baseModelName)
    ChatHandler->>KeySelector: findProviderKey(org, provider,<br/>selectionScope: baseModelName)
    KeySelector->>HealthTracker: isTrackedKeyHealthy(keyId,<br/>selectionScope: baseModelName)
    HealthTracker-->>KeySelector: healthy? (scoped to model)
    KeySelector-->>ChatHandler: primaryKey

    ChatHandler->>Provider: Send request with primaryKey
    Provider-->>ChatHandler: Error (401/403/400)

    ChatHandler->>ChatHandler: shouldRetryAlternateKey(errorType,<br/>statusCode, errorText)?
    alt Retry Eligible
        ChatHandler->>KeySelector: findProviderKey(org, provider,<br/>selectionScope: baseModelName,<br/>excludedKeyIds: [primaryKey])
        KeySelector->>HealthTracker: isTrackedKeyHealthy(alternateKeyId,<br/>selectionScope: baseModelName)
        HealthTracker-->>KeySelector: healthy? (scoped to model)
        KeySelector-->>ChatHandler: alternateKey

        ChatHandler->>HealthTracker: reportTrackedKeyError(primaryKeyId,<br/>statusCode, errorText,<br/>selectionScope: baseModelName)
        HealthTracker-->>ChatHandler: Scoped metrics updated

        ChatHandler->>Provider: Retry with alternateKey
        Provider-->>ChatHandler: Success
        ChatHandler->>HealthTracker: reportTrackedKeySuccess(alternateKeyId,<br/>selectionScope: baseModelName)
    else Not Eligible
        ChatHandler->>HealthTracker: reportTrackedKeyError(primaryKeyId,<br/>statusCode, errorText,<br/>selectionScope: baseModelName)
        ChatHandler-->>Client: Return error
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~90 minutes

Possibly related PRs

feat: support multi-key provider routing #1887 — Modifies provider-key selection API by adding request-scoped selection argument to findProviderKey/findCustomProviderKey; this PR extends that pattern with selectionScope scoping for health and metrics.
feat: add uptime-aware routing for multiple API keys #1231 — Introduces scoped metrics tracking and optional parameters to API-key health and round-robin selection functions; this PR builds on that foundation with credential error detection and alternate-key retry integration.
fix(gateway): ignore non-exempt provider 4xx for uptime #1824 — Modifies invalid-provider-credential handling in get-finish-reason-from-error and api-key-health; this PR introduces credential error pattern detection and integrates it into retry decisions.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 58.62% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: isolate key and provider health' directly and specifically describes the main objective of the PR: implementing model-scoped key health isolation and provider health adjustments.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/retry-invalid-api-key-payloads

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…-key-payloads

Copilot

Pull request overview

This PR refines retry/health behavior across the gateway and worker so that (1) invalid-credential failures are detected even when upstream responds with 400, (2) API key health is tracked per-model (not globally per key), and (3) provider routing health stats aren’t degraded by same-provider retries that ultimately recover.

Changes:

Add invalid-provider-credential text detection and use it in finish-reason classification, retry decisions, and key blacklisting.
Scope key health tracking/selection by model via a selectionScope plumbed through env-key and DB-key selection paths.
Exclude same-provider recovered retry attempts from worker-generated routing health statistics, with supporting tests.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
apps/worker/src/services/stats-calculator.ts	Filters out recovered same-provider retry attempts from minute-level stats aggregation.
apps/worker/src/services/stats-calculator.spec.ts	Adds coverage ensuring recovered same-provider retries don’t degrade mapping/model health stats.
apps/gateway/src/test-utils/mock-openai-server.ts	Adds a mock scenario for invalid-key payloads returned as `400` on first attempt.
apps/gateway/src/lib/round-robin-env.ts	Threads `selectionScope` into env key selection so health is isolated by scope.
apps/gateway/src/lib/round-robin-env.spec.ts	Tests env-key health isolation per model scope.
apps/gateway/src/lib/provider-auth-errors.ts	New helper for detecting invalid credential error payload patterns.
apps/gateway/src/lib/cached-queries.ts	Threads `selectionScope` into DB provider-key selection and health checks.
apps/gateway/src/lib/cached-queries.spec.ts	Tests tracked-key health isolation per model scope.
apps/gateway/src/lib/api-key-health.ts	Adds model-scoped health keys and uses invalid-credential text detection for permanent blacklisting.
apps/gateway/src/lib/api-key-health.spec.ts	Adds a test for permanently blacklisting invalid-key text on an otherwise-ignored `4xx`.
apps/gateway/src/graceful-shutdown.spec.ts	Stabilizes tests by waiting for the server to be listening before making requests.
apps/gateway/src/fallback.spec.ts	Adds fallback tests for same-provider key rotation on auth failures and invalid-key payloads; resets key health between tests.
apps/gateway/src/chat/tools/retry-with-fallback.ts	Introduces `shouldRetryAlternateKey` to allow same-provider key rotation for auth/invalid-key payloads.
apps/gateway/src/chat/tools/retry-with-fallback.spec.ts	Adds unit tests for `shouldRetryAlternateKey`.
apps/gateway/src/chat/tools/resolve-provider-context.ts	Passes base model as `selectionScope` into env/key selection.
apps/gateway/src/chat/tools/get-provider-env.ts	Accepts `selectionScope` and forwards it into round-robin env key selection.
apps/gateway/src/chat/tools/get-provider-env.spec.ts	Tests that `selectionScope` affects env key health/selection.
apps/gateway/src/chat/tools/get-finish-reason-from-error.ts	Treats invalid-key payload text as credential failure (gateway_error) even for `400`.
apps/gateway/src/chat/tools/get-finish-reason-from-error.spec.ts	Adds coverage for invalid-key payloads on `400`.
apps/gateway/src/chat/chat.ts	Plumbs model scope into key health reporting and uses `shouldRetryAlternateKey` for same-provider retries.
apps/gateway/src/api.spec.ts	Updates tracked-key health assertions to use scoped health keys.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-07T19:20:52Z

apps/gateway/src/lib/cached-queries.ts

 */
 export async function findCustomProviderKey(
 	organizationId: string,
 	customProviderName: string,
-	_selectionKey?: string,
+	selectionScope?: string,
 	excludedKeyIds?: ReadonlySet<string>,
 ): Promise<ProviderKey | undefined> {
 	const results = await db
 		.select()
 		.from(providerKeyTable)
 		.where(
 			and(
 				eq(providerKeyTable.status, "active"),
 				eq(providerKeyTable.organizationId, organizationId),
 				eq(providerKeyTable.provider, "custom"),
 				eq(providerKeyTable.name, customProviderName),
 			),
 		)
 		.orderBy(asc(providerKeyTable.createdAt), asc(providerKeyTable.id));
-	return selectProviderKeyWithFailover(results, excludedKeyIds);
+	return selectProviderKeyWithFailover(results, selectionScope, excludedKeyIds);
 }

 /**
 * Find a provider key by organization and provider (cacheable)
 */
 export async function findProviderKey(
 	organizationId: string,
 	provider: string,
-	_selectionKey?: string,
+	selectionScope?: string,
 	excludedKeyIds?: ReadonlySet<string>,
 ): Promise<ProviderKey | undefined> {
 	const results = await db
 		.select()
 		.from(providerKeyTable)
 		.where(
 			and(
 				eq(providerKeyTable.status, "active"),
 				eq(providerKeyTable.organizationId, organizationId),
 				eq(providerKeyTable.provider, provider),
 			),
 		)
 		.orderBy(asc(providerKeyTable.createdAt), asc(providerKeyTable.id));
-	return selectProviderKeyWithFailover(results, excludedKeyIds);
+	return selectProviderKeyWithFailover(results, selectionScope, excludedKeyIds);


findProviderKey / findCustomProviderKey now treat the 3rd argument as selectionScope (used to scope in-memory key health buckets). There are still call sites in the repo passing requestId/selectionKey as the 3rd argument (e.g. apps/gateway/src/moderations/moderations.ts and apps/gateway/src/videos/videos.ts), which will accidentally scope health per-request and can cause unbounded keyHealthMap growth + inconsistent failover behavior. Please update those call sites to pass a stable scope (e.g. base model name) or undefined, and if request-based cache-busting is still needed, reintroduce it as a separate parameter rather than overloading selectionScope.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b45c134a5d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-07T19:21:08Z

apps/gateway/src/lib/cached-queries.ts

+			metrics: getTrackedKeyMetrics(item.id, selectionScope),
 		}))
-		.filter(({ item }) => isTrackedKeyHealthy(item.id));
+		.filter(({ item }) => isTrackedKeyHealthy(item.id, selectionScope));


Restore backward-compatible selection-key semantics

This change makes selectionScope participate in key-health lookups (getTrackedKeyMetrics / isTrackedKeyHealthy), but the third argument on findProviderKey/findCustomProviderKey was previously an ignored _selectionKey, and some callers still pass per-request IDs (for example apps/gateway/src/moderations/moderations.ts around lines 380-401 and apps/gateway/src/videos/videos.ts around line 2164). Because those values are unique per request while health reporting there remains unscoped, key selection effectively sees a fresh healthy bucket every time and can keep choosing bad keys instead of failing over.

Useful? React with 👍 / 👎.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

apps/gateway/src/chat/chat.ts (2)
2694-2724: ⚠️ Potential issue | 🟠 Major

Keep the tracked key identity in sync when a regional env token replaces the selected credential.

Lines 2719-2724, 2768-2773, 2794-2799, and 2845-2850 can swap usedToken to a region-specific env value, but providerKey, envVarName, and configIndex still describe the original DB/base key. The later rememberFailedKey / reportKey* calls will quarantine or heal the wrong credential, and same-provider retry can keep reusing the same bad regional token while rotating unrelated keys.

Also applies to: 2761-2774, 2778-2800, 2838-2851
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/chat/chat.ts` around lines 2694 - 2724, The region-specific
env token replacement swaps usedToken but leaves
providerKey/envVarName/configIndex pointing at the original DB key, causing
wrong keys to be reported or quarantined; update the logic in the blocks that
call getRegionSpecificEnvValue (around where usedRegion is set after
resolveRegionFromProviderKey) to also update the tracking identity: when
regionToken replaces usedToken, set providerKey.token (or create a new
providerKey-like object) and update envVarName and configIndex to reflect the
env-derived credential so subsequent calls to rememberFailedKey and reportKey*
operate on the actual credential in use; ensure the same change is applied to
all similar blocks (the ones you noted at 2719-2724, 2768-2773, 2794-2799,
2845-2850 and the other mirrored ranges).
4846-4860: ⚠️ Potential issue | 🟠 Major

Do not let recovered same-provider retries mark the provider as failed.

These branches now record failed attempts for alternate-key retries, but the later failedMap / providerScores enrichment still keys failures only by provider at Lines 5335-5351 and 8502-8516. If key A fails and key B succeeds on the same provider, the provider still ends up marked failed, so derived provider uptime is still penalized.
Possible follow-up
-const failedMap = new Map(
-	routingAttempts
-		.filter((a) => !a.succeeded)
-		.map((f) => [f.provider, f]),
-);
+const recoveredProviders = new Set(
+	routingAttempts.filter((a) => a.succeeded).map((a) => a.provider),
+);
+const failedMap = new Map(
+	routingAttempts
+		.filter((a) => !a.succeeded && !recoveredProviders.has(a.provider))
+		.map((f) => [f.provider, f]),
+);
Also applies to: 5123-5137, 8231-8241
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/chat/chat.ts` around lines 4846 - 4860, The current logic
calls rememberFailedKey during same-provider alternate-key retries which causes
the provider to be treated as failed even if another key on the same provider
later succeeds; update the flow so that when shouldRetryAlternateKey(...)
triggers and you call tryResolveAlternateKeyForCurrentProvider(true) you do NOT
mark the provider as failed up-front — instead only record the specific key
failure (or pass a flag to rememberFailedKey to avoid marking provider-level
failure), and only add entries to failedMap/providerScores (the provider-level
failure enrichment used later) when all keys for that provider have actually
failed or when there is no successful alternate key (i.e., check the result of
tryResolveAlternateKeyForCurrentProvider and only escalate to provider-level
failure if it returns falsy). Ensure changes touch the branches around
shouldRetryAlternateKey, rememberFailedKey,
tryResolveAlternateKeyForCurrentProvider and the later failedMap/providerScores
enrichment so provider uptime isn’t penalized when an alternate key recovers the
request.

🧹 Nitpick comments (1)

apps/gateway/src/lib/api-key-health.ts (1)
47-68: Consider monitoring memory growth with model-scoped health tracking.

With model-scoped health keys, keyHealthMap entries grow as O(keys × models) rather than O(keys). While MAX_HISTORY_SIZE and METRICS_WINDOW_MS bound per-entry memory, there's no mechanism to prune stale entries for models no longer in use.

For most deployments this is acceptable, but consider:

Monitoring getTrackedKeyCount() in production

Adding TTL-based eviction for entries with no recent activity if memory becomes a concern
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/lib/api-key-health.ts` around lines 47 - 68, The
keyHealthMap can grow unbounded when using model-scoped keys; add a TTL-based
eviction to prune entries with no recent activity: augment KeyHealth entries
with a lastActivity timestamp updated in recordError/recordSuccess, expose
getTrackedKeyCount() for monitoring, and implement a periodic cleanup task
(interval based on METRICS_WINDOW_MS or a new CLEANUP_INTERVAL_MS) that removes
map entries whose lastActivity is older than a configurable TTL (e.g.,
METRICS_WINDOW_MS or a new KEY_TTL_MS) while preserving MAX_HISTORY_SIZE
semantics; reference keyHealthMap, KeyHealth, recordError/recordSuccess,
getTrackedKeyCount, MAX_HISTORY_SIZE and METRICS_WINDOW_MS to locate where to
update and where to add the cleanup timer.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/gateway/src/chat/chat.ts`:
- Around line 2694-2704: Calls to findCustomProviderKey and findProviderKey are
missing the new selectionScope argument in several places; update every
invocation (including the earlier call near the top and the ones that correspond
to the referenced blocks) to pass the current selectionScope value so the key
lookups are region/selection-scoped. Specifically, add selectionScope as the
third parameter when calling findCustomProviderKey(project.organizationId,
customProviderName, selectionScope) and when calling
findProviderKey(project.organizationId, usedProvider, selectionScope) (and
similarly for the other direct-provider calls noted), ensuring you use the same
selectionScope variable used elsewhere in this file so custom-provider
validation and region locking are correctly scoped.

In `@apps/gateway/src/lib/provider-auth-errors.ts`:
- Around line 1-5: The current INVALID_PROVIDER_CREDENTIAL_PATTERNS only matches
human-readable message text and misses structured error codes like
"invalid_api_key"; update the pattern array
(INVALID_PROVIDER_CREDENTIAL_PATTERNS) to include a case-insensitive regex that
matches the error code token (e.g. /\binvalid_api_key\b/i or a JSON-ish code
match) so payloads carrying code: "invalid_api_key" trigger the same
auth-failure handling, and apply the same change to the other identical pattern
array found later in the file (the second occurrence at lines 12-14).

In `@apps/gateway/src/test-utils/mock-openai-server.ts`:
- Around line 770-785: The invalid-key trigger handler (checking userMessage for
"TRIGGER_FAIL_ONCE_INVALID_KEY") is colliding with the more general
"TRIGGER_FAIL_ONCE" handler and causing failOnceCounter to be incremented twice;
make the triggers mutually exclusive by changing the control flow so only one
handler runs: either (A) convert the generic "TRIGGER_FAIL_ONCE" handler to an
else if so it won’t run when "TRIGGER_FAIL_ONCE_INVALID_KEY" matched, or (B)
make the invalid-key check more specific (e.g., exact match or
word-boundary/unique token) so
userMessage.includes("TRIGGER_FAIL_ONCE_INVALID_KEY") cannot also match the
generic "TRIGGER_FAIL_ONCE"; adjust code around failOnceCounter, userMessage,
TRIGGER_FAIL_ONCE_INVALID_KEY and TRIGGER_FAIL_ONCE accordingly.

---

Outside diff comments:
In `@apps/gateway/src/chat/chat.ts`:
- Around line 2694-2724: The region-specific env token replacement swaps
usedToken but leaves providerKey/envVarName/configIndex pointing at the original
DB key, causing wrong keys to be reported or quarantined; update the logic in
the blocks that call getRegionSpecificEnvValue (around where usedRegion is set
after resolveRegionFromProviderKey) to also update the tracking identity: when
regionToken replaces usedToken, set providerKey.token (or create a new
providerKey-like object) and update envVarName and configIndex to reflect the
env-derived credential so subsequent calls to rememberFailedKey and reportKey*
operate on the actual credential in use; ensure the same change is applied to
all similar blocks (the ones you noted at 2719-2724, 2768-2773, 2794-2799,
2845-2850 and the other mirrored ranges).
- Around line 4846-4860: The current logic calls rememberFailedKey during
same-provider alternate-key retries which causes the provider to be treated as
failed even if another key on the same provider later succeeds; update the flow
so that when shouldRetryAlternateKey(...) triggers and you call
tryResolveAlternateKeyForCurrentProvider(true) you do NOT mark the provider as
failed up-front — instead only record the specific key failure (or pass a flag
to rememberFailedKey to avoid marking provider-level failure), and only add
entries to failedMap/providerScores (the provider-level failure enrichment used
later) when all keys for that provider have actually failed or when there is no
successful alternate key (i.e., check the result of
tryResolveAlternateKeyForCurrentProvider and only escalate to provider-level
failure if it returns falsy). Ensure changes touch the branches around
shouldRetryAlternateKey, rememberFailedKey,
tryResolveAlternateKeyForCurrentProvider and the later failedMap/providerScores
enrichment so provider uptime isn’t penalized when an alternate key recovers the
request.

---

Nitpick comments:
In `@apps/gateway/src/lib/api-key-health.ts`:
- Around line 47-68: The keyHealthMap can grow unbounded when using model-scoped
keys; add a TTL-based eviction to prune entries with no recent activity: augment
KeyHealth entries with a lastActivity timestamp updated in
recordError/recordSuccess, expose getTrackedKeyCount() for monitoring, and
implement a periodic cleanup task (interval based on METRICS_WINDOW_MS or a new
CLEANUP_INTERVAL_MS) that removes map entries whose lastActivity is older than a
configurable TTL (e.g., METRICS_WINDOW_MS or a new KEY_TTL_MS) while preserving
MAX_HISTORY_SIZE semantics; reference keyHealthMap, KeyHealth,
recordError/recordSuccess, getTrackedKeyCount, MAX_HISTORY_SIZE and
METRICS_WINDOW_MS to locate where to update and where to add the cleanup timer.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fa8012b2-8a2d-4667-8256-d0c644e96a9d

📥 Commits

Reviewing files that changed from the base of the PR and between 72569f6 and b45c134.

📒 Files selected for processing (21)

apps/gateway/src/api.spec.ts
apps/gateway/src/chat/chat.ts
apps/gateway/src/chat/tools/get-finish-reason-from-error.spec.ts
apps/gateway/src/chat/tools/get-finish-reason-from-error.ts
apps/gateway/src/chat/tools/get-provider-env.spec.ts
apps/gateway/src/chat/tools/get-provider-env.ts
apps/gateway/src/chat/tools/resolve-provider-context.ts
apps/gateway/src/chat/tools/retry-with-fallback.spec.ts
apps/gateway/src/chat/tools/retry-with-fallback.ts
apps/gateway/src/fallback.spec.ts
apps/gateway/src/graceful-shutdown.spec.ts
apps/gateway/src/lib/api-key-health.spec.ts
apps/gateway/src/lib/api-key-health.ts
apps/gateway/src/lib/cached-queries.spec.ts
apps/gateway/src/lib/cached-queries.ts
apps/gateway/src/lib/provider-auth-errors.ts
apps/gateway/src/lib/round-robin-env.spec.ts
apps/gateway/src/lib/round-robin-env.ts
apps/gateway/src/test-utils/mock-openai-server.ts
apps/worker/src/services/stats-calculator.spec.ts
apps/worker/src/services/stats-calculator.ts

coderabbitai · 2026-04-07T19:24:39Z

apps/gateway/src/chat/chat.ts

 			providerKey = await findCustomProviderKey(
 				project.organizationId,
 				customProviderName,
-				requestId,
+				baseModelName,
 			);
 		} else {
 			providerKey = await findProviderKey(
 				project.organizationId,
 				usedProvider,
-				requestId,
+				baseModelName,
 			);


⚠️ Potential issue | 🟠 Major

Complete the selectionScope propagation for the earlier key lookups.

These lookups are scoped now, but Line 1437 and Lines 1806-1809 / 2457-2460 still call findCustomProviderKey / findProviderKey without the new third argument. That leaves custom-provider validation and direct-provider region locking on global key health, so another model's failures can still hide a healthy key or lock the wrong region.

Also applies to: 2778-2788

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/gateway/src/chat/chat.ts` around lines 2694 - 2704, Calls to findCustomProviderKey and findProviderKey are missing the new selectionScope argument in several places; update every invocation (including the earlier call near the top and the ones that correspond to the referenced blocks) to pass the current selectionScope value so the key lookups are region/selection-scoped. Specifically, add selectionScope as the third parameter when calling findCustomProviderKey(project.organizationId, customProviderName, selectionScope) and when calling findProviderKey(project.organizationId, usedProvider, selectionScope) (and similarly for the other direct-provider calls noted), ensuring you use the same selectionScope variable used elsewhere in this file so custom-provider validation and region locking are correctly scoped.

coderabbitai · 2026-04-07T19:24:39Z

apps/gateway/src/lib/provider-auth-errors.ts

+const INVALID_PROVIDER_CREDENTIAL_PATTERNS = [
+	/api key not valid/i,
+	/api key not found/i,
+	/please pass a valid api key/i,
+];


⚠️ Potential issue | 🟠 Major

Add invalid_api_key code matching to avoid missed credential failures.

At Line 2-Line 4, matching only message text is brittle. Payloads that carry code: "invalid_api_key" with different wording can bypass auth-failure handling, preventing intended retry/blacklist behavior.

🔧 Proposed hardening

const INVALID_PROVIDER_CREDENTIAL_PATTERNS = [ + /\binvalid_api_key\b/i, /api key not valid/i, /api key not found/i, /please pass a valid api key/i, + /incorrect api key provided/i, ];

Also applies to: 12-14

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/gateway/src/lib/provider-auth-errors.ts` around lines 1 - 5, The current INVALID_PROVIDER_CREDENTIAL_PATTERNS only matches human-readable message text and misses structured error codes like "invalid_api_key"; update the pattern array (INVALID_PROVIDER_CREDENTIAL_PATTERNS) to include a case-insensitive regex that matches the error code token (e.g. /\binvalid_api_key\b/i or a JSON-ish code match) so payloads carrying code: "invalid_api_key" trigger the same auth-failure handling, and apply the same change to the other identical pattern array found later in the file (the second occurrence at lines 12-14).

coderabbitai · 2026-04-07T19:24:39Z

apps/gateway/src/test-utils/mock-openai-server.ts

+	// Check if this request should fail on the first attempt but succeed on retry
+	if (userMessage.includes("TRIGGER_FAIL_ONCE_INVALID_KEY")) {
+		failOnceCounter++;
+		if (failOnceCounter === 1) {
+			c.status(400);
+			return c.json({
+				error: {
+					message: "API key not valid. Please pass a valid API key.",
+					type: "authentication_error",
+					param: null,
+					code: "invalid_api_key",
+				},
+			});
+		}
+		// Subsequent requests succeed - fall through to normal response
+	}


⚠️ Potential issue | 🟡 Minor

Avoid double-counting failOnceCounter for the invalid-key trigger.

At Line 771, this trigger is a substring match of TRIGGER_FAIL_ONCE. On retries, both handlers run, incrementing the shared counter twice and increasing cross-test coupling.

💡 Suggested fix

-if (userMessage.includes("TRIGGER_FAIL_ONCE_INVALID_KEY")) { +if (userMessage.includes("TRIGGER_FAIL_ONCE_INVALID_KEY")) { failOnceCounter++; if (failOnceCounter === 1) { c.status(400); return c.json({ error: { message: "API key not valid. Please pass a valid API key.", type: "authentication_error", param: null, code: "invalid_api_key", }, }); } // Subsequent requests succeed - fall through to normal response -} +} else if (userMessage.includes("TRIGGER_FAIL_ONCE")) { + // existing generic fail-once logic +}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/gateway/src/test-utils/mock-openai-server.ts` around lines 770 - 785, The invalid-key trigger handler (checking userMessage for "TRIGGER_FAIL_ONCE_INVALID_KEY") is colliding with the more general "TRIGGER_FAIL_ONCE" handler and causing failOnceCounter to be incremented twice; make the triggers mutually exclusive by changing the control flow so only one handler runs: either (A) convert the generic "TRIGGER_FAIL_ONCE" handler to an else if so it won’t run when "TRIGGER_FAIL_ONCE_INVALID_KEY" matched, or (B) make the invalid-key check more specific (e.g., exact match or word-boundary/unique token) so userMessage.includes("TRIGGER_FAIL_ONCE_INVALID_KEY") cannot also match the generic "TRIGGER_FAIL_ONCE"; adjust code around failOnceCounter, userMessage, TRIGGER_FAIL_ONCE_INVALID_KEY and TRIGGER_FAIL_ONCE accordingly.

steebchen changed the title ~~fix: retry invalid api key payloads~~ fix: isolate key and provider health Apr 6, 2026

steebchen added 3 commits April 7, 2026 14:36

fix: retry invalid api key payloads

b8579f3

fix: isolate key and provider health

8391b63

test: harden flaky gateway specs

b49cce3

steebchen force-pushed the fix/retry-invalid-api-key-payloads branch from 99f947c to b49cce3 Compare April 7, 2026 07:52

Merge remote-tracking branch 'origin/main' into fix/retry-invalid-api…

b45c134

…-key-payloads

steebchen marked this pull request as ready for review April 7, 2026 19:15

Copilot AI review requested due to automatic review settings April 7, 2026 19:15

Copilot started reviewing on behalf of steebchen April 7, 2026 19:16 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

chatgpt-codex-connector bot reviewed Apr 7, 2026

View reviewed changes

coderabbitai bot reviewed Apr 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: isolate key and provider health#1977

fix: isolate key and provider health#1977
steebchen wants to merge 4 commits intomainfrom
fix/retry-invalid-api-key-payloads

steebchen commented Apr 6, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Apr 6, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

chatgpt-codex-connector bot Apr 7, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Apr 7, 2026

Uh oh!

coderabbitai bot Apr 7, 2026

Uh oh!

coderabbitai bot Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

steebchen commented Apr 6, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Why

Impact

Root cause

Validation

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

steebchen commented Apr 6, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Apr 6, 2026 •

edited

Loading