fix: surface anthropic + bedrock prompt cache tokens#1992
Conversation
Anthropic and Bedrock prompt caching pass-through was unreliable: client-supplied cache_control markers were partially handled, cache token usage was inconsistently surfaced through the openai-compat and native /v1/messages paths, and there was no e2e coverage for streaming or Bedrock. This change:

- Honors caller-supplied cache_control on text content parts in /v1/chat/completions (new optional schema field) and forwards them verbatim to Anthropic, mapping to cachePoint blocks for Bedrock. Falls back to the existing length-based heuristic when no marker is provided.
- Preserves cache_control on system and message text blocks coming through the native /v1/messages endpoint, and surfaces cache_creation_input_tokens / cache_read_input_tokens on responses (always emitted, set to 0 when inapplicable, matching Anthropic's actual API).
- Surfaces cache_creation_tokens alongside cached_tokens in prompt_tokens_details on the openai-compat response, including streaming chunks, via a new normalizeAnthropicUsage helper.
- Strips cache_control from text parts when routing to non-Anthropic / non-Bedrock providers so OpenAI, Google, etc. don't receive an unknown field.
- Adds end-to-end tests covering: native /v1/messages with explicit cache_control, openai-compat for both Anthropic and Bedrock, streaming for both, and explicit cache_control on /v1/chat/completions. Each asserts cached_tokens > 0 after a retry-with-backoff (Anthropic prompt cache writes are eventually consistent), and where applicable asserts the per-token cached cost is strictly less than the per-token uncached cost within the same response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
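For illustration, a request exercising the new explicit-marker path might look like the sketch below. The payload shape follows the description above and the model id used in the tests; it is a sketch, not verbatim gateway code.

```typescript
// Sketch only: an openai-compat request attaching an explicit cache_control
// marker to a text content part. Per the change description, this is forwarded
// verbatim to Anthropic, mapped to a cachePoint block for Bedrock, and
// stripped for other providers.
const systemPart = {
  type: "text" as const,
  text: "Long, reusable system instructions go here.",
  cache_control: { type: "ephemeral" as const },
};

const requestBody = {
  model: "anthropic/claude-haiku-4-5",
  messages: [
    { role: "system" as const, content: [systemPart] },
    { role: "user" as const, content: "Hello" },
  ],
};
```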
🚥 Pre-merge checks: 2 passed, 1 failed (1 warning)
Make cache_creation_input_tokens and cache_read_input_tokens optional with a default of 0 in anthropicResponseSchema. Anthropic emits these on caching-supported models today, but a non-optional schema would fail validation if an older Claude model, a beta endpoint, or a future API change ever omits them — turning a graceful "no caching info" into a 500. The downstream conversion code already handles 0 correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
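The graceful degradation this suggestion asks for can be sketched outside the schema as a plain normalizer (zod's `.optional().default(0)` expresses the same thing inside `anthropicResponseSchema`). The type below is an assumption for illustration; the real schema has more fields.

```typescript
// Sketch: an omitted cache counter means "no caching info" (0), not a
// validation failure. Hypothetical type, not the real anthropicResponseSchema.
type AnthropicUsage = {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
};

function withCacheDefaults(u: AnthropicUsage): Required<AnthropicUsage> {
  return {
    ...u,
    // Defaulting to 0 matches what the downstream conversion code already
    // handles correctly.
    cache_creation_input_tokens: u.cache_creation_input_tokens ?? 0,
    cache_read_input_tokens: u.cache_read_input_tokens ?? 0,
  };
}
```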
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
packages/actions/src/prepare-request-body.ts (1)
1412-1460: ⚠️ Potential issue | 🟠 Major

The Bedrock fallback path no longer preserves legacy system-message concatenation.

When no explicit `cache_control` is present, this still collects array-based system content one text part at a time. That changes the heuristic from "cache the whole system message if its combined text is long enough" to "cache each part independently", so a long multipart system prompt can now miss `cachePoint` entirely. Preserve per-part handling only in the explicit-marker path; otherwise concatenate each system message's text first. Using `isTextContent` here would also remove the `as any[]` escape hatch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/actions/src/prepare-request-body.ts` around lines 1412 - 1460, The current loop over bedrockSystemMessages pushes each array part as its own block which breaks the legacy "concatenate whole system message" heuristic; update the handling in the collectedBedrockBlocks build so that when sysMsg.content is an array you first inspect parts using isTextContent and detect whether any part has an explicit cache_control: if any part has cache_control, push each text part separately with hasExplicitCacheControl set appropriately (preserving per-part markers), otherwise concatenate all text parts into one string and push a single block with hasExplicitCacheControl=false; keep the rest of the logic (systemContent, bedrockCacheControlCount, bedrockMaxCacheControlBlocks, bedrockMinCacheableChars) unchanged.

apps/gateway/src/chat/tools/transform-streaming-to-openai.ts (1)
1221-1248: ⚠️ Potential issue | 🟠 Major

Bedrock streaming drops `cache_creation_tokens` on cache writes.

`cacheWriteTokens` is included in `prompt_tokens`, but `prompt_tokens_details` is only emitted when `cacheReadTokens > 0`. A write-only cache hit in streaming mode therefore loses the new metric even though the non-streaming parser now returns it.

🐛 Proposed fix
```diff
 usage: {
   prompt_tokens: promptTokens,
   completion_tokens: data.usage.outputTokens ?? 0,
   total_tokens: data.usage.totalTokens ?? 0,
-  ...(cacheReadTokens > 0 && {
+  ...((cacheReadTokens > 0 || cacheWriteTokens > 0) && {
     prompt_tokens_details: {
       cached_tokens: cacheReadTokens,
+      ...(cacheWriteTokens > 0 && {
+        cache_creation_tokens: cacheWriteTokens,
+      }),
     },
   }),
 },
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/gateway/src/chat/tools/transform-streaming-to-openai.ts` around lines 1221 - 1248, The streaming metadata branch builds transformedData but only emits prompt_tokens_details when cacheReadTokens > 0, which drops cache_creation_tokens for write-only cache events; update the logic in the eventType === "metadata" handling (around transformedData construction) to include prompt_tokens_details whenever cacheWriteTokens > 0 (and include cached_tokens or cache_creation_tokens as appropriate) or when either cacheReadTokens > 0 or cacheWriteTokens > 0, ensuring cacheWriteTokens is represented in prompt_tokens_details and that prompt_tokens still sums inputTokens + cacheReadTokens + cacheWriteTokens.

apps/gateway/src/chat/chat.ts (1)
8875-8912: ⚠️ Potential issue | 🟠 Major

Update the documented `/completions` response schema before emitting `cache_creation_tokens`.

`transformResponseToOpenai(...)` can now populate `usage.prompt_tokens_details.cache_creation_tokens`, but the 200 schema above still declares `prompt_tokens_details` as only `{ cached_tokens }`. OpenAPI docs and generated clients will stay out of sync with the actual response shape.

📘 Suggested schema update
```diff
 prompt_tokens_details: z
   .object({
     cached_tokens: z.number(),
+    cache_creation_tokens: z.number().optional(),
   })
   .optional(),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/gateway/src/chat/chat.ts` around lines 8875 - 8912, The OpenAPI response schema for the /completions endpoint is missing the new field usage.prompt_tokens_details.cache_creation_tokens that transformResponseToOpenai(...) can now populate; update the documented 200 response schema to add prompt_tokens_details.cache_creation_tokens (an integer/number, nullable if appropriate) alongside the existing cached_tokens entry so the emitted JSON shape matches what transformResponseToOpenai returns, and regenerate/update any client types or schema references that rely on that response definition.

apps/gateway/src/chat/tools/extract-token-usage.ts (1)
102-115: ⚠️ Potential issue | 🟠 Major

Don't coerce omitted cache counters to 0.

These branches turn "field omitted in this frame" into "provider reported zero". Because chat.ts overwrites the running values whenever `extractTokenUsage()` returns non-null on Lines 6571-6576, a later partial usage frame can erase an earlier non-zero `cacheCreationTokens`/`cachedTokens` value before the final response is built. Preserve `null` for absent cache fields and only rebuild the prompt-side counters when those input counters were actually present.

Also applies to: 118-133
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/gateway/src/chat/tools/extract-token-usage.ts` around lines 102 - 115, The aws-bedrock branch in extractTokenUsage currently coerces missing cache fields to 0 (inputTokens/cacheReadTokens/cacheWriteTokens), which can overwrite prior non-null running counters; instead, leave these as null when absent (e.g., set inputTokens/cacheReadTokens/cacheWriteTokens to null if undefined) and only compute promptTokens/prompt-side sums when the contributing values are actually present; preserve cachedTokens and cacheCreationTokens as null if their source fields are absent so later partial frames cannot zero-out earlier values; apply the same change to the other branch referenced (lines ~118-133) that handles similar cache fields.

apps/gateway/src/anthropic/anthropic.ts (1)
340-387: ⚠️ Potential issue | 🟠 Major

`cache_control` is still dropped on `tool_result` user turns.

The new preservation logic at Lines 397-425 never runs for user messages that hit the special-case branch at Lines 340-387. That branch still collapses the remaining text blocks into a plain string, so a payload like `[text(cache_control), tool_result]` loses its explicit cache marker before it reaches `/v1/chat/completions`.

Also applies to: 397-425
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/gateway/src/anthropic/anthropic.ts` around lines 340 - 387, The special-case branch that handles message.content with tool_result collapses remaining text blocks into a single string (the textContent construction and subsequent openaiMessages.push), which drops any cache_control markers; instead preserve text blocks (including cache_control) by collecting the text blocks as their original block objects rather than joining to a string and push a user message whose content is the array of blocks (or otherwise carry cache_control metadata) so the downstream preservation logic that expects block objects can run; update the code around toolResults, the textContent creation, and the openaiMessages.push for role:"user" to forward the blocks unchanged (referencing message.content, toolResults, combinedContent, and the openaiMessages pushes).
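A minimal sketch of the block-preserving shape this prompt describes. All names here are illustrative, not the actual anthropic.ts code.

```typescript
// Sketch: keep text blocks as objects so cache_control survives the inner
// /v1/chat/completions hop, instead of joining text into one string.
type TextBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};
type ToolResultBlock = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
};
type Block = TextBlock | ToolResultBlock;

function splitUserTurn(blocks: Block[]) {
  const toolResults = blocks.filter(
    (b): b is ToolResultBlock => b.type === "tool_result",
  );
  // Previously the text blocks were collapsed to a joined string here, which
  // dropped cache_control; forwarding the block objects preserves it.
  const userContent = blocks.filter((b): b is TextBlock => b.type === "text");
  return { toolResults, userContent };
}
```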
🧹 Nitpick comments (1)
apps/gateway/src/chat/tools/transform-streaming-to-openai.ts (1)
12-33: Give `normalizeAnthropicUsage` a concrete type.

This helper only reads a small, fixed usage shape, so `any` hides field-name drift in a pretty central transform path.

♻️ Proposed refactor

```diff
+type AnthropicUsage = {
+  input_tokens?: number;
+  cache_creation_input_tokens?: number;
+  cache_read_input_tokens?: number;
+  output_tokens?: number;
+};
+
-function normalizeAnthropicUsage(usage: any): any {
+function normalizeAnthropicUsage(usage: AnthropicUsage | null | undefined) {
```

As per coding guidelines, "Never use `any` or `as any` in TypeScript unless absolutely necessary".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/gateway/src/chat/tools/transform-streaming-to-openai.ts` around lines 12 - 33, The function normalizeAnthropicUsage currently accepts and returns any, which hides shape drift; define a concrete input type (e.g., interface AnthropicUsage { input_tokens?: number; cache_creation_input_tokens?: number; cache_read_input_tokens?: number; output_tokens?: number } ) and a concrete return type (e.g., NormalizedUsage | null with prompt_tokens, completion_tokens, total_tokens and optional prompt_tokens_details), change the signature to normalizeAnthropicUsage(usage: AnthropicUsage | null | undefined): NormalizedUsage | null, and update the implementation to use those typed fields (keeping the same logic for defaults and conditional prompt_tokens_details); add the new types in this file (or a nearby types file) and remove any use of any for this helper.
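Putting the nitpick together, a fully typed version of the helper could look like this. The summation rule (prompt tokens = input + cache read + cache write) is inferred from the review comments above; this is a sketch, not the shipped helper.

```typescript
// Sketch of a typed normalizeAnthropicUsage per the nitpick above.
type AnthropicUsage = {
  input_tokens?: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
  output_tokens?: number;
};

type NormalizedUsage = {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
  prompt_tokens_details?: {
    cached_tokens: number;
    cache_creation_tokens?: number;
  };
};

function normalizeAnthropicUsage(
  usage: AnthropicUsage | null | undefined,
): NormalizedUsage | null {
  if (!usage) {
    return null;
  }
  const cached = usage.cache_read_input_tokens ?? 0;
  const creation = usage.cache_creation_input_tokens ?? 0;
  const promptTokens = (usage.input_tokens ?? 0) + cached + creation;
  const completionTokens = usage.output_tokens ?? 0;
  return {
    prompt_tokens: promptTokens,
    completion_tokens: completionTokens,
    total_tokens: promptTokens + completionTokens,
    // Only attach details when there is cache activity to report.
    ...((cached > 0 || creation > 0) && {
      prompt_tokens_details: {
        cached_tokens: cached,
        ...(creation > 0 && { cache_creation_tokens: creation }),
      },
    }),
  };
}
```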
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: c688c9ec-cbe9-41b4-acf4-b6b541dd94b6
📒 Files selected for processing (10)
- apps/gateway/src/anthropic/anthropic.ts
- apps/gateway/src/chat-prompt-caching.e2e.ts
- apps/gateway/src/chat/chat.ts
- apps/gateway/src/chat/schemas/completions.ts
- apps/gateway/src/chat/tools/extract-token-usage.ts
- apps/gateway/src/chat/tools/parse-provider-response.ts
- apps/gateway/src/chat/tools/transform-response-to-openai.ts
- apps/gateway/src/chat/tools/transform-streaming-to-openai.ts
- apps/gateway/src/native-anthropic-cache.e2e.ts
- packages/actions/src/prepare-request-body.ts
```typescript
let usage: {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
} = {
  input_tokens: 0,
  output_tokens: 0,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 0,
};
```
Add the cache usage fields to message_start too.
This path now treats cache_creation_input_tokens and cache_read_input_tokens as always-present, but the message_start payload at Line 603 still emits usage: { input_tokens, output_tokens } only. Native streaming clients that inspect message_start.message.usage will still see undefined for the new fields.
Possible fix
```diff
 usage: {
   input_tokens: 0,
   output_tokens: 0,
+  cache_creation_input_tokens: 0,
+  cache_read_input_tokens: 0,
 },
```

Also applies to: 592-604, 739-758
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/gateway/src/anthropic/anthropic.ts` around lines 535 - 545, The
message_start payload is only emitting usage: { input_tokens, output_tokens }
while the code elsewhere (the local usage object in anthropic.ts) now includes
cache_creation_input_tokens and cache_read_input_tokens, causing native
streaming clients to see those fields as undefined; update all places that
construct or emit message_start.message.usage (including the blocks around the
existing usage declaration and the message_start emission sites referenced) to
include cache_creation_input_tokens and cache_read_input_tokens (populated from
the same usage object or initialized to 0) so the emitted usage object
consistently has { input_tokens, output_tokens, cache_creation_input_tokens,
cache_read_input_tokens } across the codepaths.
```diff
 ...((cachedTokens !== null ||
   (cacheCreationTokens !== null &&
     cacheCreationTokens > 0)) && {
   prompt_tokens_details: {
-    cached_tokens: cachedTokens,
+    cached_tokens: cachedTokens ?? 0,
+    ...(cacheCreationTokens !== null &&
+      cacheCreationTokens > 0 && {
+        cache_creation_tokens: cacheCreationTokens,
+      }),
```
The normal [DONE] path still drops cache token details.
This addition only affects the late !doneSent usage chunk. In the normal unbuffered flow, Lines 5829-5870 already emit the final usage payload and Lines 5912-5918 set doneSent = true, so Anthropic/Bedrock streams still finish without prompt_tokens_details in the common case. That also leaks into the forceStream JSON adapter, since it copies usage from the streamed chunks.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/gateway/src/chat/chat.ts` around lines 7110 - 7118, The final (normal
unbuffered) usage emission path still omits prompt_tokens_details, so update the
code that emits the final usage payload (the block that sets doneSent = true) to
include the same conditional spread used elsewhere: include
prompt_tokens_details when (cachedTokens !== null || (cacheCreationTokens !==
null && cacheCreationTokens > 0)) with cached_tokens: cachedTokens ?? 0 and
cache_creation_tokens when applicable; also ensure the forceStream JSON adapter
(which copies usage from streamed chunks) will receive/merge that
prompt_tokens_details by copying usage including prompt_tokens_details rather
than overwriting it. Reference prompt_tokens_details, cachedTokens,
cacheCreationTokens, doneSent and forceStream when making the changes.
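The conditional spread this prompt refers to can be sketched in isolation (variable names come from the review, not the verbatim chat.ts code).

```typescript
// Sketch: build the final usage payload, emitting prompt_tokens_details
// whenever either cache counter carries information. A null counter means the
// provider never reported the field at all.
type UsagePayload = {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens: number;
  prompt_tokens_details?: {
    cached_tokens: number;
    cache_creation_tokens?: number;
  };
};

function buildFinalUsage(
  promptTokens: number,
  completionTokens: number,
  cachedTokens: number | null,
  cacheCreationTokens: number | null,
): UsagePayload {
  return {
    prompt_tokens: promptTokens,
    completion_tokens: completionTokens,
    total_tokens: promptTokens + completionTokens,
    ...((cachedTokens !== null ||
      (cacheCreationTokens !== null && cacheCreationTokens > 0)) && {
      prompt_tokens_details: {
        cached_tokens: cachedTokens ?? 0,
        ...(cacheCreationTokens !== null &&
          cacheCreationTokens > 0 && {
            cache_creation_tokens: cacheCreationTokens,
          }),
      },
    }),
  };
}
```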
```typescript
function assertCacheDiscountApplied(usage: any) {
  const cachedTokens = usage?.prompt_tokens_details?.cached_tokens ?? 0;
  const promptTokens = usage?.prompt_tokens ?? 0;
  const uncachedTokens = promptTokens - cachedTokens;
  const inputCost = usage?.cost_usd_input;
  const cachedInputCost = usage?.cost_usd_cached_input;
  if (
    typeof inputCost !== "number" ||
    typeof cachedInputCost !== "number" ||
    cachedTokens === 0 ||
    uncachedTokens === 0
  ) {
    // Without both cached and uncached tokens we can't compare per-token
    // rates. Skip rather than fail — the test that primes the cache will
    // still verify cached_tokens > 0 separately.
    return;
  }
  const uncachedPerToken = inputCost / uncachedTokens;
  const cachedPerToken = cachedInputCost / cachedTokens;
  expect(
    cachedPerToken,
    `expected per-token cached cost (${cachedPerToken}) to be less than per-token uncached cost (${uncachedPerToken})`,
  ).toBeLessThan(uncachedPerToken);
}
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "assertCacheDiscountApplied helper:"
sed -n '65,88p' apps/gateway/src/native-anthropic-cache.e2e.ts
echo
echo "Native call site:"
sed -n '180,205p' apps/gateway/src/native-anthropic-cache.e2e.ts
echo
echo "Native /v1/messages response usage shape:"
sed -n '848,875p' apps/gateway/src/anthropic/anthropic.ts
```

Repository: theopenco/llmgateway
Length of output: 3161
🏁 Script executed:
```shell
# Find all calls to assertCacheDiscountApplied
rg "assertCacheDiscountApplied" apps/gateway/src/ -B 2 -A 2
echo "---"
echo "Check for type definitions related to usage shapes:"
rg "(prompt_tokens|input_tokens|cache_read|cache_creation)" apps/gateway/src/ -t ts -t tsx --iglob "*.d.ts" -l
```

Repository: theopenco/llmgateway
Length of output: 1663
🏁 Script executed:
```shell
#!/bin/bash
# Get line numbers for all calls and surrounding context
rg -n "assertCacheDiscountApplied" apps/gateway/src/native-anthropic-cache.e2e.ts
echo "---"
echo "Check the test contexts:"
sed -n '150,210p' apps/gateway/src/native-anthropic-cache.e2e.ts | head -40
echo "---"
sed -n '250,330p' apps/gateway/src/native-anthropic-cache.e2e.ts | head -40
```

Repository: theopenco/llmgateway
Length of output: 2714
🏁 Script executed:
```shell
#!/bin/bash
# Check the native response construction for cost fields
sed -n '840,920p' apps/gateway/src/anthropic/anthropic.ts
```

Repository: theopenco/llmgateway
Length of output: 1584
🏁 Script executed:
```shell
#!/bin/bash
# Check if native responses include cost fields
rg "cost_usd_input|cost_usd_cached_input" apps/gateway/src/ -t ts -B 2 -A 2
```

Repository: theopenco/llmgateway
Length of output: 15134
Remove the `any` type and split the cache-discount assertion by response shape.
The helper assertCacheDiscountApplied at line 65 expects OpenAI-compatible response fields (prompt_tokens, prompt_tokens_details.cached_tokens, cost_usd_input, cost_usd_cached_input), but at line 202 it's called with the native Anthropic /v1/messages response, which uses input_tokens, cache_creation_input_tokens, and cache_read_input_tokens instead. The native response also lacks the cost fields. Using any silently hides this mismatch—the native call hits the early return at line 79 (uncachedTokens === 0) and skips validation entirely.
Create separate helpers for native and OpenAI-compatible responses, or provide proper typed parameters instead of any.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/gateway/src/native-anthropic-cache.e2e.ts` around lines 65 - 88, The
current assertCacheDiscountApplied(usage: any) mixes OpenAI and Anthropic shapes
and uses any; split it into two typed helpers and update call sites: create
assertCacheDiscountAppliedOpenAI(usage: { prompt_tokens: number;
prompt_tokens_details?: { cached_tokens?: number }; cost_usd_input?: number;
cost_usd_cached_input?: number }) that preserves the existing per-token cost
assertion (use cached_tokens, prompt_tokens, cost_usd_input,
cost_usd_cached_input) and create assertCacheDiscountAppliedAnthropic(usage: {
input_tokens?: number; cache_creation_input_tokens?: number;
cache_read_input_tokens?: number }) that uses Anthropic fields (treat
cache_read_input_tokens as cachedTokens and derive uncachedTokens from
input_tokens and cache_read_input_tokens) and, since cost fields are absent,
only assert that cachedTokens > 0 and uncachedTokens > 0 (skip per-token cost
comparison); replace usages of assertCacheDiscountApplied to call the
appropriate new helper and remove the any type.
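A sketch of the two split helpers: the usage shapes are copied from the prompt above, and the test-framework assertions are replaced with plain boolean checks so the logic stands alone.

```typescript
// Sketch: separate typed checks for the OpenAI-compatible and the native
// Anthropic usage shapes, per the review above.
type OpenAICompatUsage = {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
  cost_usd_input?: number;
  cost_usd_cached_input?: number;
};

function cacheDiscountHoldsOpenAI(usage: OpenAICompatUsage): boolean {
  const cachedTokens = usage.prompt_tokens_details?.cached_tokens ?? 0;
  const uncachedTokens = usage.prompt_tokens - cachedTokens;
  const { cost_usd_input, cost_usd_cached_input } = usage;
  if (
    typeof cost_usd_input !== "number" ||
    typeof cost_usd_cached_input !== "number" ||
    cachedTokens === 0 ||
    uncachedTokens === 0
  ) {
    // Can't compare per-token rates; skip rather than fail.
    return true;
  }
  return cost_usd_cached_input / cachedTokens < cost_usd_input / uncachedTokens;
}

type NativeAnthropicUsage = {
  input_tokens?: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
};

function cacheHitObservedNative(usage: NativeAnthropicUsage): boolean {
  // Native responses carry no cost fields, so only require that both cached
  // and uncached tokens were reported.
  const cachedTokens = usage.cache_read_input_tokens ?? 0;
  const uncachedTokens = usage.input_tokens ?? 0;
  return cachedTokens > 0 && uncachedTokens > 0;
}
```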
```typescript
const longText = buildLongSystemPrompt();
const body = {
  model: "anthropic/claude-haiku-4-5",
  max_tokens: 50,
  system: [
    {
      type: "text" as const,
      text: longText,
      cache_control: { type: "ephemeral" as const },
    },
```
Use a below-threshold fixture for the explicit cache_control coverage.
Both of these tests still use buildLongSystemPrompt(), so the legacy length heuristic can make them pass even if cache_control is dropped somewhere in the path. Please switch the explicit-marker cases to a prompt that is intentionally shorter than the heuristic threshold so they prove the new plumbing rather than the fallback.
Also applies to: 476-487
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/gateway/src/native-anthropic-cache.e2e.ts` around lines 136 - 145,
Replace use of buildLongSystemPrompt() in the explicit cache_control test cases
with a below-threshold prompt so the tests exercise the explicit cache_control
plumbing rather than the legacy length heuristic; locate the spots where
longText = buildLongSystemPrompt() and the body.system entry includes
cache_control (the explicit-marker cases) and change to a short prompt (e.g.,
buildShortSystemPrompt() or a hardcoded short string) that is deliberately
shorter than the heuristic threshold, and make the same change at the second
occurrence referenced around the other block (the lines near 476-487) to ensure
both explicit cache_control tests use the short fixture.
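A below-threshold fixture for the explicit-marker tests might look like this. `buildShortSystemPrompt` is hypothetical (it does not exist in the repo), and the text is deliberately far shorter than the length heuristic referenced above, so a passing test would exercise the explicit `cache_control` plumbing rather than the fallback.

```typescript
// Hypothetical short fixture, intentionally below the length heuristic.
function buildShortSystemPrompt(): string {
  return "You are a terse assistant. Answer in one sentence.";
}

const body = {
  model: "anthropic/claude-haiku-4-5",
  max_tokens: 50,
  system: [
    {
      type: "text" as const,
      text: buildShortSystemPrompt(),
      cache_control: { type: "ephemeral" as const },
    },
  ],
};
```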
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/gateway/src/anthropic/anthropic.ts (1)
330-339: ⚠️ Potential issue | 🟠 Major

Mixed tool messages still drop text-block cache markers.

These branches always collapse the remaining text blocks into a plain string, so any explicit `cache_control` attached to text alongside `tool_use` or `tool_result` is discarded before the inner `/v1/chat/completions` hop. Reusing the same block-preserving logic as Lines 395-428 here would close that gap.

Also applies to: 380-389
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/gateway/src/anthropic/anthropic.ts` around lines 330 - 339, The current branch collapses mixed message.content into a plain string (via textContent) and loses any cache_control markers; update the openaiMessages push for mixed tool/text messages so it preserves block objects (including cache_control) instead of joining to a string—replace the textContent construction and the content: textContent || "" assignment in the openaiMessages.push with the same block-preserving logic used elsewhere in this file (the logic that maps message.content to an array of blocks preserving type, text, and cache_control), ensuring tool_calls is still attached; apply the same fix to the other similar branch referenced in the comment.
♻️ Duplicate comments (1)
apps/gateway/src/anthropic/anthropic.ts (1)
595-607: ⚠️ Potential issue | 🟠 Major

`message_start` still omits the new cache usage fields. Line 606 still emits only `{ input_tokens, output_tokens }`, so native streaming clients inspecting `message_start.message.usage` will not see `cache_creation_input_tokens` or `cache_read_input_tokens` until the later delta.

💡 Possible fix

```diff
 usage: {
   input_tokens: 0,
   output_tokens: 0,
+  cache_creation_input_tokens: 0,
+  cache_read_input_tokens: 0,
 },
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/gateway/src/anthropic/anthropic.ts` around lines 595 - 607, The message_start event emitted in the stream.writeSSE call for the assistant message (the block constructing message with id = messageId, role = "assistant", model = model) currently sets usage to only { input_tokens, output_tokens }; update that usage object in the message_start payload to include cache_creation_input_tokens and cache_read_input_tokens (initialize them to 0 like the other token counters) so native streaming clients see the full usage shape immediately. Locate the stream.writeSSE invocation that builds the "message_start" payload and add the two cache usage fields to message.message.usage.
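For reference, a sketch of a `message_start` payload carrying the full usage shape the review asks for; the event structure follows Anthropic's streaming API, while `messageId` and `model` are stand-in values:

```typescript
// All counters start at zero, including the two cache fields, so native
// streaming clients see the full usage shape from the first event onward.
const messageId = "msg_local_example";
const model = "claude-haiku-4-5";

const messageStart = {
  type: "message_start" as const,
  message: {
    id: messageId,
    type: "message" as const,
    role: "assistant" as const,
    model,
    content: [],
    stop_reason: null,
    stop_sequence: null,
    usage: {
      input_tokens: 0,
      output_tokens: 0,
      cache_creation_input_tokens: 0,
      cache_read_input_tokens: 0,
    },
  },
};

console.log(JSON.stringify(messageStart.message.usage));
```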
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@apps/gateway/src/anthropic/anthropic.ts`:
- Around line 330-339: The current branch collapses mixed message.content into a
plain string (via textContent) and loses any cache_control markers; update the
openaiMessages push for mixed tool/text messages so it preserves block objects
(including cache_control) instead of joining to a string—replace the textContent
construction and the content: textContent || "" assignment in the
openaiMessages.push with the same block-preserving logic used elsewhere in this
file (the logic that maps message.content to an array of blocks preserving type,
text, and cache_control), ensuring tool_calls is still attached; apply the same
fix to the other similar branch referenced in the comment.
---
Duplicate comments:
In `@apps/gateway/src/anthropic/anthropic.ts`:
- Around line 595-607: The message_start event emitted in the stream.writeSSE
call for the assistant message (the block constructing message with id =
messageId, role = "assistant", model = model) currently sets usage to only {
input_tokens, output_tokens }; update that usage object in the message_start
payload to include cache_creation_input_tokens and cache_read_input_tokens
(initialize them to 0 like the other token counters) so native streaming clients
see the full usage shape immediately. Locate the stream.writeSSE invocation that
builds the "message_start" payload and add the two cache usage fields to
message.message.usage.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: b5186f0b-9416-417f-86eb-a1edba018268
📒 Files selected for processing (1)
apps/gateway/src/anthropic/anthropic.ts
Place a cache_control / cachePoint marker on the last content block of
the message just before the final user turn. This caches the entire
conversation prefix (all prior turns) instead of only caching individual
text blocks that exceed a length threshold.
Before: only the system prompt was cached (~16k tokens), and the
conversation history (~100k+ tokens) was reprocessed on every request.
After: the entire prefix up to the previous turn is cached, so only the
new user message and the model's response are uncached. This
dramatically improves the cache hit ratio for long multi-turn
conversations (e.g. Claude Code sessions).
Applied to both Anthropic (cache_control: {type: "ephemeral"}) and AWS
Bedrock (cachePoint: {type: "default"}) paths, respecting the existing
4-block limit.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
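The placement rule described above can be sketched under a simplified message shape; `markTurnBoundary` is a hypothetical helper, not the PR's actual code:

```typescript
// Mark the last content block of the message just before the final user turn,
// so the whole conversation prefix (all prior turns) becomes one cache entry.
// Respects the remaining marker budget (Anthropic allows at most 4 markers).
type Block = { type: "text"; text: string; cache_control?: { type: "ephemeral" } };
type Msg = { role: "user" | "assistant"; content: Block[] };

function markTurnBoundary(messages: Msg[], usedMarkers: number, maxMarkers = 4): Msg[] {
  // Need at least one message before the final user turn, and budget left.
  if (messages.length < 2 || usedMarkers >= maxMarkers) return messages;
  const boundary = messages.length - 2; // message preceding the final user turn
  const blocks = messages[boundary].content;
  if (blocks.length === 0) return messages;
  blocks[blocks.length - 1].cache_control = { type: "ephemeral" };
  return messages;
}

const convo: Msg[] = [
  { role: "user", content: [{ type: "text", text: "first question" }] },
  { role: "assistant", content: [{ type: "text", text: "first answer" }] },
  { role: "user", content: [{ type: "text", text: "new question" }] },
];
markTurnBoundary(convo, 1);
```

On the Bedrock path the same boundary would carry a `cachePoint: { type: "default" }` block instead of `cache_control`.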
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/actions/src/transform-anthropic-messages.ts (1)
124-139: ⚠️ Potential issue | 🟠 Major

Count preserved `cache_control` blocks before adding new ones. Lines 124-139 return caller-supplied `cache_control` text parts unchanged, but they never increment `cacheControlCount`. That lets Lines 296-330 append a turn-boundary marker even when four explicit markers are already present, which can push Anthropic requests past the 4-block cap.

Proposed fix

```diff
-if (isTextContent(part) && part.text && !part.cache_control) {
-  // Automatically add cache_control for long text blocks
-  const shouldCache =
-    shouldApplyCacheControl &&
-    part.text.length >= minCacheableChars &&
-    cacheControlCount < maxCacheControlBlocks;
-  if (shouldCache) {
-    cacheControlCount++;
-    return {
-      ...part,
-      cache_control: { type: "ephemeral" },
-    };
-  }
+if (isTextContent(part) && part.text) {
+  if (part.cache_control) {
+    if (cacheControlCount < maxCacheControlBlocks) {
+      cacheControlCount++;
+      return part;
+    }
+    const { cache_control: _ignored, ...rest } = part;
+    return rest;
+  }
+  // Automatically add cache_control for long text blocks
+  const shouldCache =
+    shouldApplyCacheControl &&
+    part.text.length >= minCacheableChars &&
+    cacheControlCount < maxCacheControlBlocks;
+  if (shouldCache) {
+    cacheControlCount++;
+    return {
+      ...part,
+      cache_control: { type: "ephemeral" },
+    };
+  }
 }
```

Also applies to: 296-330
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/actions/src/transform-anthropic-messages.ts` around lines 124 - 139, The code that iterates message parts (using cacheControlCount, shouldApplyCacheControl, minCacheableChars, maxCacheControlBlocks) currently skips incrementing cacheControlCount when a caller-supplied part already has a cache_control field, allowing the code later that appends a turn-boundary marker to exceed Anthropic's 4-block cap; modify the parts-mapping logic to detect existing part.cache_control and increment cacheControlCount when present (and similarly update the analogous logic around the turn-boundary appending code that uses the same variables) so both caller-provided and auto-added cache_control blocks are counted before adding any additional markers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/actions/src/prepare-request-body.ts`:
- Around line 1170-1249: The current logic flips a global callerSetCacheControl
mode which disables heuristic caching for all system blocks if any block
contains an explicit cache_control; instead, iterate per system message and for
each sysMsg detect whether that specific sysMsg (via
Array.isArray(sysMsg.content) && sysMsg.content.some(c => isTextContent(c) &&
!!c.cache_control)) carries explicit cache_control; if a sysMsg has explicit
markers, preserve those explicit cache_control entries verbatim (respecting
systemCacheControlCount and maxCacheControlBlocks), otherwise treat that single
sysMsg with the legacy concatenation+heuristic path (using minCacheableChars and
incrementing systemCacheControlCount when you apply an ephemeral cache_control),
updating systemContent accordingly; adjust/remove the global
callerSetCacheControl branch and use per-message logic around systemMessages,
systemContent, systemCacheControlCount, maxCacheControlBlocks,
minCacheableChars, isTextContent and cache_control.
---
Outside diff comments:
In `@packages/actions/src/transform-anthropic-messages.ts`:
- Around line 124-139: The code that iterates message parts (using
cacheControlCount, shouldApplyCacheControl, minCacheableChars,
maxCacheControlBlocks) currently skips incrementing cacheControlCount when a
caller-supplied part already has a cache_control field, allowing the code later
that appends a turn-boundary marker to exceed Anthropic's 4-block cap; modify
the parts-mapping logic to detect existing part.cache_control and increment
cacheControlCount when present (and similarly update the analogous logic around
the turn-boundary appending code that uses the same variables) so both
caller-provided and auto-added cache_control blocks are counted before adding
any additional markers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f54427df-d6a3-478a-8042-6ebfea4ef69b
📒 Files selected for processing (3)
packages/actions/src/prepare-request-body.spec.ts
packages/actions/src/prepare-request-body.ts
packages/actions/src/transform-anthropic-messages.ts
```typescript
// Detect whether any text block in the incoming system messages has
// a caller-supplied cache_control marker. If so, we preserve the
// per-block structure so we can forward markers verbatim. Otherwise
// we fall back to the legacy behavior of concatenating each system
// message's text into a single block (and applying the length-based
// heuristic per concatenated block).
const callerSetCacheControl = systemMessages.some((sysMsg) => {
  if (!Array.isArray(sysMsg.content)) {
    return false;
  }
  return sysMsg.content.some(
    (c) => isTextContent(c) && !!c.cache_control,
  );
});

if (callerSetCacheControl) {
  for (const sysMsg of systemMessages) {
    if (typeof sysMsg.content === "string") {
      if (!sysMsg.content.trim()) {
        continue;
      }
      systemContent.push({ type: "text", text: sysMsg.content });
    } else if (Array.isArray(sysMsg.content)) {
      for (const part of sysMsg.content) {
        if (!isTextContent(part) || !part.text || !part.text.trim()) {
          continue;
        }
        const explicit = part.cache_control;
        if (explicit) {
          if (systemCacheControlCount < maxCacheControlBlocks) {
            systemCacheControlCount++;
            systemContent.push({
              type: "text",
              text: part.text,
              cache_control: explicit,
            });
          } else {
            systemContent.push({ type: "text", text: part.text });
          }
        } else {
          systemContent.push({ type: "text", text: part.text });
        }
      }
    }
  }
} else {
  for (const sysMsg of systemMessages) {
    let text: string;
    if (typeof sysMsg.content === "string") {
      text = sysMsg.content;
    } else if (Array.isArray(sysMsg.content)) {
      // Concatenate text from array content (legacy behavior).
      text = sysMsg.content
        .filter((c) => c.type === "text" && "text" in c)
        .map((c) => (c as { type: "text"; text: string }).text)
        .join("");
    } else {
      continue;
    }

    if (!text || text.trim() === "") {
      continue;
    }

    const shouldCache =
      text.length >= minCacheableChars &&
      systemCacheControlCount < maxCacheControlBlocks;

    if (shouldCache) {
      systemCacheControlCount++;
      systemContent.push({
        type: "text",
        text,
        cache_control: { type: "ephemeral" },
      });
    } else {
      systemContent.push({ type: "text", text });
    }
  }
}
```
Don't disable heuristic caching for every system block once one explicit marker appears.
Both branches switch into an all-or-nothing mode via callerSetCacheControl / callerSetBedrockCacheControl. With mixed inputs, an unmarked long system block stops getting heuristic caching just because some other system block had explicit cache_control. That breaks the documented “preserve explicit markers, fall back when absent” behavior.
Also applies to: 1436-1454
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/actions/src/prepare-request-body.ts` around lines 1170 - 1249, The
current logic flips a global callerSetCacheControl mode which disables heuristic
caching for all system blocks if any block contains an explicit cache_control;
instead, iterate per system message and for each sysMsg detect whether that
specific sysMsg (via Array.isArray(sysMsg.content) && sysMsg.content.some(c =>
isTextContent(c) && !!c.cache_control)) carries explicit cache_control; if a
sysMsg has explicit markers, preserve those explicit cache_control entries
verbatim (respecting systemCacheControlCount and maxCacheControlBlocks),
otherwise treat that single sysMsg with the legacy concatenation+heuristic path
(using minCacheableChars and incrementing systemCacheControlCount when you apply
an ephemeral cache_control), updating systemContent accordingly; adjust/remove
the global callerSetCacheControl branch and use per-message logic around
systemMessages, systemContent, systemCacheControlCount, maxCacheControlBlocks,
minCacheableChars, isTextContent and cache_control.
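A minimal sketch of the per-message policy the reviewer asks for, assuming simplified types; names like `minCacheableChars` echo identifiers quoted above, but the implementation is illustrative, not the repo's code:

```typescript
// Decide explicit-vs-heuristic per system message instead of flipping one
// global mode: a message with explicit markers keeps them verbatim, while an
// unmarked message still gets the legacy concatenate-then-length-check path.
type TextPart = { type: "text"; text: string; cache_control?: { type: "ephemeral" } };
type SysMsg = { content: string | TextPart[] };

const minCacheableChars = 20; // illustrative threshold

function hasExplicitMarker(msg: SysMsg): boolean {
  return Array.isArray(msg.content) && msg.content.some((c) => !!c.cache_control);
}

function buildSystemContent(messages: SysMsg[], maxMarkers = 4): TextPart[] {
  const out: TextPart[] = [];
  let markers = 0;
  for (const msg of messages) {
    if (hasExplicitMarker(msg)) {
      // Preserve this message's explicit markers verbatim (within budget).
      for (const part of msg.content as TextPart[]) {
        if (!part.text.trim()) continue;
        if (part.cache_control && markers < maxMarkers) {
          markers++;
          out.push(part);
        } else {
          out.push({ type: "text", text: part.text });
        }
      }
    } else {
      // Legacy path for this message only: concatenate, then apply the heuristic.
      const text =
        typeof msg.content === "string"
          ? msg.content
          : msg.content.map((c) => c.text).join("");
      if (!text.trim()) continue;
      if (text.length >= minCacheableChars && markers < maxMarkers) {
        markers++;
        out.push({ type: "text", text, cache_control: { type: "ephemeral" } });
      } else {
        out.push({ type: "text", text });
      }
    }
  }
  return out;
}
```

With this shape, a long unmarked system block keeps its heuristic caching even when a sibling block carries an explicit marker.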
Summary
Addresses Luca's ask: "we need to make sure the requests make use of provider token caching and return the cache tokens, which is sometimes the cache but in general it doesn't really work well at all" — for Anthropic and AWS Bedrock.
What changed

Inbound (`cache_control` pass-through)

- New optional `cache_control: { type: "ephemeral" }` field on text content parts in completions.ts lets clients opt into Anthropic prompt caching from the OpenAI-compat endpoint
- Forwards caller-supplied `cache_control` on system + user text blocks for Anthropic, and maps them to `cachePoint` blocks for AWS Bedrock; falls back to the existing length-based heuristic when no marker is provided
- The native /v1/messages endpoint preserves `cache_control` from inbound system + messages and forwards them through the inner chat completions path
- Strips `cache_control` from text parts when routing to non-Anthropic / non-Bedrock providers so OpenAI / Google / etc. don't receive an unknown field

Outbound (cache token surfacing)
- Tracks `cacheCreationTokens` for both Anthropic and Bedrock (non-streaming + streaming)
- Surfaces `cache_creation_tokens` alongside `cached_tokens` in `prompt_tokens_details` on the openai-compat response
- Adds a `normalizeAnthropicUsage` helper so streamed Anthropic chunks expose cache token usage in OpenAI shape
- The native /v1/messages path emits `cache_creation_input_tokens` + `cache_read_input_tokens` (set to 0 when inapplicable, matching Anthropic's actual API), and converts back from the inner chat completions response shape on both streaming and non-streaming paths

Test coverage (8 new e2e tests in native-anthropic-cache.e2e.ts)
- native /v1/messages with explicit `cache_control` (Anthropic)
- openai-compat for both Anthropic and Bedrock, non-streaming and streaming
- explicit `cache_control` on /v1/chat/completions (the new schema field)

Each cache-read assertion uses retry-with-backoff because Anthropic prompt cache writes are eventually consistent and back-to-back requests can occasionally miss. Also extends chat-prompt-caching.e2e.ts with the same retry pattern.
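The retry-with-backoff pattern those assertions rely on can be sketched generically (an illustrative helper, not the repo's actual test utility):

```typescript
// Re-run an async assertion until it passes or attempts run out, doubling the
// delay between tries so an eventually-consistent cache write has time to land.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  attempts = 5,
  initialDelayMs = 250,
): Promise<T> {
  let delay = initialDelayMs;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err; // out of attempts: surface the failure
      await new Promise((r) => setTimeout(r, delay));
      delay *= 2; // exponential backoff between cache-read attempts
    }
  }
  throw new Error("unreachable");
}
```

In a test, the cached-token assertion itself would be the `fn`, so a transient cache miss is retried with growing delays instead of failing the run outright.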
Test plan

- e2e tests pass against `anthropic/claude-haiku-4-5` and `aws-bedrock/claude-haiku-4-5`
- Unit tests in `packages/actions` pass
- `chat-prompt-caching.e2e.ts` (gated on `TEST_CACHE_MODE=true`) passes for both Anthropic and Bedrock
- A request with `cache_control` sent to OpenAI returns 200 (field is stripped, not forwarded)

🤖 Generated with Claude Code