fix: surface anthropic + bedrock prompt cache tokens #1992

Open
rcogal wants to merge 3 commits into main from fix/anthropic-bedrock-prompt-caching

Conversation


@rcogal rcogal commented Apr 8, 2026

Summary

Addresses Luca's ask: "we need to make sure the requests make use of provider token caching and return the cache tokens, which is sometimes the cache but in general it doesn't really work well at all" — for Anthropic and AWS Bedrock.

What changed

Inbound (cache_control pass-through)

  • New optional cache_control: { type: "ephemeral" } field on text content parts in completions.ts lets clients opt into Anthropic prompt caching from the OpenAI-compat endpoint
  • prepare-request-body.ts preserves caller-supplied cache_control on system + user text blocks for Anthropic, and maps them to cachePoint blocks for AWS Bedrock. Falls back to the existing length-based heuristic when no marker is provided
  • Native /v1/messages preserves per-block cache_control from inbound system + messages and forwards them through the inner chat completions path
  • Strips cache_control from text parts when routing to non-Anthropic / non-Bedrock providers so OpenAI / Google / etc. don't receive an unknown field
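A request exercising the new field might look like this sketch (model name from the test plan below; the payload is otherwise illustrative, not taken from the PR's tests):

```typescript
// Hypothetical client payload for the OpenAI-compatible /v1/chat/completions
// endpoint. The cache_control field on the text part is the new optional
// schema addition; only { type: "ephemeral" } is supported.
const body = {
	model: "anthropic/claude-haiku-4-5",
	messages: [
		{
			role: "system",
			content: [
				{
					type: "text",
					text: "Long shared policy text that is worth caching...",
					cache_control: { type: "ephemeral" },
				},
			],
		},
		{ role: "user", content: "Summarize the refund policy." },
	],
};
```

For Anthropic this marker is forwarded verbatim; for Bedrock it is mapped to a cachePoint block; for other providers it is stripped.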

Outbound (cache token surfacing)

  • parse-provider-response.ts and extract-token-usage.ts now extract cacheCreationTokens for both Anthropic and Bedrock (non-streaming + streaming)
  • transform-response-to-openai.ts and chat.ts surface cache_creation_tokens alongside cached_tokens in prompt_tokens_details on the openai-compat response
  • transform-streaming-to-openai.ts gets a new normalizeAnthropicUsage helper so streamed Anthropic chunks expose cache token usage in OpenAI shape
  • Native /v1/messages always emits cache_creation_input_tokens + cache_read_input_tokens (set to 0 when inapplicable, matching Anthropic's actual API), and converts back from the inner chat completions response shape on both streaming and non-streaming paths
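Putting the outbound changes together, the two usage shapes would look roughly like this (token counts invented for illustration):

```typescript
// Illustrative openai-compat usage after a cache-priming request:
// cache_creation_tokens sits alongside cached_tokens in the details object.
const openaiCompatUsage = {
	prompt_tokens: 2150,
	completion_tokens: 42,
	total_tokens: 2192,
	prompt_tokens_details: {
		cached_tokens: 0, // cache reads (none on the priming request)
		cache_creation_tokens: 2048, // tokens written to the prompt cache
	},
};

// Illustrative native /v1/messages usage: both cache counters are always
// present, set to 0 when inapplicable, matching Anthropic's actual API.
const nativeMessagesUsage = {
	input_tokens: 102,
	output_tokens: 42,
	cache_creation_input_tokens: 2048,
	cache_read_input_tokens: 0,
};
```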

Test coverage (8 new e2e tests in native-anthropic-cache.e2e.ts)

  • Native /v1/messages with explicit cache_control (Anthropic)
  • openai-compat /v1/chat/completions with long system prompt (Anthropic, length-heuristic path)
  • openai-compat /v1/chat/completions with long system prompt (Bedrock)
  • Streaming /v1/chat/completions (Anthropic + Bedrock)
  • Explicit cache_control on /v1/chat/completions (the new schema field)
  • Cost discount assertion: per-token cached cost < per-token uncached cost within a single response (validates the gateway bills cached reads at Anthropic's discounted rate, roughly 10% of the normal input price)

Each cache-read assertion uses retry-with-backoff because Anthropic prompt cache writes are eventually consistent and back-to-back requests can occasionally miss. Also extends chat-prompt-caching.e2e.ts with the same retry pattern.
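The retry pattern can be sketched as follows (helper name, attempt count, and backoff parameters are illustrative, not the PR's actual test code):

```typescript
// Hypothetical helper: retry an eventually-consistent assertion with
// exponential backoff (delays of base, 2*base, 4*base, ...).
async function retryWithBackoff<T>(
	fn: () => Promise<T>,
	attempts = 4,
	baseDelayMs = 500,
): Promise<T> {
	let lastErr: unknown;
	for (let i = 0; i < attempts; i++) {
		try {
			return await fn();
		} catch (err) {
			lastErr = err;
			if (i < attempts - 1) {
				await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
			}
		}
	}
	throw lastErr;
}

// Usage in a test: keep re-requesting until the cache write has landed.
// await retryWithBackoff(async () => {
//   const res = await sendCachedRequest();
//   expect(res.usage.prompt_tokens_details.cached_tokens).toBeGreaterThan(0);
// });
```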

Test plan

  • All 8 new e2e tests pass against anthropic/claude-haiku-4-5 and aws-bedrock/claude-haiku-4-5
  • All 55 unit tests in packages/actions pass
  • Existing chat-prompt-caching.e2e.ts (gated on TEST_CACHE_MODE=true) passes for both Anthropic and Bedrock
  • Manual smoke: cache_control sent to OpenAI returns 200 (field is stripped, not forwarded)
  • Cost discount assertion confirms cached tokens are billed at the discounted rate (~10% of normal input price)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Preserve per-block cache control for text content and forward explicit cache markers to supported providers; schema allows ephemeral cache_control on text blocks.
    • Add turn-boundary cache hinting and Bedrock cachePoint insertion to improve cache priming.
  • Bug Fixes / Improvements

    • Emit richer cache-related usage (cached and cache-creation token breakdowns) and compute input tokens excluding cached tokens.
    • Strip unsupported cache markers when sending to non-native providers.
  • Tests

    • Added e2e suites and retry logic validating prompt caching across endpoints and streaming.

Anthropic and Bedrock prompt caching pass-through was unreliable: client-
supplied cache_control markers were partially handled, cache token usage
was inconsistently surfaced through the openai-compat and native /v1/messages
paths, and there was no e2e coverage for streaming or bedrock.

This change:

- Honors caller-supplied cache_control on text content parts in
  /v1/chat/completions (new optional schema field) and forwards them
  verbatim to Anthropic, mapping to cachePoint blocks for Bedrock. Falls
  back to the existing length-based heuristic when no marker is provided.
- Preserves cache_control on system + message text blocks coming through
  the native /v1/messages endpoint, and surfaces cache_creation_input_tokens
  / cache_read_input_tokens on responses (always emitted, set to 0 when
  inapplicable, matching Anthropic's actual API).
- Surfaces cache_creation_tokens alongside cached_tokens in
  prompt_tokens_details on the openai-compat response, including streaming
  chunks via a new normalizeAnthropicUsage helper.
- Strips cache_control from text parts when routing to non-Anthropic /
  non-Bedrock providers so OpenAI/Google/etc. don't receive an unknown
  field.
- Adds end-to-end tests covering: native /v1/messages with explicit
  cache_control, openai-compat for both Anthropic and Bedrock, streaming
  for both, and explicit cache_control on /v1/chat/completions. Each
  asserts cached_tokens > 0 after a retry-with-backoff (Anthropic prompt
  cache writes are eventually consistent), and where applicable asserts
  the per-token cached cost is strictly less than the per-token uncached
  cost within the same response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai bot commented Apr 8, 2026

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): coverage is 75.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): the title 'fix: surface anthropic + bedrock prompt cache tokens' accurately and concisely summarizes the PR's main objective of implementing and surfacing prompt cache token support for Anthropic and AWS Bedrock.


Make cache_creation_input_tokens and cache_read_input_tokens optional
with a default of 0 in anthropicResponseSchema. Anthropic emits these
on caching-supported models today, but a non-optional schema would fail
validation if an older Claude model, a beta endpoint, or a future API
change ever omits them — turning a graceful "no caching info" into a 500.

The downstream conversion code already handles 0 correctly.
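The defaulting behavior is equivalent to this plain-TypeScript sketch (the real change uses zod's .optional().default(0) in anthropicResponseSchema; names here only mirror the commit message):

```typescript
type AnthropicUsage = {
	input_tokens: number;
	output_tokens: number;
	cache_creation_input_tokens?: number;
	cache_read_input_tokens?: number;
};

// Missing cache counters become 0 instead of a validation failure,
// mirroring z.number().optional().default(0) in the zod schema.
function withCacheDefaults(u: AnthropicUsage): Required<AnthropicUsage> {
	return {
		...u,
		cache_creation_input_tokens: u.cache_creation_input_tokens ?? 0,
		cache_read_input_tokens: u.cache_read_input_tokens ?? 0,
	};
}
```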

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
packages/actions/src/prepare-request-body.ts (1)

1412-1460: ⚠️ Potential issue | 🟠 Major

The Bedrock fallback path no longer preserves legacy system-message concatenation.

When no explicit cache_control is present, this still collects array-based system content one text part at a time. That changes the heuristic from “cache the whole system message if its combined text is long enough” to “cache each part independently”, so a long multipart system prompt can now miss cachePoint entirely. Preserve per-part handling only in the explicit-marker path; otherwise concatenate each system message’s text first. Using isTextContent here would also remove the as any[] escape hatch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/actions/src/prepare-request-body.ts` around lines 1412 - 1460, The
current loop over bedrockSystemMessages pushes each array part as its own block
which breaks the legacy “concatenate whole system message” heuristic; update the
handling in the collectedBedrockBlocks build so that when sysMsg.content is an
array you first inspect parts using isTextContent and detect whether any part
has an explicit cache_control: if any part has cache_control, push each text
part separately with hasExplicitCacheControl set appropriately (preserving
per-part markers), otherwise concatenate all text parts into one string and push
a single block with hasExplicitCacheControl=false; keep the rest of the logic
(systemContent, bedrockCacheControlCount, bedrockMaxCacheControlBlocks,
bedrockMinCacheableChars) unchanged.
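The suggested branch can be sketched as follows (function and flag names echo the review comment; heavily simplified from the real prepare-request-body logic):

```typescript
type TextPart = {
	type: "text";
	text: string;
	cache_control?: { type: "ephemeral" };
};

// Sketch of the suggested fix: per-part blocks only when an explicit
// cache_control marker is present; otherwise fall back to the legacy
// "concatenate the whole system message" heuristic so a long multipart
// prompt can still cross the length threshold as a single block.
function collectSystemBlocks(parts: TextPart[]) {
	const hasExplicitMarker = parts.some((p) => p.cache_control !== undefined);
	if (hasExplicitMarker) {
		return parts.map((p) => ({
			text: p.text,
			hasExplicitCacheControl: p.cache_control !== undefined,
		}));
	}
	return [
		{
			text: parts.map((p) => p.text).join("\n"),
			hasExplicitCacheControl: false,
		},
	];
}
```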
apps/gateway/src/chat/tools/transform-streaming-to-openai.ts (1)

1221-1248: ⚠️ Potential issue | 🟠 Major

Bedrock streaming drops cache_creation_tokens on cache writes.

cacheWriteTokens is included in prompt_tokens, but prompt_tokens_details is only emitted when cacheReadTokens > 0. A write-only cache hit in streaming mode therefore loses the new metric even though the non-streaming parser now returns it.

🐛 Proposed fix
 					usage: {
 						prompt_tokens: promptTokens,
 						completion_tokens: data.usage.outputTokens ?? 0,
 						total_tokens: data.usage.totalTokens ?? 0,
-						...(cacheReadTokens > 0 && {
+						...((cacheReadTokens > 0 || cacheWriteTokens > 0) && {
 							prompt_tokens_details: {
 								cached_tokens: cacheReadTokens,
+								...(cacheWriteTokens > 0 && {
+									cache_creation_tokens: cacheWriteTokens,
+								}),
 							},
 						}),
 					},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/chat/tools/transform-streaming-to-openai.ts` around lines
1221 - 1248, The streaming metadata branch builds transformedData but only emits
prompt_tokens_details when cacheReadTokens > 0, which drops
cache_creation_tokens for write-only cache events; update the logic in the
eventType === "metadata" handling (around transformedData construction) to
include prompt_tokens_details whenever cacheWriteTokens > 0 (and include
cached_tokens or cache_creation_tokens as appropriate) or when either
cacheReadTokens > 0 or cacheWriteTokens > 0, ensuring cacheWriteTokens is
represented in prompt_tokens_details and that prompt_tokens still sums
inputTokens + cacheReadTokens + cacheWriteTokens.
apps/gateway/src/chat/chat.ts (1)

8875-8912: ⚠️ Potential issue | 🟠 Major

Update the documented /completions response schema before emitting cache_creation_tokens.

transformResponseToOpenai(...) can now populate usage.prompt_tokens_details.cache_creation_tokens, but the 200 schema above still declares prompt_tokens_details as only { cached_tokens }. OpenAPI docs and generated clients will stay out of sync with the actual response shape.

📘 Suggested schema update
 							prompt_tokens_details: z
 								.object({
 									cached_tokens: z.number(),
+									cache_creation_tokens: z.number().optional(),
 								})
 								.optional(),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/chat/chat.ts` around lines 8875 - 8912, The OpenAPI response
schema for the /completions endpoint is missing the new field
usage.prompt_tokens_details.cache_creation_tokens that
transformResponseToOpenai(...) can now populate; update the documented 200
response schema to add prompt_tokens_details.cache_creation_tokens (an
integer/number, nullable if appropriate) alongside the existing cached_tokens
entry so the emitted JSON shape matches what transformResponseToOpenai returns,
and regenerate/update any client types or schema references that rely on that
response definition.
apps/gateway/src/chat/tools/extract-token-usage.ts (1)

102-115: ⚠️ Potential issue | 🟠 Major

Don’t coerce omitted cache counters to 0.

These branches turn “field omitted in this frame” into “provider reported zero”. Because chat.ts overwrites the running values whenever extractTokenUsage() returns non-null on Lines 6571-6576, a later partial usage frame can erase an earlier non-zero cacheCreationTokens/cachedTokens value before the final response is built. Preserve null for absent cache fields and only rebuild the prompt-side counters when those input counters were actually present.

Also applies to: 118-133

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/chat/tools/extract-token-usage.ts` around lines 102 - 115,
The aws-bedrock branch in extractTokenUsage currently coerces missing cache
fields to 0 (inputTokens/cacheReadTokens/cacheWriteTokens), which can overwrite
prior non-null running counters; instead, leave these as null when absent (e.g.,
set inputTokens/cacheReadTokens/cacheWriteTokens to null if undefined) and only
compute promptTokens/prompt-side sums when the contributing values are actually
present; preserve cachedTokens and cacheCreationTokens as null if their source
fields are absent so later partial frames cannot zero-out earlier values; apply
the same change to the other branch referenced (lines ~118-133) that handles
similar cache fields.
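The null-preserving merge this comment asks for amounts to something like the following sketch (types simplified; not the actual extract-token-usage code):

```typescript
type RunningUsage = {
	cachedTokens: number | null;
	cacheCreationTokens: number | null;
};

// Sketch: fold a new usage frame into the running counters. A field that was
// omitted in the frame (null) must not erase an earlier non-null value.
function mergeUsage(running: RunningUsage, frame: RunningUsage): RunningUsage {
	return {
		cachedTokens: frame.cachedTokens ?? running.cachedTokens,
		cacheCreationTokens:
			frame.cacheCreationTokens ?? running.cacheCreationTokens,
	};
}
```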
apps/gateway/src/anthropic/anthropic.ts (1)

340-387: ⚠️ Potential issue | 🟠 Major

cache_control is still dropped on tool_result user turns.

The new preservation logic at Lines 397-425 never runs for user messages that hit the special-case branch at Lines 340-387. That branch still collapses the remaining text blocks into a plain string, so a payload like [text(cache_control), tool_result] loses its explicit cache marker before it reaches /v1/chat/completions.

Also applies to: 397-425

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/anthropic/anthropic.ts` around lines 340 - 387, The
special-case branch that handles message.content with tool_result collapses
remaining text blocks into a single string (the textContent construction and
subsequent openaiMessages.push), which drops any cache_control markers; instead
preserve text blocks (including cache_control) by collecting the text blocks as
their original block objects rather than joining to a string and push a user
message whose content is the array of blocks (or otherwise carry cache_control
metadata) so the downstream preservation logic that expects block objects can
run; update the code around toolResults, the textContent creation, and the
openaiMessages.push for role:"user" to forward the blocks unchanged (referencing
message.content, toolResults, combinedContent, and the openaiMessages pushes).
🧹 Nitpick comments (1)
apps/gateway/src/chat/tools/transform-streaming-to-openai.ts (1)

12-33: Give normalizeAnthropicUsage a concrete type.

This helper only reads a small, fixed usage shape, so any hides field-name drift in a pretty central transform path.

♻️ Proposed refactor
+type AnthropicUsage = {
+	input_tokens?: number;
+	cache_creation_input_tokens?: number;
+	cache_read_input_tokens?: number;
+	output_tokens?: number;
+};
+
-function normalizeAnthropicUsage(usage: any): any {
+function normalizeAnthropicUsage(usage: AnthropicUsage | null | undefined) {

As per coding guidelines, "Never use any or as any in TypeScript unless absolutely necessary".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/chat/tools/transform-streaming-to-openai.ts` around lines 12
- 33, The function normalizeAnthropicUsage currently accepts and returns any,
which hides shape drift; define a concrete input type (e.g., interface
AnthropicUsage { input_tokens?: number; cache_creation_input_tokens?: number;
cache_read_input_tokens?: number; output_tokens?: number } ) and a concrete
return type (e.g., NormalizedUsage | null with prompt_tokens, completion_tokens,
total_tokens and optional prompt_tokens_details), change the signature to
normalizeAnthropicUsage(usage: AnthropicUsage | null | undefined):
NormalizedUsage | null, and update the implementation to use those typed fields
(keeping the same logic for defaults and conditional prompt_tokens_details); add
the new types in this file (or a nearby types file) and remove any use of any
for this helper.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/gateway/src/anthropic/anthropic.ts`:
- Around line 535-545: The message_start payload is only emitting usage: {
input_tokens, output_tokens } while the code elsewhere (the local usage object
in anthropic.ts) now includes cache_creation_input_tokens and
cache_read_input_tokens, causing native streaming clients to see those fields as
undefined; update all places that construct or emit message_start.message.usage
(including the blocks around the existing usage declaration and the
message_start emission sites referenced) to include cache_creation_input_tokens
and cache_read_input_tokens (populated from the same usage object or initialized
to 0) so the emitted usage object consistently has { input_tokens,
output_tokens, cache_creation_input_tokens, cache_read_input_tokens } across the
codepaths.

In `@apps/gateway/src/chat/chat.ts`:
- Around line 7110-7118: The final (normal unbuffered) usage emission path still
omits prompt_tokens_details, so update the code that emits the final usage
payload (the block that sets doneSent = true) to include the same conditional
spread used elsewhere: include prompt_tokens_details when (cachedTokens !== null
|| (cacheCreationTokens !== null && cacheCreationTokens > 0)) with
cached_tokens: cachedTokens ?? 0 and cache_creation_tokens when applicable; also
ensure the forceStream JSON adapter (which copies usage from streamed chunks)
will receive/merge that prompt_tokens_details by copying usage including
prompt_tokens_details rather than overwriting it. Reference
prompt_tokens_details, cachedTokens, cacheCreationTokens, doneSent and
forceStream when making the changes.

In `@apps/gateway/src/native-anthropic-cache.e2e.ts`:
- Around line 136-145: Replace use of buildLongSystemPrompt() in the explicit
cache_control test cases with a below-threshold prompt so the tests exercise the
explicit cache_control plumbing rather than the legacy length heuristic; locate
the spots where longText = buildLongSystemPrompt() and the body.system entry
includes cache_control (the explicit-marker cases) and change to a short prompt
(e.g., buildShortSystemPrompt() or a hardcoded short string) that is
deliberately shorter than the heuristic threshold, and make the same change at
the second occurrence referenced around the other block (the lines near 476-487)
to ensure both explicit cache_control tests use the short fixture.
- Around line 65-88: The current assertCacheDiscountApplied(usage: any) mixes
OpenAI and Anthropic shapes and uses any; split it into two typed helpers and
update call sites: create assertCacheDiscountAppliedOpenAI(usage: {
prompt_tokens: number; prompt_tokens_details?: { cached_tokens?: number };
cost_usd_input?: number; cost_usd_cached_input?: number }) that preserves the
existing per-token cost assertion (use cached_tokens, prompt_tokens,
cost_usd_input, cost_usd_cached_input) and create
assertCacheDiscountAppliedAnthropic(usage: { input_tokens?: number;
cache_creation_input_tokens?: number; cache_read_input_tokens?: number }) that
uses Anthropic fields (treat cache_read_input_tokens as cachedTokens and derive
uncachedTokens from input_tokens and cache_read_input_tokens) and, since cost
fields are absent, only assert that cachedTokens > 0 and uncachedTokens > 0
(skip per-token cost comparison); replace usages of assertCacheDiscountApplied
to call the appropriate new helper and remove the any type.

---

Outside diff comments:
In `@apps/gateway/src/anthropic/anthropic.ts`:
- Around line 340-387: The special-case branch that handles message.content with
tool_result collapses remaining text blocks into a single string (the
textContent construction and subsequent openaiMessages.push), which drops any
cache_control markers; instead preserve text blocks (including cache_control) by
collecting the text blocks as their original block objects rather than joining
to a string and push a user message whose content is the array of blocks (or
otherwise carry cache_control metadata) so the downstream preservation logic
that expects block objects can run; update the code around toolResults, the
textContent creation, and the openaiMessages.push for role:"user" to forward the
blocks unchanged (referencing message.content, toolResults, combinedContent, and
the openaiMessages pushes).

In `@apps/gateway/src/chat/chat.ts`:
- Around line 8875-8912: The OpenAPI response schema for the /completions
endpoint is missing the new field
usage.prompt_tokens_details.cache_creation_tokens that
transformResponseToOpenai(...) can now populate; update the documented 200
response schema to add prompt_tokens_details.cache_creation_tokens (an
integer/number, nullable if appropriate) alongside the existing cached_tokens
entry so the emitted JSON shape matches what transformResponseToOpenai returns,
and regenerate/update any client types or schema references that rely on that
response definition.

In `@apps/gateway/src/chat/tools/extract-token-usage.ts`:
- Around line 102-115: The aws-bedrock branch in extractTokenUsage currently
coerces missing cache fields to 0
(inputTokens/cacheReadTokens/cacheWriteTokens), which can overwrite prior
non-null running counters; instead, leave these as null when absent (e.g., set
inputTokens/cacheReadTokens/cacheWriteTokens to null if undefined) and only
compute promptTokens/prompt-side sums when the contributing values are actually
present; preserve cachedTokens and cacheCreationTokens as null if their source
fields are absent so later partial frames cannot zero-out earlier values; apply
the same change to the other branch referenced (lines ~118-133) that handles
similar cache fields.

In `@apps/gateway/src/chat/tools/transform-streaming-to-openai.ts`:
- Around line 1221-1248: The streaming metadata branch builds transformedData
but only emits prompt_tokens_details when cacheReadTokens > 0, which drops
cache_creation_tokens for write-only cache events; update the logic in the
eventType === "metadata" handling (around transformedData construction) to
include prompt_tokens_details whenever cacheWriteTokens > 0 (and include
cached_tokens or cache_creation_tokens as appropriate) or when either
cacheReadTokens > 0 or cacheWriteTokens > 0, ensuring cacheWriteTokens is
represented in prompt_tokens_details and that prompt_tokens still sums
inputTokens + cacheReadTokens + cacheWriteTokens.

In `@packages/actions/src/prepare-request-body.ts`:
- Around line 1412-1460: The current loop over bedrockSystemMessages pushes each
array part as its own block which breaks the legacy “concatenate whole system
message” heuristic; update the handling in the collectedBedrockBlocks build so
that when sysMsg.content is an array you first inspect parts using isTextContent
and detect whether any part has an explicit cache_control: if any part has
cache_control, push each text part separately with hasExplicitCacheControl set
appropriately (preserving per-part markers), otherwise concatenate all text
parts into one string and push a single block with
hasExplicitCacheControl=false; keep the rest of the logic (systemContent,
bedrockCacheControlCount, bedrockMaxCacheControlBlocks,
bedrockMinCacheableChars) unchanged.

---

Nitpick comments:
In `@apps/gateway/src/chat/tools/transform-streaming-to-openai.ts`:
- Around line 12-33: The function normalizeAnthropicUsage currently accepts and
returns any, which hides shape drift; define a concrete input type (e.g.,
interface AnthropicUsage { input_tokens?: number; cache_creation_input_tokens?:
number; cache_read_input_tokens?: number; output_tokens?: number } ) and a
concrete return type (e.g., NormalizedUsage | null with prompt_tokens,
completion_tokens, total_tokens and optional prompt_tokens_details), change the
signature to normalizeAnthropicUsage(usage: AnthropicUsage | null | undefined):
NormalizedUsage | null, and update the implementation to use those typed fields
(keeping the same logic for defaults and conditional prompt_tokens_details); add
the new types in this file (or a nearby types file) and remove any use of any
for this helper.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c688c9ec-cbe9-41b4-acf4-b6b541dd94b6

📥 Commits

Reviewing files that changed from the base of the PR and between 6b064a1 and c89fa4b.

📒 Files selected for processing (10)
  • apps/gateway/src/anthropic/anthropic.ts
  • apps/gateway/src/chat-prompt-caching.e2e.ts
  • apps/gateway/src/chat/chat.ts
  • apps/gateway/src/chat/schemas/completions.ts
  • apps/gateway/src/chat/tools/extract-token-usage.ts
  • apps/gateway/src/chat/tools/parse-provider-response.ts
  • apps/gateway/src/chat/tools/transform-response-to-openai.ts
  • apps/gateway/src/chat/tools/transform-streaming-to-openai.ts
  • apps/gateway/src/native-anthropic-cache.e2e.ts
  • packages/actions/src/prepare-request-body.ts

Comment on lines +535 to +545
let usage: {
	input_tokens: number;
	output_tokens: number;
	cache_creation_input_tokens: number;
	cache_read_input_tokens: number;
} = {
	input_tokens: 0,
	output_tokens: 0,
	cache_creation_input_tokens: 0,
	cache_read_input_tokens: 0,
};

⚠️ Potential issue | 🟠 Major

Add the cache usage fields to message_start too.

This path now treats cache_creation_input_tokens and cache_read_input_tokens as always-present, but the message_start payload at Line 603 still emits usage: { input_tokens, output_tokens } only. Native streaming clients that inspect message_start.message.usage will still see undefined for the new fields.

Possible fix
  usage: {
  	input_tokens: 0,
  	output_tokens: 0,
+ 	cache_creation_input_tokens: 0,
+ 	cache_read_input_tokens: 0,
  },

Also applies to: 592-604, 739-758


Comment on lines +7110 to +7118
 ...((cachedTokens !== null ||
 	(cacheCreationTokens !== null &&
 		cacheCreationTokens > 0)) && {
 	prompt_tokens_details: {
-		cached_tokens: cachedTokens,
+		cached_tokens: cachedTokens ?? 0,
+		...(cacheCreationTokens !== null &&
+			cacheCreationTokens > 0 && {
+				cache_creation_tokens: cacheCreationTokens,
+			}),

⚠️ Potential issue | 🟠 Major

The normal [DONE] path still drops cache token details.

This addition only affects the late !doneSent usage chunk. In the normal unbuffered flow, Lines 5829-5870 already emit the final usage payload and Lines 5912-5918 set doneSent = true, so Anthropic/Bedrock streams still finish without prompt_tokens_details in the common case. That also leaks into the forceStream JSON adapter, since it copies usage from the streamed chunks.


Comment on lines +65 to +88
function assertCacheDiscountApplied(usage: any) {
	const cachedTokens = usage?.prompt_tokens_details?.cached_tokens ?? 0;
	const promptTokens = usage?.prompt_tokens ?? 0;
	const uncachedTokens = promptTokens - cachedTokens;
	const inputCost = usage?.cost_usd_input;
	const cachedInputCost = usage?.cost_usd_cached_input;
	if (
		typeof inputCost !== "number" ||
		typeof cachedInputCost !== "number" ||
		cachedTokens === 0 ||
		uncachedTokens === 0
	) {
		// Without both cached and uncached tokens we can't compare per-token
		// rates. Skip rather than fail — the test that primes the cache will
		// still verify cached_tokens > 0 separately.
		return;
	}
	const uncachedPerToken = inputCost / uncachedTokens;
	const cachedPerToken = cachedInputCost / cachedTokens;
	expect(
		cachedPerToken,
		`expected per-token cached cost (${cachedPerToken}) to be less than per-token uncached cost (${uncachedPerToken})`,
	).toBeLessThan(uncachedPerToken);
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "assertCacheDiscountApplied helper:"
sed -n '65,88p' apps/gateway/src/native-anthropic-cache.e2e.ts

echo
echo "Native call site:"
sed -n '180,205p' apps/gateway/src/native-anthropic-cache.e2e.ts

echo
echo "Native /v1/messages response usage shape:"
sed -n '848,875p' apps/gateway/src/anthropic/anthropic.ts

Repository: theopenco/llmgateway

Length of output: 3161


🏁 Script executed:

# Find all calls to assertCacheDiscountApplied
rg "assertCacheDiscountApplied" apps/gateway/src/ -B 2 -A 2

echo "---"
echo "Check for type definitions related to usage shapes:"
rg "(prompt_tokens|input_tokens|cache_read|cache_creation)" apps/gateway/src/ -t ts -t tsx --iglob "*.d.ts" -l

Repository: theopenco/llmgateway

Length of output: 1663


🏁 Script executed:

#!/bin/bash
# Get line numbers for all calls and surrounding context
rg -n "assertCacheDiscountApplied" apps/gateway/src/native-anthropic-cache.e2e.ts

echo "---"
echo "Check the test contexts:"
sed -n '150,210p' apps/gateway/src/native-anthropic-cache.e2e.ts | head -40

echo "---"
sed -n '250,330p' apps/gateway/src/native-anthropic-cache.e2e.ts | head -40

Repository: theopenco/llmgateway

Length of output: 2714


🏁 Script executed:

#!/bin/bash
# Check the native response construction for cost fields
sed -n '840,920p' apps/gateway/src/anthropic/anthropic.ts

Repository: theopenco/llmgateway

Length of output: 1584


🏁 Script executed:

#!/bin/bash
# Check if native responses include cost fields
rg "cost_usd_input|cost_usd_cached_input" apps/gateway/src/ -t ts -B 2 -A 2

Repository: theopenco/llmgateway

Length of output: 15134


Remove any type and split the cache-discount assertion by response shape.

The helper assertCacheDiscountApplied at line 65 expects OpenAI-compatible response fields (prompt_tokens, prompt_tokens_details.cached_tokens, cost_usd_input, cost_usd_cached_input), but at line 202 it's called with the native Anthropic /v1/messages response, which uses input_tokens, cache_creation_input_tokens, and cache_read_input_tokens instead. The native response also lacks the cost fields. Using any silently hides this mismatch—the native call hits the early return at line 79 (uncachedTokens === 0) and skips validation entirely.

Create separate helpers for native and OpenAI-compatible responses, or provide proper typed parameters instead of any.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/native-anthropic-cache.e2e.ts` around lines 65 - 88, The
current assertCacheDiscountApplied(usage: any) mixes OpenAI and Anthropic shapes
and uses any; split it into two typed helpers and update call sites: create
assertCacheDiscountAppliedOpenAI(usage: { prompt_tokens: number;
prompt_tokens_details?: { cached_tokens?: number }; cost_usd_input?: number;
cost_usd_cached_input?: number }) that preserves the existing per-token cost
assertion (use cached_tokens, prompt_tokens, cost_usd_input,
cost_usd_cached_input) and create assertCacheDiscountAppliedAnthropic(usage: {
input_tokens?: number; cache_creation_input_tokens?: number;
cache_read_input_tokens?: number }) that uses Anthropic fields (treat
cache_read_input_tokens as cachedTokens and derive uncachedTokens from
input_tokens and cache_read_input_tokens) and, since cost fields are absent,
only assert that cachedTokens > 0 and uncachedTokens > 0 (skip per-token cost
comparison); replace usages of assertCacheDiscountApplied to call the
appropriate new helper and remove the any type.

Comment on lines +136 to +145
const longText = buildLongSystemPrompt();
const body = {
	model: "anthropic/claude-haiku-4-5",
	max_tokens: 50,
	system: [
		{
			type: "text" as const,
			text: longText,
			cache_control: { type: "ephemeral" as const },
		},

⚠️ Potential issue | 🟠 Major

Use a below-threshold fixture for the explicit cache_control coverage.

Both of these tests still use buildLongSystemPrompt(), so the legacy length heuristic can make them pass even if cache_control is dropped somewhere in the path. Please switch the explicit-marker cases to a prompt that is intentionally shorter than the heuristic threshold so they prove the new plumbing rather than the fallback.

Also applies to: 476-487

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/native-anthropic-cache.e2e.ts` around lines 136 - 145,
Replace use of buildLongSystemPrompt() in the explicit cache_control test cases
with a below-threshold prompt so the tests exercise the explicit cache_control
plumbing rather than the legacy length heuristic; locate the spots where
longText = buildLongSystemPrompt() and the body.system entry includes
cache_control (the explicit-marker cases) and change to a short prompt (e.g.,
buildShortSystemPrompt() or a hardcoded short string) that is deliberately
shorter than the heuristic threshold, and make the same change at the second
occurrence referenced around the other block (the lines near 476-487) to ensure
both explicit cache_control tests use the short fixture.
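A below-threshold fixture could look like the following sketch. `buildShortSystemPrompt` and the 4096-character threshold are hypothetical; the repo's actual heuristic threshold (`minCacheableChars`) may differ:

```typescript
// Hypothetical threshold — substitute the repo's real minCacheableChars.
const ASSUMED_MIN_CACHEABLE_CHARS = 4096;

// Deliberately short, so only the explicit marker (never the length
// heuristic) can trigger caching.
function buildShortSystemPrompt(): string {
	return "You are a terse assistant. Answer in one sentence.";
}

const body = {
	model: "anthropic/claude-haiku-4-5",
	max_tokens: 50,
	system: [
		{
			type: "text" as const,
			text: buildShortSystemPrompt(),
			cache_control: { type: "ephemeral" as const },
		},
	],
};
```

If the marker is dropped anywhere in the path, a fixture like this makes the test fail instead of silently passing via the fallback.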

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/gateway/src/anthropic/anthropic.ts (1)

330-339: ⚠️ Potential issue | 🟠 Major

Mixed tool messages still drop text-block cache markers.

These branches always collapse the remaining text blocks into a plain string, so any explicit cache_control attached to text alongside tool_use or tool_result is discarded before the inner /v1/chat/completions hop. Reusing the same block-preserving logic as Lines 395-428 here would close that gap.

Also applies to: 380-389

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/anthropic/anthropic.ts` around lines 330 - 339, The current
branch collapses mixed message.content into a plain string (via textContent) and
loses any cache_control markers; update the openaiMessages push for mixed
tool/text messages so it preserves block objects (including cache_control)
instead of joining to a string—replace the textContent construction and the
content: textContent || "" assignment in the openaiMessages.push with the same
block-preserving logic used elsewhere in this file (the logic that maps
message.content to an array of blocks preserving type, text, and cache_control),
ensuring tool_calls is still attached; apply the same fix to the other similar
branch referenced in the comment.
♻️ Duplicate comments (1)
apps/gateway/src/anthropic/anthropic.ts (1)

595-607: ⚠️ Potential issue | 🟠 Major

message_start still omits the new cache usage fields.

Line 606 still emits only { input_tokens, output_tokens }, so native streaming clients inspecting message_start.message.usage will not see cache_creation_input_tokens or cache_read_input_tokens until the later delta.

💡 Possible fix
  usage: {
  	input_tokens: 0,
  	output_tokens: 0,
+ 	cache_creation_input_tokens: 0,
+ 	cache_read_input_tokens: 0,
  },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/anthropic/anthropic.ts` around lines 595 - 607, The
message_start event emitted in the stream.writeSSE call for the assistant
message (the block constructing message with id = messageId, role = "assistant",
model = model) currently sets usage to only { input_tokens, output_tokens };
update that usage object in the message_start payload to include
cache_creation_input_tokens and cache_read_input_tokens (initialize them to 0
like the other token counters) so native streaming clients see the full usage
shape immediately. Locate the stream.writeSSE invocation that builds the
"message_start" payload and add the two cache usage fields to
message.message.usage.
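With the fix applied, the `message_start` event would carry the complete usage shape from the first event. A sketch follows; the id and model values are placeholders, and the field names follow Anthropic's documented streaming format:

```typescript
// message_start payload with the full usage shape. Values start at 0 and
// are updated by later message_delta events.
const messageStart = {
	type: "message_start" as const,
	message: {
		id: "msg_placeholder", // hypothetical id
		type: "message" as const,
		role: "assistant" as const,
		model: "claude-haiku-4-5",
		content: [] as unknown[],
		stop_reason: null,
		usage: {
			input_tokens: 0,
			output_tokens: 0,
			cache_creation_input_tokens: 0,
			cache_read_input_tokens: 0,
		},
	},
};
```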

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: b5186f0b-9416-417f-86eb-a1edba018268

📥 Commits

Reviewing files that changed from the base of the PR and between c89fa4b and f364df1.

📒 Files selected for processing (1)
  • apps/gateway/src/anthropic/anthropic.ts

Place a cache_control / cachePoint marker on the last content block of
the message just before the final user turn. This caches the entire
conversation prefix (all prior turns) instead of only caching individual
text blocks that exceed a length threshold.

Before: only the system prompt was cached (~16k tokens), and the
conversation history (~100k+ tokens) was reprocessed on every request.

After: the entire prefix up to the previous turn is cached, so only the
new user message and the model's response are uncached. This
dramatically improves the cache hit ratio for long multi-turn
conversations (e.g. Claude Code sessions).

Applied to both Anthropic (cache_control: {type: "ephemeral"}) and AWS
Bedrock (cachePoint: {type: "default"}) paths, respecting the existing
4-block limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
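The turn-boundary strategy described above can be sketched as follows. The `Msg`/`Block` shapes and the helper name are hypothetical; only the placement rule and the 4-marker budget come from the commit message:

```typescript
type Block = {
	type: "text";
	text: string;
	cache_control?: { type: "ephemeral" };
};
type Msg = { role: "user" | "assistant"; content: Block[] };

// Attach a cache marker to the last content block of the message just
// before the final user turn, so the entire conversation prefix is
// cached rather than only individual long text blocks.
function markConversationPrefix(messages: Msg[], usedMarkers: number): Msg[] {
	const MAX_CACHE_BLOCKS = 4; // Anthropic's per-request cache_control limit
	if (usedMarkers >= MAX_CACHE_BLOCKS) {
		return messages;
	}
	const lastUserIdx = messages.map((m) => m.role).lastIndexOf("user");
	const anchorIdx = lastUserIdx - 1; // message just before the final user turn
	if (anchorIdx < 0) {
		return messages;
	}
	return messages.map((m, i) => {
		if (i !== anchorIdx || m.content.length === 0) {
			return m;
		}
		const blocks = m.content.slice();
		const last = blocks.length - 1;
		blocks[last] = { ...blocks[last], cache_control: { type: "ephemeral" } };
		return { ...m, content: blocks };
	});
}
```

For Bedrock, the same placement would emit a `cachePoint: { type: "default" }` block instead of a `cache_control` field.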
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/actions/src/transform-anthropic-messages.ts (1)

124-139: ⚠️ Potential issue | 🟠 Major

Count preserved cache_control blocks before adding new ones.

Lines 124-139 return caller-supplied cache_control text parts unchanged, but they never increment cacheControlCount. That lets Lines 296-330 append a turn-boundary marker even when four explicit markers are already present, which can push Anthropic requests past the 4-block cap.

Proposed fix
-					if (isTextContent(part) && part.text && !part.cache_control) {
-						// Automatically add cache_control for long text blocks
-						const shouldCache =
-							shouldApplyCacheControl &&
-							part.text.length >= minCacheableChars &&
-							cacheControlCount < maxCacheControlBlocks;
-						if (shouldCache) {
-							cacheControlCount++;
-							return {
-								...part,
-								cache_control: { type: "ephemeral" },
-							};
-						}
+					if (isTextContent(part) && part.text) {
+						if (part.cache_control) {
+							if (cacheControlCount < maxCacheControlBlocks) {
+								cacheControlCount++;
+								return part;
+							}
+							const { cache_control: _ignored, ...rest } = part;
+							return rest;
+						}
+						// Automatically add cache_control for long text blocks
+						const shouldCache =
+							shouldApplyCacheControl &&
+							part.text.length >= minCacheableChars &&
+							cacheControlCount < maxCacheControlBlocks;
+						if (shouldCache) {
+							cacheControlCount++;
+							return {
+								...part,
+								cache_control: { type: "ephemeral" },
+							};
+						}
 					}

Also applies to: 296-330

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/actions/src/transform-anthropic-messages.ts` around lines 124 - 139,
The code that iterates message parts (using cacheControlCount,
shouldApplyCacheControl, minCacheableChars, maxCacheControlBlocks) currently
skips incrementing cacheControlCount when a caller-supplied part already has a
cache_control field, allowing the code later that appends a turn-boundary marker
to exceed Anthropic's 4-block cap; modify the parts-mapping logic to detect
existing part.cache_control and increment cacheControlCount when present (and
similarly update the analogous logic around the turn-boundary appending code
that uses the same variables) so both caller-provided and auto-added
cache_control blocks are counted before adding any additional markers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f54427df-d6a3-478a-8042-6ebfea4ef69b

📥 Commits

Reviewing files that changed from the base of the PR and between f364df1 and f1ac94b.

📒 Files selected for processing (3)
  • packages/actions/src/prepare-request-body.spec.ts
  • packages/actions/src/prepare-request-body.ts
  • packages/actions/src/transform-anthropic-messages.ts

Comment on lines +1170 to 1249
// Detect whether any text block in the incoming system messages has
// a caller-supplied cache_control marker. If so, we preserve the
// per-block structure so we can forward markers verbatim. Otherwise
// we fall back to the legacy behavior of concatenating each system
// message's text into a single block (and applying the length-based
// heuristic per concatenated block).
const callerSetCacheControl = systemMessages.some((sysMsg) => {
	if (!Array.isArray(sysMsg.content)) {
		return false;
	}
	return sysMsg.content.some(
		(c) => isTextContent(c) && !!c.cache_control,
	);
});

if (callerSetCacheControl) {
	for (const sysMsg of systemMessages) {
		if (typeof sysMsg.content === "string") {
			if (!sysMsg.content.trim()) {
				continue;
			}
			systemContent.push({ type: "text", text: sysMsg.content });
		} else if (Array.isArray(sysMsg.content)) {
			for (const part of sysMsg.content) {
				if (!isTextContent(part) || !part.text || !part.text.trim()) {
					continue;
				}
				const explicit = part.cache_control;
				if (explicit) {
					if (systemCacheControlCount < maxCacheControlBlocks) {
						systemCacheControlCount++;
						systemContent.push({
							type: "text",
							text: part.text,
							cache_control: explicit,
						});
					} else {
						systemContent.push({ type: "text", text: part.text });
					}
				} else {
					systemContent.push({ type: "text", text: part.text });
				}
			}
		}
	}
} else {
	for (const sysMsg of systemMessages) {
		let text: string;
		if (typeof sysMsg.content === "string") {
			text = sysMsg.content;
		} else if (Array.isArray(sysMsg.content)) {
			// Concatenate text from array content (legacy behavior).
			text = sysMsg.content
				.filter((c) => c.type === "text" && "text" in c)
				.map((c) => (c as { type: "text"; text: string }).text)
				.join("");
		} else {
			continue;
		}

		if (!text || text.trim() === "") {
			continue;
		}

		// Add cache_control for text blocks exceeding the model's minimum cacheable threshold
		const shouldCache =
			text.length >= minCacheableChars &&
			systemCacheControlCount < maxCacheControlBlocks;

		if (shouldCache) {
			systemCacheControlCount++;
			systemContent.push({
				type: "text",
				text,
				cache_control: { type: "ephemeral" },
			});
		} else {
			systemContent.push({ type: "text", text });
		}
	}
}

⚠️ Potential issue | 🟠 Major

Don't disable heuristic caching for every system block once one explicit marker appears.

Both branches switch into an all-or-nothing mode via callerSetCacheControl / callerSetBedrockCacheControl. With mixed inputs, an unmarked long system block stops getting heuristic caching just because some other system block had explicit cache_control. That breaks the documented “preserve explicit markers, fall back when absent” behavior.

Also applies to: 1436-1454

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/actions/src/prepare-request-body.ts` around lines 1170 - 1249, The
current logic flips a global callerSetCacheControl mode which disables heuristic
caching for all system blocks if any block contains an explicit cache_control;
instead, iterate per system message and for each sysMsg detect whether that
specific sysMsg (via Array.isArray(sysMsg.content) && sysMsg.content.some(c =>
isTextContent(c) && !!c.cache_control)) carries explicit cache_control; if a
sysMsg has explicit markers, preserve those explicit cache_control entries
verbatim (respecting systemCacheControlCount and maxCacheControlBlocks),
otherwise treat that single sysMsg with the legacy concatenation+heuristic path
(using minCacheableChars and incrementing systemCacheControlCount when you apply
an ephemeral cache_control), updating systemContent accordingly; adjust/remove
the global callerSetCacheControl branch and use per-message logic around
systemMessages, systemContent, systemCacheControlCount, maxCacheControlBlocks,
minCacheableChars, isTextContent and cache_control.
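A per-message version of that logic might look like this sketch. Names mirror the snippet above, but the function signature, part shapes, and threshold handling are assumptions:

```typescript
type TextPart = {
	type: "text";
	text: string;
	cache_control?: { type: "ephemeral" };
};
type SystemMsg = { content: string | TextPart[] };

// Decide explicit vs. heuristic caching per system message instead of
// flipping one global mode for the whole request: a message carrying an
// explicit marker is forwarded verbatim, while unmarked messages still
// get the legacy concatenation + length heuristic.
function buildSystemContent(
	systemMessages: SystemMsg[],
	minCacheableChars: number,
	maxCacheControlBlocks: number,
): TextPart[] {
	const out: TextPart[] = [];
	let used = 0;
	for (const sysMsg of systemMessages) {
		const hasExplicit =
			Array.isArray(sysMsg.content) &&
			sysMsg.content.some((c) => !!c.cache_control);
		if (hasExplicit && Array.isArray(sysMsg.content)) {
			// Preserve this message's explicit markers verbatim (within budget).
			for (const part of sysMsg.content) {
				if (!part.text.trim()) {
					continue;
				}
				if (part.cache_control && used < maxCacheControlBlocks) {
					used++;
					out.push({ type: "text", text: part.text, cache_control: part.cache_control });
				} else {
					out.push({ type: "text", text: part.text });
				}
			}
		} else {
			// Legacy path for this message only: concatenate, then apply the
			// length heuristic.
			const text =
				typeof sysMsg.content === "string"
					? sysMsg.content
					: sysMsg.content.map((c) => c.text).join("");
			if (!text.trim()) {
				continue;
			}
			if (text.length >= minCacheableChars && used < maxCacheControlBlocks) {
				used++;
				out.push({ type: "text", text, cache_control: { type: "ephemeral" } });
			} else {
				out.push({ type: "text", text });
			}
		}
	}
	return out;
}
```

With this shape, a long unmarked system block keeps its heuristic caching even when a neighboring block carries an explicit marker.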
