Description
Expected Behavior
Developers should be able to designate portions of the system message as static (cacheable) vs dynamic (uncacheable), with buildSystemContent() emitting separate Anthropic ContentBlocks accordingly:
```json
{
  "system": [
    {"type": "text", "text": "static instructions...", "cache_control": {"type": "ephemeral"}},
    {"type": "text", "text": "dynamic per-request context..."}
  ]
}
```

Anthropic's API already supports arrays of content blocks in the system field, each with independent cache_control (up to 4 cache breakpoints per request). The static block would get cache hits on subsequent requests while the dynamic block is processed fresh.
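A rough sketch of what a multi-block buildSystemContent() could emit. The record types below are simplified stand-ins for illustration, not Spring AI's actual AnthropicApi types:

```java
import java.util.List;

public class MultiBlockSketch {

    // Simplified stand-ins for Anthropic content blocks; the real Spring AI
    // ContentBlock and cache_control wiring may differ.
    record CacheControl(String type) {}
    record ContentBlock(String type, String text, CacheControl cacheControl) {}

    // Emit one cacheable block for static text and one plain block for
    // dynamic text, instead of joining both into a single block.
    static List<ContentBlock> buildSystemContent(String staticText, String dynamicText) {
        return List.of(
            new ContentBlock("text", staticText, new CacheControl("ephemeral")),
            new ContentBlock("text", dynamicText, null));
    }

    public static void main(String[] args) {
        var blocks = buildSystemContent("safety guardrails...", "product recommendations...");
        System.out.println(blocks.size());                        // 2
        System.out.println(blocks.get(0).cacheControl().type());  // ephemeral
        System.out.println(blocks.get(1).cacheControl() == null); // true
    }
}
```

Because only the first block carries cache_control, changes to the dynamic block leave the cached prefix intact.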
One possible API shape — a cache policy on SystemMessage:

```java
// Static behavioral instructions — cached
new SystemMessage("safety guardrails...", CachePolicy.CACHEABLE);

// Dynamic advisor-injected context — not cached
new SystemMessage("product recommendations...", CachePolicy.NO_CACHE);
```

Another option would be at the advisor level, letting advisors declare whether their augmentSystemMessage() output belongs in a cached or uncached block.
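The advisor-level variant could look something like the sketch below. CacheScope and the contract shown are hypothetical names invented for this issue, not existing Spring AI API:

```java
public class AdvisorCacheSketch {

    // Hypothetical enum: where an advisor's system-message contribution belongs.
    enum CacheScope { CACHED_BLOCK, UNCACHED_BLOCK }

    // Hypothetical advisor contract: each advisor declares its cache scope,
    // and the chat model groups contributions into separate content blocks.
    interface SystemContributingAdvisor {
        String augmentSystemMessage();
        CacheScope cacheScope();
    }

    static SystemContributingAdvisor ragAdvisor = new SystemContributingAdvisor() {
        public String augmentSystemMessage() { return "product recommendations..."; }
        // RAG output changes every query, so it must not share a block
        // with the cached static instructions.
        public CacheScope cacheScope() { return CacheScope.UNCACHED_BLOCK; }
    };

    public static void main(String[] args) {
        System.out.println(ragAdvisor.cacheScope()); // UNCACHED_BLOCK
    }
}
```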
Current Behavior
AnthropicChatModel.buildSystemContent() concatenates all system messages into a single string via Collectors.joining(), then wraps the result in one ContentBlock with one cache_control marker.
This means that if any part of the system message is dynamic, the entire system block misses the cache on every request. With the SYSTEM_ONLY strategy enabled, each miss pays Anthropic's 1.25x cache-write cost with zero reads, making it more expensive than disabling caching entirely.
There is no way to split static from dynamic content within the system message using the current API. The only workaround is moving dynamic content out of the system message entirely.
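For contrast, the joining behavior described above reduces to a sketch like this (a simplification, not the actual AnthropicChatModel source):

```java
import java.util.List;
import java.util.stream.Collectors;

public class JoinSketch {

    // Simplified version of the current behavior: all system messages
    // collapse into one string, so there is exactly one content block
    // and one cache_control marker for the whole system prompt.
    static String buildSystemContent(List<String> systemMessages) {
        return systemMessages.stream().collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String joined = buildSystemContent(
            List.of("static guardrails...", "dynamic RAG context..."));
        // Any change to the dynamic suffix changes the whole string,
        // so the static prefix can never produce a cache hit.
        System.out.println(joined);
    }
}
```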
Context
Our application uses Anthropic Sonnet for conversational AI agents. The system message has two layers:
- Static behavioral instructions (~2,280 tokens): safety guardrails, domain principles, brand tone, mode instructions, response formatting. Identical across all requests for a given tenant + mode.
- Dynamic context injected by Spring AI advisors: RAG-matched product recommendations (change every query), intent/slot state (change every turn), temporal context (changes daily).
Because buildSystemContent() concatenates everything into one block, we can't cache the static prefix independently. Enabling SYSTEM_ONLY with any dynamic advisor content means 100% cache misses at 1.25x write cost.
Workaround: We moved all dynamic content from the system message to the user message, keeping the system message fully static and cacheable. This works — cache reads cost 0.1x ($0.30/MTok vs $3/MTok uncached for Sonnet) — but it conflates the semantic distinction between system instructions (developer authority) and user-level context. Product recommendations and intent state are contextual data the model reads, not behavioral instructions it follows, so the practical impact is minimal. But ideally the framework wouldn't force this trade-off.
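As a rough sanity check on the pricing claims above (Sonnet input at $3/MTok, cache writes at 1.25x, cache reads at 0.1x; the 2,280-token figure is our own static prefix):

```java
public class CacheCostSketch {
    public static void main(String[] args) {
        double inputPerMTok = 3.00;    // Sonnet uncached input, $/MTok
        double writeMultiplier = 1.25; // cache-write surcharge
        double readMultiplier = 0.10;  // cache-read discount
        int staticTokens = 2_280;      // our static system prefix

        double uncached   = staticTokens / 1e6 * inputPerMTok;
        double cacheWrite = uncached * writeMultiplier; // paid on every request if it always misses
        double cacheRead  = uncached * readMultiplier;  // paid on every request after the first write

        System.out.printf("uncached:    $%.6f per request%n", uncached);   // $0.006840
        System.out.printf("always-miss: $%.6f per request%n", cacheWrite); // $0.008550
        System.out.printf("cache hit:   $%.6f per request%n", cacheRead);  // $0.000684
    }
}
```

The always-miss case is strictly worse than no caching at all, which is why a single concatenated block with any dynamic content is a net loss.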
Alternatives considered:
- Bypassing Spring AI to call the Anthropic API directly with fine-grained block placement — loses the entire advisor pipeline
- Waiting for multi-block support — filed this issue
Related: #4325 (per-message-type TTLs and min-size thresholds) is complementary but doesn't address static/dynamic splitting within system messages.
References:
- Anthropic Prompt Caching docs — multi-block system message examples
- Spring AI version: 1.1.0