Bug Fix: auto-inject prompt caching support for Gemini models #21881

Open

RheagalFire wants to merge 1 commit into BerriAI:main from RheagalFire:feat/autoinject_gemini_cache

Conversation

RheagalFire (Contributor) commented Feb 22, 2026

Relevant issues

Fixes #18519

Problem

cache_control_injection_points in proxy YAML config worked for Anthropic models but was silently ignored for Gemini. The root cause: when the hook injected cache_control on
string-content messages (common for system messages), it set message["cache_control"] at the message level. But is_cached_message() only checked for cache_control on content
items inside a list — so message-level markers were never detected by the Gemini context caching pipeline.
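
For context, a proxy config along these lines is what was silently ignored for Gemini before this fix. This is a hedged sketch, not the reporter's exact config: the model alias, model string, and api_key are illustrative, while the cache_control_injection_points schema follows the documented Anthropic usage.

```yaml
# Hypothetical config.yaml snippet: the same injection-point syntax that
# already worked for Anthropic, pointed at a Gemini model.
model_list:
  - model_name: gemini-cached               # illustrative alias
    litellm_params:
      model: gemini/gemini-1.5-pro          # illustrative Gemini model
      api_key: os.environ/GEMINI_API_KEY
      cache_control_injection_points:
        - location: message                 # inject cache_control on a message
          role: system                      # target the system message
```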

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory (adding at least 1 test is a hard requirement; see details)
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible; it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Type

🐛 Bug Fix

Changes

  • litellm/utils.py: Add message-level cache_control detection in is_cached_message() before the existing content-list check (sketched after this list)
  • litellm/llms/vertex_ai/context_caching/vertex_ai_context_caching.py: Add a minimum token count guard (1024) in both sync and async check_and_create_cache(), so caching is skipped gracefully instead of hitting a Gemini API error
  • Tests: 11 new tests across 3 files covering message-level detection, Gemini integration flow, and min-token skip behavior
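
A minimal sketch of the detection change described in the first bullet, simplified from the actual litellm/utils.py logic (the exact structure in the diff may differ):

```python
# Sketch only: message-level check added before the existing content-list check.
from typing import Any, Dict


def is_cached_message(message: Dict[str, Any]) -> bool:
    # NEW: the hook sets message["cache_control"] for string-content messages
    # (e.g. system prompts), so check for the marker at the message level first.
    cache_control = message.get("cache_control")
    if isinstance(cache_control, dict) and cache_control.get("type") == "ephemeral":
        return True

    # Existing behavior: look for cache_control on items inside a content list.
    content = message.get("content")
    if isinstance(content, list):
        for item in content:
            if (
                isinstance(item, dict)
                and isinstance(item.get("cache_control"), dict)
                and item["cache_control"].get("type") == "ephemeral"
            ):
                return True
    return False
```

The message-level check comes first because the hook only falls back to message-level injection when the content is a plain string and has no list items to attach a marker to.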

Test plan

  • pytest tests/test_litellm/test_utils.py::TestIsCachedMessage — 12 pass (3 new; a rough sketch of these cases follows after this list)
  • pytest tests/test_litellm/integrations/test_anthropic_cache_control_hook.py — 2 new Gemini tests pass
  • pytest tests/test_litellm/llms/vertex_ai/context_caching/test_vertex_ai_context_caching.py — 72 pass (6 new), 0 regressions
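
For reference, the new message-level detection cases likely look roughly like this. This is an illustrative sketch, not the PR's actual test code; only the covered scenarios (positive, wrong-type, non-dict) come from the description above.

```python
# Illustrative test sketch: message-level cache_control detection for
# positive, wrong-type, and non-dict marker values.
from litellm.utils import is_cached_message


def test_message_level_ephemeral_marker_detected():
    msg = {
        "role": "system",
        "content": "You are a helpful assistant.",
        "cache_control": {"type": "ephemeral"},
    }
    assert is_cached_message(msg)


def test_message_level_wrong_type_not_detected():
    msg = {
        "role": "system",
        "content": "You are a helpful assistant.",
        "cache_control": {"type": "not-ephemeral"},
    }
    assert not is_cached_message(msg)


def test_message_level_non_dict_value_not_detected():
    msg = {
        "role": "system",
        "content": "You are a helpful assistant.",
        "cache_control": "ephemeral",
    }
    assert not is_cached_message(msg)
```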


vercel bot commented Feb 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions          | Updated (UTC)
litellm | Ready      | Preview, Comment | Feb 22, 2026 8:20am

Request Review

RheagalFire (Contributor, Author) commented:

@greptileai


greptile-apps bot commented Feb 22, 2026

Greptile Summary

This PR fixes a bug where cache_control_injection_points in proxy YAML config was silently ignored for Gemini models (issue #18519). The root cause was that the AnthropicCacheControlHook sets message["cache_control"] at the message level for string-content messages, but is_cached_message() only checked for cache_control on content items inside a list.

  • Core fix in litellm/utils.py: Adds message-level cache_control detection in is_cached_message() before the existing content-list check, enabling Gemini's context caching pipeline to detect markers injected by the hook on string-content messages.
  • Min-token guard in vertex_ai_context_caching.py: Adds a minimum token count check (1024, the Gemini requirement) in both sync and async check_and_create_cache(), gracefully skipping caching instead of triggering a Gemini API error (see the sketch after this list).
  • Tests: 11 new tests across 3 files covering message-level detection, Gemini integration flow, and min-token skip behavior. All tests are properly mocked with no real network calls.
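
As a rough illustration of the min-token guard in the second bullet: the helper below is hypothetical (in the PR the guard sits inside the sync and async check_and_create_cache() paths), while MINIMUM_PROMPT_CACHE_TOKEN_COUNT and the 1024 threshold come from the diff itself.

```python
# Hypothetical helper (name is illustrative) showing the shape of the guard.
from typing import List

import litellm
from litellm.constants import MINIMUM_PROMPT_CACHE_TOKEN_COUNT  # 1024, per this PR


def should_attempt_context_cache(model: str, cached_messages: List[dict]) -> bool:
    if not cached_messages:
        return False
    token_count = litellm.token_counter(model=model, messages=cached_messages)
    if token_count < MINIMUM_PROMPT_CACHE_TOKEN_COUNT:
        # Below Gemini's minimum cacheable size: skip caching gracefully
        # instead of letting the Gemini caching API call fail.
        return False
    return True
```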

Confidence Score: 4/5

  • This PR is safe to merge — it adds detection for an already-existing message format and a graceful fallback guard with no behavioral regressions.
  • The core fix in is_cached_message() is minimal and correct, adding detection for message-level cache_control that was already being set by the hook. The min-token guard is a sensible defensive addition that prevents API errors. Tests are thorough and well-structured. Only a minor import ordering style issue was found.
  • No files require special attention. The changes in litellm/llms/vertex_ai/context_caching/vertex_ai_context_caching.py have a minor import ordering issue but are functionally correct.

Important Files Changed

  • litellm/utils.py: Adds message-level cache_control detection in is_cached_message() before the existing content-list check. Clean, correct logic with proper type guards.
  • litellm/llms/vertex_ai/context_caching/vertex_ai_context_caching.py: Adds the minimum token count guard (1024) in both sync and async check_and_create_cache() methods. Minor import ordering issue; the logic is correct and avoids Gemini API errors for small cached content.
  • tests/test_litellm/integrations/test_anthropic_cache_control_hook.py: Two new tests verify the full flow for Gemini cache_control injection with both string and list content. Well-structured, no network calls.
  • tests/test_litellm/llms/vertex_ai/context_caching/test_vertex_ai_context_caching.py: Adds setup/teardown for mocking is_prompt_caching_valid_prompt and two new parameterized tests (sync + async) for the min-token skip behavior. No real network calls are made.
  • tests/test_litellm/test_utils.py: Three new unit tests for message-level cache_control detection in is_cached_message(), covering positive, wrong-type, and non-dict cases.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Proxy receives request with<br/>cache_control_injection_points] --> B[AnthropicCacheControlHook<br/>injects cache_control]
    B --> C{Message content type?}
    C -->|String content| D["Sets message-level<br/>cache_control = {type: ephemeral}"]
    C -->|List content| E["Sets cache_control on<br/>last content item"]
    D --> F["is_cached_message()"]
    E --> F
    F --> G{Detected as cached?}
    G -->|"Yes (NEW: message-level check)"| H["separate_cached_messages()<br/>splits messages"]
    G -->|No - was broken before| I[Cache silently skipped]
    H --> J{"is_prompt_caching_valid_prompt()<br/>(NEW: min 1024 tokens?)"}
    J -->|Below threshold| K[Skip caching gracefully]
    J -->|Above threshold| L[Proceed with Gemini<br/>context caching API]

    style D fill:#90EE90
    style J fill:#90EE90
    style K fill:#90EE90

Last reviewed commit: bf8d2a3

greptile-apps bot (Contributor) left a comment

5 files reviewed, 1 comment

Comment on lines 6 to +16
from litellm.caching.caching import Cache, LiteLLMCacheType
from litellm.constants import MINIMUM_PROMPT_CACHE_TOKEN_COUNT
from litellm.litellm_core_utils.litellm_logging import Logging
from litellm.llms.custom_httpx.http_handler import (
    AsyncHTTPHandler,
    HTTPHandler,
    get_async_httpx_client,
)
from litellm._logging import verbose_logger
from litellm.llms.openai.openai import AllMessageValues
from litellm.utils import is_prompt_caching_valid_prompt

Import ordering is inconsistent
The litellm._logging import is placed between the litellm.llms.* imports, breaking the alphabetical grouping. Consider moving it to be adjacent to the other litellm.* (non-llms) imports for consistency.

Suggested change (move litellm._logging next to the other non-llms litellm.* imports):

from litellm._logging import verbose_logger
from litellm.caching.caching import Cache, LiteLLMCacheType
from litellm.constants import MINIMUM_PROMPT_CACHE_TOKEN_COUNT
from litellm.litellm_core_utils.litellm_logging import Logging
from litellm.llms.custom_httpx.http_handler import (
    AsyncHTTPHandler,
    HTTPHandler,
    get_async_httpx_client,
)
from litellm.llms.openai.openai import AllMessageValues
from litellm.utils import is_prompt_caching_valid_prompt

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


Development

Successfully merging this pull request may close this issue:

  • [Bug]: Auto-Inject Prompt Caching not supported for Gemini Models (#18519)