prompt caching SPIKE #109
New documentation file:

@@ -0,0 +1,137 @@
# Prompt Caching

RubyLLM supports Anthropic's prompt caching feature, which allows you to cache parts of your prompts to reduce token usage and costs when making similar requests.

## What is Prompt Caching?

Prompt caching is a feature that allows you to mark specific parts of your prompt as cacheable. When you make a request with a cached prompt, Anthropic will:

1. Check if the prompt prefix (up to the cache breakpoint) is already cached
2. If found, use the cached version, reducing processing time and costs
3. Otherwise, process the full prompt and cache the prefix

This is especially useful for:

- Prompts with many examples
- Large amounts of context or background information
- Repetitive tasks with consistent instructions
- Long multi-turn conversations
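Under the hood, this maps onto Anthropic's Messages API, where individual content blocks carry a `cache_control` marker. As a rough sketch of the request shape a cached system prompt ends up as (taken from Anthropic's API documentation, not from this gem; the exact payload RubyLLM serializes may differ):

```ruby
# Illustrative Anthropic Messages API payload with a cache breakpoint on the
# system prompt. Everything up to and including the marked block is cacheable.
payload = {
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are an AI assistant tasked with analyzing literary works.",
      cache_control: { type: "ephemeral" } # cache breakpoint
    }
  ],
  messages: [
    { role: "user", content: "Analyze the major themes in Pride and Prejudice." }
  ]
}
```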
## Supported Models

Prompt caching is currently supported on the following Anthropic Claude models:

- Claude 3.7 Sonnet
- Claude 3.5 Sonnet
- Claude 3.5 Haiku
- Claude 3 Haiku
- Claude 3 Opus
## How to Use Prompt Caching

To use prompt caching in RubyLLM, you can mark content as cacheable using the `cache_control` parameter:

```ruby
# Create a chat with a Claude model
chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Add a system message with cache control
chat.with_instructions("You are an AI assistant tasked with analyzing literary works.",
                       cache_control: true)

# Add a large document with cache control
chat.ask("Here's the entire text of Pride and Prejudice: [long text...]",
         with: { cache_control: true })

# Now you can ask questions about the document without reprocessing it
chat.ask("Analyze the major themes in Pride and Prejudice.")
```
## Pricing

Prompt caching introduces a different pricing structure:

| Model             | Base Input Tokens | Cache Writes  | Cache Hits   | Output Tokens |
| ----------------- | ----------------- | ------------- | ------------ | ------------- |
| Claude 3.7 Sonnet | $3 / MTok         | $3.75 / MTok  | $0.30 / MTok | $15 / MTok    |
| Claude 3.5 Sonnet | $3 / MTok         | $3.75 / MTok  | $0.30 / MTok | $15 / MTok    |
| Claude 3.5 Haiku  | $0.80 / MTok      | $1 / MTok     | $0.08 / MTok | $4 / MTok     |
| Claude 3 Haiku    | $0.25 / MTok      | $0.30 / MTok  | $0.03 / MTok | $1.25 / MTok  |
| Claude 3 Opus     | $15 / MTok        | $18.75 / MTok | $1.50 / MTok | $75 / MTok    |

Note:

- Cache write tokens are 25% more expensive than base input tokens
- Cache read tokens are 90% cheaper than base input tokens
- Regular input and output tokens are priced at standard rates
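As a rough worked example using the Claude 3.5 Sonnet rates from the table (the token count is made up for illustration), caching a 100,000-token prefix costs a one-time write premium, after which every reuse is billed at the cache-hit rate:

```ruby
# Claude 3.5 Sonnet rates from the table above, in dollars per million tokens
base_input  = 3.00
cache_write = 3.75 # base * 1.25
cache_hit   = 0.30 # base * 0.10

prefix_tokens = 100_000 # illustrative size of the cached prefix

first_request  = prefix_tokens / 1_000_000.0 * cache_write # => ~0.375 ($, writes the cache)
later_requests = prefix_tokens / 1_000_000.0 * cache_hit   # => ~0.03  ($ per request, cache hit)
without_cache  = prefix_tokens / 1_000_000.0 * base_input  # => ~0.30  ($ per request, no caching)
```

The $0.075 write premium on the first request is recovered as soon as the prefix is reused once, since each cache hit saves $0.27 over sending the prefix uncached.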
## Tracking Cache Performance

When using prompt caching, you can track the cache performance using the following fields in the response:

```ruby
response = chat.ask("What are the main characters in Pride and Prejudice?")

puts "Cache creation tokens: #{response.cache_creation_input_tokens}"
puts "Cache read tokens: #{response.cache_read_input_tokens}"
puts "Regular input tokens: #{response.input_tokens}"
puts "Output tokens: #{response.output_tokens}"
```
## Cache Limitations

- The minimum cacheable prompt length is:
  - 1024 tokens for Claude 3.7 Sonnet, Claude 3.5 Sonnet, and Claude 3 Opus
  - 2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku
- Shorter prompts cannot be cached, even if marked with `cache_control`
- The cache has a minimum 5-minute lifetime
- Cache hits require 100% identical prompt segments
## Best Practices

- Place static content (system instructions, context, examples) at the beginning of your prompt
- Mark the end of the reusable content for caching using the `cache_control` parameter
- Use cache breakpoints strategically to separate different cacheable prefix sections (see the sketch below)
- Regularly analyze cache hit rates and adjust your strategy as needed
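For instance, a conversation might use one breakpoint for instructions shared across all sessions and a second for per-session context. This sketch reuses the `cache_control` flag introduced by this spike; the prompts are made up for illustration:

```ruby
chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Breakpoint 1: static instructions shared across all conversations
chat.with_instructions("You are a support agent. Follow this policy document: [...]",
                       cache_control: true)

# Breakpoint 2: per-customer context reused for the rest of this conversation
chat.ask("Here is this customer's account history: [...]",
         with: { cache_control: true })

# Later turns can reuse both cached prefixes
chat.ask("Why was the customer charged twice in March?")
```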
## Example: Document Analysis

```ruby
# Create a chat with Claude
chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Add system instructions with cache control
chat.with_instructions("You are an AI assistant tasked with analyzing documents.",
                       cache_control: true)

# Add a PDF document with cache control
chat.ask("Please analyze this document:",
         with: { pdf: "large_document.pdf", cache_control: true })

# First query - will create a cache
response1 = chat.ask("What are the main points in the executive summary?")
puts "Cache creation tokens: #{response1.cache_creation_input_tokens}"

# Second query - will use the cache
response2 = chat.ask("Who are the key stakeholders mentioned?")
puts "Cache read tokens: #{response2.cache_read_input_tokens}"
```

> **Review comment** (on the `chat.ask(..., with: { pdf: ..., cache_control: true })` call): this API seems fine to me. We need to tag the messages with "this is where to cache", so the `with` option seems like a fine place for that.
## Example: Multi-turn Conversation

```ruby
# Create a chat with Claude
chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Add system instructions with cache control
chat.with_instructions("You are a helpful coding assistant. Use these coding conventions: [long list of conventions]",
                       cache_control: true)

# First query - will create a cache
response1 = chat.ask("How do I write a Ruby class for a bank account?")
puts "Cache creation tokens: #{response1.cache_creation_input_tokens}"

# Second query - will use the cache
response2 = chat.ask("Can you show me how to add a transfer method to that class?")
puts "Cache read tokens: #{response2.cache_read_input_tokens}"
```
Changes to `Chat`:

@@ -26,16 +26,22 @@ def initialize(model: nil, provider: nil)

```diff
     end

     def ask(message = nil, with: {}, &block)
-      add_message role: :user, content: Content.new(message, with)
+      # Extract cache_control from the with hash if present
+      cache_control = with.delete(:cache_control)
+
+      # Create a new Content object with the message and attachments
+      content = Content.new(message, with.merge(cache_control: cache_control))
+
+      add_message role: :user, content: content
       complete(&block)
     end

     alias say ask

-    def with_instructions(instructions, replace: false)
+    def with_instructions(instructions, replace: false, cache_control: nil)
       @messages = @messages.reject! { |msg| msg.role == :system } if replace

-      add_message role: :system, content: instructions
+      add_message role: :system, content: Content.new(instructions, cache_control: cache_control)
       self
     end
```

> **Review comment** (on the `cache_control` extraction in `#ask`): I'm unsure if this is how we want to handle things. This essentially adds global functionality for an Anthropic-specific concept.
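Regardless of where the flag ultimately lives, the two public entry points changed above would be exercised as in the guide (model id and prompt text below are illustrative):

```ruby
chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# New keyword argument on #with_instructions marks the system prompt as cacheable
chat.with_instructions("Long, static instructions...", cache_control: true)

# #ask now pulls :cache_control out of the `with` hash before building the Content
chat.ask("Here is a large reference document: [...]", with: { cache_control: true })
```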
Changes to the Anthropic capabilities module:

@@ -67,6 +67,29 @@ def supports_json_mode?(model_id)

```diff
       def supports_extended_thinking?(model_id)
         model_id.match?(/claude-3-7-sonnet/)
       end

+      # Determines if a model supports prompt caching
+      # @param model_id [String] the model identifier
+      # @return [Boolean] true if the model supports prompt caching
+      def supports_caching?(model_id)
+        model_id.match?(/claude-3(?:-[357])?(?:-(?:opus|sonnet|haiku))/)
+      end
+
+      # Gets the cache write price per million tokens for a given model
+      # @param model_id [String] the model identifier
+      # @return [Float] the price per million tokens for cache writes
+      def cache_write_price_for(model_id)
+        # Cache write tokens are 25% more expensive than base input tokens
+        get_input_price(model_id) * 1.25
+      end
+
+      # Gets the cache hit price per million tokens for a given model
+      # @param model_id [String] the model identifier
+      # @return [Float] the price per million tokens for cache hits
+      def cache_hit_price_for(model_id)
+        # Cache read tokens are 90% cheaper than base input tokens
+        get_input_price(model_id) * 0.1
+      end
+
       # Determines the model family for a given model ID
       # @param model_id [String] the model identifier
```

> **Review comment** (on the new capability helpers): I imagine there is an existing philosophy how to use this data I am unaware of
> **Review comment:** Think Cline got a little overzealous w/ documentation