
prompt caching SPIKE #109


Closed · wants to merge 1 commit
2 changes: 2 additions & 0 deletions docs/_data/navigation.yml
@@ -13,6 +13,8 @@
url: /guides/tools
- title: Streaming
url: /guides/streaming
- title: Prompt Caching
Author comment: Think Cline got a little overzealous w/ documentation

url: /guides/prompt-caching
- title: Rails Integration
url: /guides/rails
- title: Image Generation
15 changes: 14 additions & 1 deletion docs/guides/index.md
@@ -13,30 +13,43 @@ This section contains detailed guides to help you make the most of RubyLLM. Each
## Available Guides

### [Getting Started]({% link guides/getting-started.md %})

Learn the basics of RubyLLM and get up and running quickly with simple examples.

### [Chat]({% link guides/chat.md %})

Explore the chat interface, which is the primary way to interact with AI models through RubyLLM.

### [Tools]({% link guides/tools.md %})

Learn how to extend AI capabilities by creating tools that let models call your Ruby code.

### [Streaming]({% link guides/streaming.md %})

Understand how to use streaming responses for real-time interactions.

### [Prompt Caching]({% link guides/prompt-caching.md %})

Learn how to use Anthropic's prompt caching feature to reduce token usage and costs.

### [Rails Integration]({% link guides/rails.md %})

See how to integrate RubyLLM with Rails applications, including ActiveRecord persistence.

### [Image Generation]({% link guides/image-generation.md %})

Learn how to generate images using DALL-E and other providers.

### [Embeddings]({% link guides/embeddings.md %})

Explore how to create vector embeddings for semantic search and other applications.

### [Error Handling]({% link guides/error-handling.md %})

Master the techniques for robust error handling in AI applications.

### [Working with Models]({% link guides/models.md %})

Learn how to discover, select, and work with different AI models across providers.

## Getting Help
@@ -45,4 +58,4 @@ If you can't find what you're looking for in these guides, consider:

1. Checking the [API Documentation]() for detailed information about specific classes and methods
2. Looking at the [GitHub repository](https://github.com/crmne/ruby_llm) for examples and the latest updates
3. Filing an issue on GitHub if you find a bug or have a feature request
3. Filing an issue on GitHub if you find a bug or have a feature request
137 changes: 137 additions & 0 deletions docs/guides/prompt-caching.md
@@ -0,0 +1,137 @@
# Prompt Caching

RubyLLM supports Anthropic's prompt caching feature, which lets you cache large, repeated portions of your prompts to reduce processing costs and latency on similar requests.

## What is Prompt Caching?

Prompt caching is a feature that allows you to mark specific parts of your prompt as cacheable. When you make a request with a cached prompt, Anthropic will:

1. Check if the prompt prefix (up to the cache breakpoint) is already cached
2. If found, use the cached version, reducing processing time and costs
3. Otherwise, process the full prompt and cache the prefix

This is especially useful for:

- Prompts with many examples
- Large amounts of context or background information
- Repetitive tasks with consistent instructions
- Long multi-turn conversations
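
Under the hood, the Anthropic provider tags the marked content block with an `ephemeral` `cache_control` entry, and everything up to that breakpoint becomes the cacheable prefix. The sketch below shows roughly what such a message payload looks like; field names follow Anthropic's Messages API, and the surrounding structure is illustrative rather than the exact request RubyLLM builds.

```ruby
# Sketch of an Anthropic message with a cache breakpoint on the large text
# block. Treat the surrounding structure as illustrative, not the exact
# payload RubyLLM sends.
{
  role: 'user',
  content: [
    {
      type: 'text',
      text: "Here's the entire text of Pride and Prejudice: [long text...]",
      cache_control: { type: 'ephemeral' } # marks the end of the cacheable prefix
    }
  ]
}
```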

## Supported Models

Prompt caching is currently supported on the following Anthropic Claude models:

- Claude 3.7 Sonnet
- Claude 3.5 Sonnet
- Claude 3.5 Haiku
- Claude 3 Haiku
- Claude 3 Opus
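
This change also adds a `supports_caching?` helper to the Anthropic capabilities module, so you can guard against unsupported models programmatically. The sketch below assumes the helper is reachable at the constant path implied by `lib/ruby_llm/providers/anthropic/capabilities.rb`; adjust if the module is exposed differently.

```ruby
# Assumed constant path based on the capabilities file location.
model_id = 'claude-3-5-sonnet'
chat = RubyLLM.chat(model: model_id)

if RubyLLM::Providers::Anthropic::Capabilities.supports_caching?(model_id)
  chat.with_instructions('You are a helpful assistant.', cache_control: true)
else
  # Fall back to a plain, uncached system prompt
  chat.with_instructions('You are a helpful assistant.')
end
```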

## How to Use Prompt Caching

To use prompt caching in RubyLLM, you can mark content as cacheable using the `cache_control` parameter:

```ruby
# Create a chat with a Claude model
chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Add a system message with cache control
chat.with_instructions("You are an AI assistant tasked with analyzing literary works.",
cache_control: true)

# Add a large document with cache control
chat.ask("Here's the entire text of Pride and Prejudice: [long text...]",
with: { cache_control: true })

# Now you can ask questions about the document without reprocessing it
chat.ask("Analyze the major themes in Pride and Prejudice.")
```

## Pricing

Prompt caching introduces a different pricing structure:

| Model | Base Input Tokens | Cache Writes | Cache Hits | Output Tokens |
| ----------------- | ----------------- | ------------- | ------------ | ------------- |
| Claude 3.7 Sonnet | $3 / MTok | $3.75 / MTok | $0.30 / MTok | $15 / MTok |
| Claude 3.5 Sonnet | $3 / MTok | $3.75 / MTok | $0.30 / MTok | $15 / MTok |
| Claude 3.5 Haiku | $0.80 / MTok | $1 / MTok | $0.08 / MTok | $4 / MTok |
| Claude 3 Haiku | $0.25 / MTok | $0.30 / MTok | $0.03 / MTok | $1.25 / MTok |
| Claude 3 Opus | $15 / MTok | $18.75 / MTok | $1.50 / MTok | $75 / MTok |

Note:

- Cache write tokens are 25% more expensive than base input tokens
- Cache read tokens are 90% cheaper than base input tokens
- Regular input and output tokens are priced at standard rates
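
To make these multipliers concrete, here is a small illustrative calculation using the Claude 3.5 Sonnet prices from the table above. The arithmetic is plain Ruby and the variable names are invented for this example; none of this is part of the RubyLLM API.

```ruby
# Illustrative cost arithmetic only; not part of the RubyLLM API.
base_input_per_mtok  = 3.0                        # $ / MTok, Claude 3.5 Sonnet
cache_write_per_mtok = base_input_per_mtok * 1.25 # => 3.75
cache_hit_per_mtok   = base_input_per_mtok * 0.10 # => 0.30

# Cache a 100k-token prefix once, then reuse it across 20 follow-up requests.
prefix_mtok = 100_000 / 1_000_000.0

cached_cost   = prefix_mtok * cache_write_per_mtok + 20 * prefix_mtok * cache_hit_per_mtok
uncached_cost = 21 * prefix_mtok * base_input_per_mtok

puts format('with caching: $%.3f, without: $%.2f', cached_cost, uncached_cost)
# => with caching: $0.975, without: $6.30
```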

## Tracking Cache Performance

When using prompt caching, you can track the cache performance using the following fields in the response:

```ruby
response = chat.ask("What are the main characters in Pride and Prejudice?")

puts "Cache creation tokens: #{response.cache_creation_input_tokens}"
puts "Cache read tokens: #{response.cache_read_input_tokens}"
puts "Regular input tokens: #{response.input_tokens}"
puts "Output tokens: #{response.output_tokens}"
```

## Cache Limitations

- The minimum cacheable prompt length is:
  - 1024 tokens for Claude 3.7 Sonnet, Claude 3.5 Sonnet, and Claude 3 Opus
  - 2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku
- Shorter prompts cannot be cached, even if marked with `cache_control`
- The cache has a minimum 5-minute lifetime
- Cache hits require 100% identical prompt segments

## Best Practices

- Place static content (system instructions, context, examples) at the beginning of your prompt
- Mark the end of the reusable content for caching using the `cache_control` parameter
- Use cache breakpoints strategically to separate different cacheable prefix sections
- Regularly analyze cache hit rates and adjust your strategy as needed

## Example: Document Analysis

```ruby
# Create a chat with Claude
chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Add system instructions with cache control
chat.with_instructions("You are an AI assistant tasked with analyzing documents.",
cache_control: true)

# Add a PDF document with cache control
chat.ask("Please analyze this document:",
         # Reviewer note (Author): this API seems fine to me. We need to tag
         # the messages with "this is where to cache", so the `with` option
         # seems like a fine place for that.
         with: { pdf: "large_document.pdf", cache_control: true })

# First query - will create a cache
response1 = chat.ask("What are the main points in the executive summary?")
puts "Cache creation tokens: #{response1.cache_creation_input_tokens}"

# Second query - will use the cache
response2 = chat.ask("Who are the key stakeholders mentioned?")
puts "Cache read tokens: #{response2.cache_read_input_tokens}"
```

## Example: Multi-turn Conversation

```ruby
# Create a chat with Claude
chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Add system instructions with cache control
chat.with_instructions("You are a helpful coding assistant. Use these coding conventions: [long list of conventions]",
cache_control: true)

# First query - will create a cache
response1 = chat.ask("How do I write a Ruby class for a bank account?")
puts "Cache creation tokens: #{response1.cache_creation_input_tokens}"

# Second query - will use the cache
response2 = chat.ask("Can you show me how to add a transfer method to that class?")
puts "Cache read tokens: #{response2.cache_read_input_tokens}"
```
12 changes: 9 additions & 3 deletions lib/ruby_llm/chat.rb
@@ -26,16 +26,22 @@ def initialize(model: nil, provider: nil)
end

def ask(message = nil, with: {}, &block)
add_message role: :user, content: Content.new(message, with)
# Extract cache_control from the with hash if present
Author comment: I'm unsure if this is how we want to handle things. This essentially adds global functionality for an Anthropic-specific concept.

cache_control = with.delete(:cache_control)

# Create a new Content object with the message and attachments
content = Content.new(message, with.merge(cache_control: cache_control))

add_message role: :user, content: content
complete(&block)
end

alias say ask

def with_instructions(instructions, replace: false)
def with_instructions(instructions, replace: false, cache_control: nil)
@messages = @messages.reject! { |msg| msg.role == :system } if replace

add_message role: :system, content: instructions
add_message role: :system, content: Content.new(instructions, cache_control: cache_control)
self
end

5 changes: 4 additions & 1 deletion lib/ruby_llm/content.rb
@@ -5,9 +5,12 @@ module RubyLLM
# Stores data in a standard internal format, letting providers
# handle their own formatting needs.
class Content
attr_reader :cache_control

def initialize(text = nil, attachments = {}) # rubocop:disable Metrics/AbcSize,Metrics/MethodLength
@parts = []
@parts << { type: 'text', text: text } unless text.nil? || text.empty?
@cache_control = attachments[:cache_control]

Array(attachments[:image]).each do |source|
@parts << attach_image(source)
@@ -29,7 +32,7 @@ def to_a
end

def format
return @parts.first[:text] if @parts.size == 1 && @parts.first[:type] == 'text'
return @parts.first[:text] if @parts.size == 1 && @parts.first[:type] == 'text' && @cache_control.nil?

to_a
end
9 changes: 7 additions & 2 deletions lib/ruby_llm/message.rb
@@ -7,7 +7,8 @@ module RubyLLM
class Message
ROLES = %i[system user assistant tool].freeze

attr_reader :role, :content, :tool_calls, :tool_call_id, :input_tokens, :output_tokens, :model_id
attr_reader :role, :content, :tool_calls, :tool_call_id, :input_tokens, :output_tokens, :model_id,
:cache_creation_input_tokens, :cache_read_input_tokens

def initialize(options = {})
@role = options[:role].to_sym
Expand All @@ -17,6 +18,8 @@ def initialize(options = {})
@output_tokens = options[:output_tokens]
@model_id = options[:model_id]
@tool_call_id = options[:tool_call_id]
@cache_creation_input_tokens = options[:cache_creation_input_tokens]
@cache_read_input_tokens = options[:cache_read_input_tokens]

ensure_valid_role
end
@@ -41,7 +44,9 @@ def to_h
tool_call_id: tool_call_id,
input_tokens: input_tokens,
output_tokens: output_tokens,
model_id: model_id
model_id: model_id,
cache_creation_input_tokens: cache_creation_input_tokens,
cache_read_input_tokens: cache_read_input_tokens
}.compact
end

23 changes: 23 additions & 0 deletions lib/ruby_llm/providers/anthropic/capabilities.rb
@@ -67,6 +67,29 @@ def supports_json_mode?(model_id)
def supports_extended_thinking?(model_id)
model_id.match?(/claude-3-7-sonnet/)
end

Author comment: I imagine there is an existing philosophy for how to use this data that I am unaware of.

# Determines if a model supports prompt caching
# @param model_id [String] the model identifier
# @return [Boolean] true if the model supports prompt caching
def supports_caching?(model_id)
model_id.match?(/claude-3(?:-[357])?(?:-(?:opus|sonnet|haiku))/)
end

# Gets the cache write price per million tokens for a given model
# @param model_id [String] the model identifier
# @return [Float] the price per million tokens for cache writes
def cache_write_price_for(model_id)
# Cache write tokens are 25% more expensive than base input tokens
get_input_price(model_id) * 1.25
end

# Gets the cache hit price per million tokens for a given model
# @param model_id [String] the model identifier
# @return [Float] the price per million tokens for cache hits
def cache_hit_price_for(model_id)
# Cache read tokens are 90% cheaper than base input tokens
get_input_price(model_id) * 0.1
end

# Determines the model family for a given model ID
# @param model_id [String] the model identifier
4 changes: 3 additions & 1 deletion lib/ruby_llm/providers/anthropic/chat.rb
@@ -72,6 +72,8 @@ def build_message(data, content, tool_use)
tool_calls: parse_tool_calls(tool_use),
input_tokens: data.dig('usage', 'input_tokens'),
output_tokens: data.dig('usage', 'output_tokens'),
cache_creation_input_tokens: data.dig('usage', 'cache_creation_input_tokens'),
cache_read_input_tokens: data.dig('usage', 'cache_read_input_tokens'),
model_id: data['model']
)
end
@@ -89,7 +91,7 @@ def format_message(msg)
def format_basic_message(msg)
{
role: convert_role(msg.role),
content: Media.format_content(msg.content)
content: Media.format_content(msg.content, msg.content.is_a?(Content) ? msg.content.cache_control : nil)
}
end

11 changes: 7 additions & 4 deletions lib/ruby_llm/providers/anthropic/media.rb
@@ -7,7 +7,7 @@ module Anthropic
module Media
module_function

def format_content(content) # rubocop:disable Metrics/MethodLength
def format_content(content, cache_control = nil) # rubocop:disable Metrics/MethodLength
return content unless content.is_a?(Array)

content.map do |part|
@@ -17,7 +17,7 @@ def format_content(content) # rubocop:disable Metrics/MethodLength
when 'pdf'
format_pdf(part)
when 'text'
format_text_block(part[:text])
format_text_block(part[:text], cache_control)
else
part
end
@@ -57,11 +57,14 @@ def format_pdf(part) # rubocop:disable Metrics/MethodLength
end
end

def format_text_block(text)
{
def format_text_block(text, cache_control = nil)
block = {
type: 'text',
text: text
}

block[:cache_control] = { type: 'ephemeral' } if cache_control
block
end
end
end
22 changes: 22 additions & 0 deletions lib/ruby_llm/providers/anthropic/streaming.rb
@@ -18,9 +18,31 @@ def build_chunk(data)
content: data.dig('delta', 'text'),
input_tokens: extract_input_tokens(data),
output_tokens: extract_output_tokens(data),
cache_creation_input_tokens: extract_cache_creation_tokens(data),
cache_read_input_tokens: extract_cache_read_tokens(data),
tool_calls: extract_tool_calls(data)
)
end

def extract_model_id(data)
data['model']
end

def extract_input_tokens(data)
data.dig('usage', 'input_tokens')
end

def extract_output_tokens(data)
data.dig('usage', 'output_tokens')
end

def extract_cache_creation_tokens(data)
data.dig('usage', 'cache_creation_input_tokens')
end

def extract_cache_read_tokens(data)
data.dig('usage', 'cache_read_input_tokens')
end

def json_delta?(data)
data['type'] == 'content_block_delta' && data.dig('delta', 'type') == 'input_json_delta'