Commits (59, showing changes from 13 commits)
2e84006
13: Failing specs
tpaulshippy Jun 9, 2025
be61e48
13: Get caching specs passing for Bedrock
tpaulshippy Jun 9, 2025
edec138
13: Remove comments in specs
tpaulshippy Jun 9, 2025
971f176
13: Add unused param on other providers
tpaulshippy Jun 9, 2025
557a5ee
13: Rubocop -A
tpaulshippy Jun 9, 2025
9673b13
13: Add cassettes for bedrock cache specs
tpaulshippy Jun 9, 2025
c47d270
13: Resolve Rubocop aside from Metrics/ParameterLists
tpaulshippy Jun 9, 2025
eaf0876
13: Use large enough prompt to hit cache meaningfully
tpaulshippy Jun 9, 2025
160d9ab
13: Ensure cache tokens are being used
tpaulshippy Jun 9, 2025
d1698bf
13: Refactor completion parameters
tpaulshippy Jun 9, 2025
344729f
16: Add guide for prompt caching
tpaulshippy Jun 9, 2025
7b98277
Add real anthropic cassettes ($0.03)
tpaulshippy Jun 12, 2025
fd30f14
Merge branch 'main' into prompt-caching
tpaulshippy Jun 12, 2025
a91d07e
Switch from large_prompt.txt to 10,000 of the letter a
tpaulshippy Jul 19, 2025
f40f37d
Make that 2048 * 4 (2048 tokens for Haiku)
tpaulshippy Jul 19, 2025
109bb51
Rename properties on message class
tpaulshippy Jul 19, 2025
1c6cbf7
Revert "13: Refactor completion parameters"
tpaulshippy Jul 19, 2025
4d78a09
Address rubocop
tpaulshippy Jul 19, 2025
25b3660
Merge remote-tracking branch 'origin/main' into prompt-caching
tpaulshippy Jul 19, 2025
8e80f08
Update docs
tpaulshippy Jul 19, 2025
d42d074
Actually return the payload
tpaulshippy Jul 19, 2025
97b1ace
Add support for cache token counts in gemini and openai
tpaulshippy Jul 19, 2025
269122e
Merge branch 'main' into prompt-caching
tpaulshippy Jul 24, 2025
2c88266
Improve specs to do double calls and check cached tokens
tpaulshippy Jul 29, 2025
8c39dc1
Do the double call in the openai/gemini specs
tpaulshippy Jul 29, 2025
24cdb63
Set cache control on last message only
tpaulshippy Jul 29, 2025
97bde47
Merge branch 'main' into prompt-caching
tpaulshippy Jul 29, 2025
8aff99a
Merge branch 'main' into prompt-caching
tpaulshippy Aug 3, 2025
7c5d792
Fix some merge issues
tpaulshippy Aug 3, 2025
2d49d5f
Get openai prompt cache reporting to work
tpaulshippy Aug 3, 2025
013b527
Fix gemini prompt caching reporting
tpaulshippy Aug 3, 2025
9dbdd12
Add comment about why gemini is special
tpaulshippy Aug 3, 2025
5f6b9b3
Resolve rubocop offenses
tpaulshippy Aug 3, 2025
f591ab1
Merge branch 'main' into prompt-caching
tpaulshippy Aug 7, 2025
dd7abc9
Merge branch 'main' into prompt-caching
tpaulshippy Aug 8, 2025
ace160c
Merge branch 'main' into prompt-caching
tpaulshippy Aug 14, 2025
74846b2
Merge branch 'main' into prompt-caching
tpaulshippy Aug 14, 2025
91032de
Clean up the aaaaaaaaaaaa prompts in VCRs
tpaulshippy Aug 14, 2025
05cc1d9
Reduce line length
tpaulshippy Aug 14, 2025
f861b63
Support caching in rails model
tpaulshippy Aug 15, 2025
f923385
Merge branch 'main' into prompt-caching
tpaulshippy Aug 25, 2025
970deba
Merge branch 'main' into prompt-caching
tpaulshippy Aug 28, 2025
010f889
Merge branch 'main' into prompt-caching
tpaulshippy Aug 29, 2025
5c31698
Merge branch 'main' into prompt-caching
tpaulshippy Aug 30, 2025
00e69ae
Merge branch 'main' into prompt-caching
tpaulshippy Sep 8, 2025
d6f36f3
Add with_provider_options and use that for opting into caching
tpaulshippy Sep 8, 2025
31b8b0e
Remove unused hash and add example to doc
tpaulshippy Sep 8, 2025
7b8b280
Remove extra unnecessary comment
tpaulshippy Sep 8, 2025
9a0ec36
Update appraisal gemfiles
tpaulshippy Sep 8, 2025
d833156
Revert "Update appraisal gemfiles"
tpaulshippy Sep 22, 2025
c90d6cd
Revert "Remove extra unnecessary comment"
tpaulshippy Sep 22, 2025
a16d6dd
Revert "Remove unused hash and add example to doc"
tpaulshippy Sep 22, 2025
2e586e1
Revert "Add with_provider_options and use that for opting into caching"
tpaulshippy Sep 22, 2025
7f30f58
Merge branch 'main' into prompt-caching
tpaulshippy Sep 22, 2025
4fdc805
Update docs to reflect new API
tpaulshippy Sep 22, 2025
f5c3825
Take cache setting as parameter
tpaulshippy Sep 22, 2025
581a568
Update specs and refactor a bit
tpaulshippy Sep 22, 2025
3da7f26
Get specs passing
tpaulshippy Sep 22, 2025
7e6fa0d
Update appraisal gemfiles
tpaulshippy Sep 22, 2025
398 changes: 398 additions & 0 deletions docs/guides/prompt-caching.md
@@ -0,0 +1,398 @@
---
layout: default
title: Prompt Caching
parent: Guides
nav_order: 11
permalink: /guides/prompt-caching
---

# Prompt Caching
{: .no_toc }

Prompt caching allows you to cache frequently used content like system instructions, large documents, or tool definitions to reduce costs and improve response times for subsequent requests.
{: .fs-6 .fw-300 }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

After reading this guide, you will know:

* What prompt caching is and when to use it.
* Which models and providers support prompt caching.
* How to cache system instructions, user messages, and tool definitions.
* How to track caching costs and token usage.
* Best practices for maximizing cache efficiency.

## What is Prompt Caching?

Prompt caching allows AI providers to store and reuse parts of your prompts across multiple requests. When you mark content for caching, the provider stores it in a cache and can reuse it in subsequent requests without reprocessing, leading to:

- **Cost savings**: Cached content is typically charged at 75-90% less than regular input tokens
- **Faster responses**: Cached content doesn't need to be reprocessed
- **Consistent context**: Large documents or instructions remain available across conversations

{: .note }
Prompt caching is currently supported in RubyLLM only for the **Anthropic** and **Bedrock** (Anthropic models) providers. The cache is ephemeral: by default it expires after about 5 minutes without use.

The minimum prompt size for caching varies by model, but content usually needs to be at least around 1024 tokens before caching kicks in.
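
If you're not sure whether your content clears that threshold, a rough character count is a quick sanity check. The helper below is purely illustrative (it assumes roughly four characters per token for English text), not an exact tokenizer:

```ruby
# Rough heuristic only: English prose averages ~4 characters per token.
def roughly_cacheable?(text, minimum_tokens: 1024)
  (text.length / 4) >= minimum_tokens
end

roughly_cacheable?(File.read('docs/coding_guidelines.md')) # => true or false
```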

## Basic Usage

Enable prompt caching using the `cache_prompts` method on your chat instance:

```ruby
chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')

# Enable caching for different types of content
chat.cache_prompts(
  system: true, # Cache system instructions
  user: true,   # Cache user messages
  tools: true   # Cache tool definitions
)
```

## Caching System Instructions

System instructions are ideal for caching when you have lengthy guidelines, documentation, or context that remains consistent across multiple conversations.

```ruby
# Large system prompt that would benefit from caching
CODING_GUIDELINES = <<~INSTRUCTIONS
You are a senior Ruby developer and code reviewer. Follow these detailed guidelines:

## Code Style Guidelines
- Use 2 spaces for indentation, never tabs
- Keep lines under 120 characters
- Use descriptive variable and method names
- Prefer explicit returns in methods
- Use single quotes for strings unless interpolation is needed

## Architecture Principles
- Follow SOLID principles
- Prefer composition over inheritance
- Keep controllers thin, move logic to models or services
- Use dependency injection for better testability

## Testing Requirements
- Write RSpec tests for all new functionality
- Aim for 90%+ test coverage
- Use factories instead of fixtures
- Mock external dependencies

## Security Considerations
- Always validate and sanitize user input
- Use strong parameters in controllers
- Implement proper authentication and authorization
- Never commit secrets or API keys

## Performance Guidelines
- Avoid N+1 queries, use includes/joins
- Index database columns used in WHERE clauses
- Use background jobs for long-running tasks
- Cache expensive computations

[... additional detailed guidelines ...]
INSTRUCTIONS

chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')
chat.with_instructions(CODING_GUIDELINES)
chat.cache_prompts(system: true)

# First request creates the cache
response = chat.ask("Review this Ruby method for potential improvements")
puts "Cache creation tokens: #{response.cache_creation_input_tokens}"

# Subsequent requests use the cached instructions
response = chat.ask("What are the testing requirements for this project?")
puts "Cache read tokens: #{response.cache_read_input_tokens}"
```
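
The savings compound over a longer session. As a quick illustration (reusing the `chat` instance above), you can tally the cached reads across several follow-up questions:

```ruby
# Tally cached reads across several follow-ups on the same chat instance.
questions = [
  'Summarize the security considerations.',
  'Which performance guidelines apply to background jobs?',
  'How should controllers be structured?'
]

total_cached = questions.sum do |question|
  chat.ask(question).cache_read_input_tokens.to_i
end

puts "Cached tokens read across #{questions.size} follow-ups: #{total_cached}"
```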

## Caching Large Documents

When working with large documents, user message caching can significantly reduce costs:

```ruby
# Load a large document (e.g., API documentation, legal contract, research paper)
large_document = File.read('path/to/large_api_documentation.md')

chat = RubyLLM.chat(model: 'claude-3-5-sonnet-20241022')
chat.cache_prompts(user: true)

# First request with the large document
prompt = <<~PROMPT
#{large_document}

Based on the API documentation above, how do I authenticate with this service?
PROMPT

response = chat.ask(prompt)
puts "Document cached. Creation tokens: #{response.cache_creation_input_tokens}"

# Follow-up questions can reference the cached document
response = chat.ask("What are the rate limits for this API?")
puts "Using cached document. Read tokens: #{response.cache_read_input_tokens}"

response = chat.ask("Show me an example of making a POST request to create a user")
puts "Still using cache. Read tokens: #{response.cache_read_input_tokens}"
```
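
If you answer questions about the same document from several places in your code, it can be convenient to wrap this pattern in a small object. The class below is purely illustrative (not part of RubyLLM); it sends the document with the first question and lets the provider's cache serve the follow-ups:

```ruby
# Illustrative wrapper, not part of RubyLLM: sends the document once,
# then relies on the provider cache for follow-up questions.
class CachedDocumentChat
  def initialize(document_path, model: 'claude-3-5-sonnet-20241022')
    @document = File.read(document_path)
    @chat = RubyLLM.chat(model: model)
    @chat.cache_prompts(user: true)
    @primed = false
  end

  def ask(question)
    return @chat.ask(question) if @primed

    @primed = true
    @chat.ask("#{@document}\n\n#{question}")
  end
end

docs = CachedDocumentChat.new('path/to/large_api_documentation.md')
docs.ask('How do I authenticate with this service?')
docs.ask('What are the rate limits for this API?')
```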

## Caching Tool Definitions

When using multiple complex tools, caching their definitions can reduce overhead:

```ruby
# Define complex tools with detailed descriptions
class DatabaseQueryTool < RubyLLM::Tool
  description <<~DESC
    Execute SQL queries against the application database. This tool provides access to:

    - User management tables (users, profiles, permissions)
    - Content tables (posts, comments, media)
    - Analytics tables (events, metrics, reports)
    - Configuration tables (settings, features, experiments)

    Security notes:
    - Only SELECT queries are allowed
    - Results are limited to 1000 rows
    - Sensitive columns are automatically filtered
    - All queries are logged for audit purposes

    Usage examples:
    - Find active users: "SELECT * FROM users WHERE status = 'active'"
    - Get recent posts: "SELECT * FROM posts WHERE created_at > NOW() - INTERVAL 7 DAY"
    - Analyze user engagement: "SELECT COUNT(*) FROM events WHERE event_type = 'login'"
  DESC

  parameter :query, type: 'string', description: 'SQL query to execute'
  parameter :limit, type: 'integer', description: 'Maximum number of rows to return (default: 100)'

  def execute(query:, limit: 100)
    # Implementation here
    { results: [], count: 0 }
  end
end

class FileSystemTool < RubyLLM::Tool
  description <<~DESC
    Access and manipulate files in the application directory. Capabilities include:

    - Reading file contents (text files only)
    - Listing directory contents
    - Searching for files by name or pattern
    - Getting file metadata (size, modified date, permissions)

    Restrictions:
    - Cannot access files outside the application directory
    - Cannot modify, create, or delete files
    - Binary files are not supported
    - Maximum file size: 10MB

    Supported file types:
    - Source code (.rb, .js, .py, .java, etc.)
    - Configuration files (.yml, .json, .xml, etc.)
    - Documentation (.md, .txt, .rst, etc.)
    - Log files (.log, .out, .err)
  DESC

  parameter :action, type: 'string', description: 'Action to perform: read, list, search, or info'
  parameter :path, type: 'string', description: 'File or directory path'
  parameter :pattern, type: 'string', description: 'Search pattern (for search action)'

  def execute(action:, path:, pattern: nil)
    # Implementation here
    { action: action, path: path, result: 'success' }
  end
end

# Set up chat with tool caching
chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')
chat.with_tools(DatabaseQueryTool, FileSystemTool)
chat.cache_prompts(tools: true)

# First request creates the tool cache
response = chat.ask("What tables are available in the database?")
puts "Tools cached. Creation tokens: #{response.cache_creation_input_tokens}"

# Subsequent requests use cached tool definitions
response = chat.ask("Show me the structure of the users table")
puts "Using cached tools. Read tokens: #{response.cache_read_input_tokens}"
```

## Combining Multiple Cache Types

You can cache different types of content simultaneously for maximum efficiency:

```ruby
# Large system context
ANALYSIS_CONTEXT = <<~CONTEXT
You are an expert data analyst working with e-commerce data. Your analysis should consider:

## Business Metrics
- Revenue and profit margins
- Customer acquisition cost (CAC)
- Customer lifetime value (CLV)
- Conversion rates and funnel analysis

## Data Quality Standards
- Check for missing or inconsistent data
- Validate data ranges and formats
- Identify outliers and anomalies
- Ensure temporal consistency

## Reporting Guidelines
- Use clear, business-friendly language
- Include confidence intervals where appropriate
- Highlight actionable insights
- Provide recommendations with supporting evidence

[... extensive analysis guidelines ...]
CONTEXT

# Load large dataset
sales_data = File.read('path/to/large_sales_dataset.csv')

chat = RubyLLM.chat(model: 'claude-3-5-sonnet-20241022')
chat.with_instructions(ANALYSIS_CONTEXT)
chat.with_tools(DatabaseQueryTool, FileSystemTool)

# Enable caching for all content types
chat.cache_prompts(system: true, user: true, tools: true)

# First request caches everything
prompt = <<~PROMPT
#{sales_data}

Analyze the sales data above and provide insights on revenue trends.
PROMPT

response = chat.ask(prompt)
puts "All content cached:"
puts " System cache: #{response.cache_creation_input_tokens} tokens"
puts " Tools cached: #{chat.messages.any? { |m| m.cache_creation_input_tokens&.positive? }}"

# Follow-up requests benefit from all cached content
response = chat.ask("What are the top-performing product categories?")
puts "Cache read tokens: #{response.cache_read_input_tokens}"

response = chat.ask("Query the database to get customer segmentation data")
puts "Cache read tokens: #{response.cache_read_input_tokens}"
```
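
Because each message records its own cache activity, you can also total it for the whole conversation. This assumes message objects expose the same cache token readers used above, as the `chat.messages` check earlier suggests:

```ruby
# Total cache activity for the conversation so far.
# Assumes messages expose cache_creation_input_tokens and cache_read_input_tokens.
created = chat.messages.sum { |m| m.cache_creation_input_tokens.to_i }
read    = chat.messages.sum { |m| m.cache_read_input_tokens.to_i }

puts "Cache created: #{created} tokens, cache read: #{read} tokens"
```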

## Understanding Cache Metrics

RubyLLM provides detailed metrics about cache usage in the response:

```ruby
chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')
chat.with_instructions("Large system prompt here...")
chat.cache_prompts(system: true)

response = chat.ask("Your question here")

# Check if cache was created (first request)
if response.cache_creation_input_tokens&.positive?
  puts "Cache created with #{response.cache_creation_input_tokens} tokens"
  puts "Regular input tokens: #{response.input_tokens - response.cache_creation_input_tokens}"
end

# Check if cache was used (subsequent requests)
if response.cache_read_input_tokens&.positive?
  puts "Cache read: #{response.cache_read_input_tokens} tokens"
  puts "New input tokens: #{response.input_tokens - response.cache_read_input_tokens}"
end

# Total cost calculation (example with Claude pricing)
cache_creation_cost = (response.cache_creation_input_tokens || 0) * 3.75 / 1_000_000 # $3.75 per 1M tokens
cache_read_cost = (response.cache_read_input_tokens || 0) * 0.30 / 1_000_000 # $0.30 per 1M tokens
regular_input_cost = (response.input_tokens - (response.cache_creation_input_tokens || 0) - (response.cache_read_input_tokens || 0)) * 3.00 / 1_000_000
output_cost = response.output_tokens * 15.00 / 1_000_000

total_cost = cache_creation_cost + cache_read_cost + regular_input_cost + output_cost
puts "Total request cost: $#{total_cost.round(6)}"
```
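
If you report these numbers in more than one place, a small helper keeps the call sites tidy. The method below is illustrative rather than part of RubyLLM:

```ruby
# Illustrative helper: summarize cache usage for a single response.
def cache_summary(response)
  created = response.cache_creation_input_tokens.to_i
  read = response.cache_read_input_tokens.to_i

  {
    cache_created_tokens: created,
    cache_read_tokens: read,
    uncached_input_tokens: response.input_tokens - created - read,
    output_tokens: response.output_tokens
  }
end

puts cache_summary(chat.ask('Your question here'))
```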

## Cost Optimization

Prompt caching can significantly reduce costs for applications with repeated content:

```ruby
# Example cost comparison for Claude 3.5 Sonnet
# Regular pricing: $3.00 per 1M input tokens
# Cache creation: $3.75 per 1M tokens (25% premium)
# Cache read: $0.30 per 1M tokens (90% discount)

def calculate_savings(content_tokens, num_requests)
  # Without caching
  regular_cost = content_tokens * num_requests * 3.00 / 1_000_000

  # With caching
  cache_creation_cost = content_tokens * 3.75 / 1_000_000
  cache_read_cost = content_tokens * (num_requests - 1) * 0.30 / 1_000_000
  cached_cost = cache_creation_cost + cache_read_cost

  savings = regular_cost - cached_cost
  savings_percentage = (savings / regular_cost * 100).round(1)

  puts "Content: #{content_tokens} tokens, #{num_requests} requests"
  puts "Regular cost: $#{regular_cost.round(4)}"
  puts "Cached cost: $#{cached_cost.round(4)}"
  puts "Savings: $#{savings.round(4)} (#{savings_percentage}%)"
end

# Examples
calculate_savings(5000, 10) # 5K tokens, 10 requests
calculate_savings(20000, 5) # 20K tokens, 5 requests
calculate_savings(50000, 3) # 50K tokens, 3 requests
```
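
With the example prices above, the 25% creation premium is recovered as soon as the cached content is read once, so caching pays for itself from the second request onward. Here is a quick sketch for checking the break-even point at other price points:

```ruby
# Break-even request count: caching wins once
# creation + read * (n - 1) < regular * n.
def break_even_requests(regular: 3.00, creation: 3.75, read: 0.30)
  premium = creation - regular       # one-time extra cost of the caching request
  saving_per_read = regular - read   # saving on each subsequent request
  (premium / saving_per_read).floor + 2
end

break_even_requests # => 2 with the example prices above
```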

## Troubleshooting

### Cache Not Working
If caching doesn't seem to be working:

1. **Check model support**: Ensure you're using a supported model
2. **Verify provider**: Only Anthropic and Bedrock support caching
3. **Check content size**: Content below the model's minimum token threshold will not be cached
4. **Monitor metrics**: Use `cache_creation_input_tokens` and `cache_read_input_tokens`

```ruby
response = chat.ask("Your question")

if response.cache_creation_input_tokens.to_i.zero? && response.cache_read_input_tokens.to_i.zero?
  puts "No caching occurred. Check:"
  puts " Model: #{chat.model.id}"
  puts " Provider: #{chat.model.provider}"
  puts " Cache settings: #{chat.instance_variable_get(:@cache_prompts)}"
end
```
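
A simple end-to-end check is to send two requests on the same chat: the first should report cache creation and the second a cache read:

```ruby
# Two calls on the same chat: the first should create the cache,
# the second should read from it.
chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')
chat.with_instructions('Large system prompt here...')
chat.cache_prompts(system: true)

first = chat.ask('Give a one-sentence summary of your instructions.')
second = chat.ask('Now repeat that summary.')

puts "First call created: #{first.cache_creation_input_tokens.to_i} tokens"
puts "Second call read: #{second.cache_read_input_tokens.to_i} tokens"
```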

### Unexpected Cache Behavior
Cache behavior can vary based on:

- **Content changes**: Any modification invalidates the cache
- **Cache expiration**: Caches are ephemeral and expire automatically
- **Provider limits**: Each provider has different cache policies

```ruby
# Cache is invalidated by any content change
chat.with_instructions("Original instructions")
chat.cache_prompts(system: true)
response1 = chat.ask("Question 1") # Creates cache

chat.with_instructions("Modified instructions", replace: true)
response2 = chat.ask("Question 2") # Creates new cache (old one invalidated)
```

## What's Next?

Now that you understand prompt caching, explore these related topics:

* [Working with Models]({% link guides/models.md %}) - Learn about model capabilities and selection
* [Using Tools]({% link guides/tools.md %}) - Understand tool definitions that can be cached
* [Error Handling]({% link guides/error-handling.md %}) - Handle caching-related errors gracefully
* [Rails Integration]({% link guides/rails.md %}) - Use caching in Rails applications