Prompt Caching #234
Open
tpaulshippy wants to merge 59 commits into crmne:main from tpaulshippy:prompt-caching
Changes from 13 commits

Commits (59, all by tpaulshippy):
2e84006  13: Failing specs
be61e48  13: Get caching specs passing for Bedrock
edec138  13: Remove comments in specs
971f176  13: Add unused param on other providers
557a5ee  13: Rubocop -A
9673b13  13: Add cassettes for bedrock cache specs
c47d270  13: Resolve Rubocop aside from Metrics/ParameterLists
eaf0876  13: Use large enough prompt to hit cache meaningfully
160d9ab  13: Ensure cache tokens are being used
d1698bf  13: Refactor completion parameters
344729f  16: Add guide for prompt caching
7b98277  Add real anthropic cassettes ($0.03)
fd30f14  Merge branch 'main' into prompt-caching
a91d07e  Switch from large_prompt.txt to 10,000 of the letter a
f40f37d  Make that 2048 * 4 (2048 tokens for Haiku)
109bb51  Rename properties on message class
1c6cbf7  Revert "13: Refactor completion parameters"
4d78a09  Address rubocop
25b3660  Merge remote-tracking branch 'origin/main' into prompt-caching
8e80f08  Update docs
d42d074  Actually return the payload
97b1ace  Add support for cache token counts in gemini and openai
269122e  Merge branch 'main' into prompt-caching
2c88266  Improve specs to do double calls and check cached tokens
8c39dc1  Do the double call in the openai/gemini specs
24cdb63  Set cache control on last message only
97bde47  Merge branch 'main' into prompt-caching
8aff99a  Merge branch 'main' into prompt-caching
7c5d792  Fix some merge issues
2d49d5f  Get openai prompt cache reporting to work
013b527  Fix gemini prompt caching reporting
9dbdd12  Add comment about why gemini is special
5f6b9b3  Resolve rubocop offenses
f591ab1  Merge branch 'main' into prompt-caching
dd7abc9  Merge branch 'main' into prompt-caching
ace160c  Merge branch 'main' into prompt-caching
74846b2  Merge branch 'main' into prompt-caching
91032de  Clean up the aaaaaaaaaaaa prompts in VCRs
05cc1d9  Reduce line length
f861b63  Support caching in rails model
f923385  Merge branch 'main' into prompt-caching
970deba  Merge branch 'main' into prompt-caching
010f889  Merge branch 'main' into prompt-caching
5c31698  Merge branch 'main' into prompt-caching
00e69ae  Merge branch 'main' into prompt-caching
d6f36f3  Add with_provider_options and use that for opting into caching
31b8b0e  Remove unused hash and add example to doc
7b8b280  Remove extra unnecessary comment
9a0ec36  Update appraisal gemfiles
d833156  Revert "Update appraisal gemfiles"
c90d6cd  Revert "Remove extra unnecessary comment"
a16d6dd  Revert "Remove unused hash and add example to doc"
2e586e1  Revert "Add with_provider_options and use that for opting into caching"
7f30f58  Merge branch 'main' into prompt-caching
4fdc805  Update docs to reflect new API
f5c3825  Take cache setting as parameter
581a568  Update specs and refactor a bit
3da7f26  Get specs passing
7e6fa0d  Update appraisal gemfiles
@@ -0,0 +1,398 @@
---
layout: default
title: Prompt Caching
parent: Guides
nav_order: 11
permalink: /guides/prompt-caching
---

# Prompt Caching
{: .no_toc }

Prompt caching allows you to cache frequently used content like system instructions, large documents, or tool definitions to reduce costs and improve response times for subsequent requests.
{: .fs-6 .fw-300 }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

After reading this guide, you will know:

* What prompt caching is and when to use it.
* Which models and providers support prompt caching.
* How to cache system instructions, user messages, and tool definitions.
* How to track caching costs and token usage.
* Best practices for maximizing cache efficiency.
## What is Prompt Caching?

Prompt caching allows AI providers to store and reuse parts of your prompts across multiple requests. When you mark content for caching, the provider stores it in a cache and can reuse it in subsequent requests without reprocessing, leading to:

- **Cost savings**: Cached content is typically charged at 75-90% less than regular input tokens
- **Faster responses**: Cached content doesn't need to be reprocessed
- **Consistent context**: Large documents or instructions remain available across conversations

{: .note }
Prompt caching is currently supported in RubyLLM only for the **Anthropic** and **Bedrock** (Anthropic models) providers. The cache is ephemeral: by default, it expires after 5 minutes without use.

Each model has its own minimum prompt size before caching kicks in, but the threshold is usually around 1024 tokens worth of content. Anything below the minimum is processed normally and never cached.
||
## Basic Usage | ||
|
||
Enable prompt caching using the `cache_prompts` method on your chat instance: | ||
|
||
```ruby | ||
chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022') | ||
|
||
# Enable caching for different types of content | ||
chat.cache_prompts( | ||
system: true, # Cache system instructions | ||
user: true, # Cache user messages | ||
tools: true # Cache tool definitions | ||
) | ||
``` | ||
|
||
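Based on the commit history ("13: Add unused param on other providers"), providers without explicit caching support appear to accept these flags and simply ignore them, so enabling caching should be safe across providers. Treat this as an inference from the commits rather than documented behavior:

```ruby
# Assumption from commit "13: Add unused param on other providers":
# providers without explicit prompt caching accept the flags and
# ignore them rather than raising an error.
chat = RubyLLM.chat(model: 'gpt-4o-mini') # hypothetical non-Anthropic model
chat.cache_prompts(system: true)          # expected to be a no-op here
```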
## Caching System Instructions

System instructions are ideal for caching when you have lengthy guidelines, documentation, or context that remains consistent across multiple conversations.

```ruby
# Large system prompt that would benefit from caching
CODING_GUIDELINES = <<~INSTRUCTIONS
  You are a senior Ruby developer and code reviewer. Follow these detailed guidelines:

  ## Code Style Guidelines
  - Use 2 spaces for indentation, never tabs
  - Keep lines under 120 characters
  - Use descriptive variable and method names
  - Prefer explicit returns in methods
  - Use single quotes for strings unless interpolation is needed

  ## Architecture Principles
  - Follow SOLID principles
  - Prefer composition over inheritance
  - Keep controllers thin, move logic to models or services
  - Use dependency injection for better testability

  ## Testing Requirements
  - Write RSpec tests for all new functionality
  - Aim for 90%+ test coverage
  - Use factories instead of fixtures
  - Mock external dependencies

  ## Security Considerations
  - Always validate and sanitize user input
  - Use strong parameters in controllers
  - Implement proper authentication and authorization
  - Never commit secrets or API keys

  ## Performance Guidelines
  - Avoid N+1 queries, use includes/joins
  - Index database columns used in WHERE clauses
  - Use background jobs for long-running tasks
  - Cache expensive computations

  [... additional detailed guidelines ...]
INSTRUCTIONS

chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')
chat.with_instructions(CODING_GUIDELINES)
chat.cache_prompts(system: true)

# First request creates the cache
response = chat.ask("Review this Ruby method for potential improvements")
puts "Cache creation tokens: #{response.cache_creation_input_tokens}"

# Subsequent requests use the cached instructions
response = chat.ask("What are the testing requirements for this project?")
puts "Cache read tokens: #{response.cache_read_input_tokens}"
```
## Caching Large Documents

When working with large documents, user message caching can significantly reduce costs:

```ruby
# Load a large document (e.g., API documentation, legal contract, research paper)
large_document = File.read('path/to/large_api_documentation.md')

chat = RubyLLM.chat(model: 'claude-3-5-sonnet-20241022')
chat.cache_prompts(user: true)

# First request with the large document
prompt = <<~PROMPT
  #{large_document}

  Based on the API documentation above, how do I authenticate with this service?
PROMPT

response = chat.ask(prompt)
puts "Document cached. Creation tokens: #{response.cache_creation_input_tokens}"

# Follow-up questions can reference the cached document
response = chat.ask("What are the rate limits for this API?")
puts "Using cached document. Read tokens: #{response.cache_read_input_tokens}"

response = chat.ask("Show me an example of making a POST request to create a user")
puts "Still using cache. Read tokens: #{response.cache_read_input_tokens}"
```
## Caching Tool Definitions

When using multiple complex tools, caching their definitions can reduce overhead:

```ruby
# Define complex tools with detailed descriptions
class DatabaseQueryTool < RubyLLM::Tool
  description <<~DESC
    Execute SQL queries against the application database. This tool provides access to:

    - User management tables (users, profiles, permissions)
    - Content tables (posts, comments, media)
    - Analytics tables (events, metrics, reports)
    - Configuration tables (settings, features, experiments)

    Security notes:
    - Only SELECT queries are allowed
    - Results are limited to 1000 rows
    - Sensitive columns are automatically filtered
    - All queries are logged for audit purposes

    Usage examples:
    - Find active users: "SELECT * FROM users WHERE status = 'active'"
    - Get recent posts: "SELECT * FROM posts WHERE created_at > NOW() - INTERVAL 7 DAY"
    - Analyze user engagement: "SELECT COUNT(*) FROM events WHERE event_type = 'login'"
  DESC

  param :query, desc: 'SQL query to execute'
  param :limit, type: 'integer', desc: 'Maximum number of rows to return (default: 100)'

  def execute(query:, limit: 100)
    # Implementation here
    { results: [], count: 0 }
  end
end

class FileSystemTool < RubyLLM::Tool
  description <<~DESC
    Access and manipulate files in the application directory. Capabilities include:

    - Reading file contents (text files only)
    - Listing directory contents
    - Searching for files by name or pattern
    - Getting file metadata (size, modified date, permissions)

    Restrictions:
    - Cannot access files outside the application directory
    - Cannot modify, create, or delete files
    - Binary files are not supported
    - Maximum file size: 10MB

    Supported file types:
    - Source code (.rb, .js, .py, .java, etc.)
    - Configuration files (.yml, .json, .xml, etc.)
    - Documentation (.md, .txt, .rst, etc.)
    - Log files (.log, .out, .err)
  DESC

  param :action, desc: 'Action to perform: read, list, search, or info'
  param :path, desc: 'File or directory path'
  param :pattern, desc: 'Search pattern (for search action)', required: false

  def execute(action:, path:, pattern: nil)
    # Implementation here
    { action: action, path: path, result: 'success' }
  end
end

# Set up chat with tool caching
chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')
chat.with_tools(DatabaseQueryTool, FileSystemTool)
chat.cache_prompts(tools: true)

# First request creates the tool cache
response = chat.ask("What tables are available in the database?")
puts "Tools cached. Creation tokens: #{response.cache_creation_input_tokens}"

# Subsequent requests use cached tool definitions
response = chat.ask("Show me the structure of the users table")
puts "Using cached tools. Read tokens: #{response.cache_read_input_tokens}"
```
## Combining Multiple Cache Types

You can cache different types of content simultaneously for maximum efficiency:

```ruby
# Large system context
ANALYSIS_CONTEXT = <<~CONTEXT
  You are an expert data analyst working with e-commerce data. Your analysis should consider:

  ## Business Metrics
  - Revenue and profit margins
  - Customer acquisition cost (CAC)
  - Customer lifetime value (CLV)
  - Conversion rates and funnel analysis

  ## Data Quality Standards
  - Check for missing or inconsistent data
  - Validate data ranges and formats
  - Identify outliers and anomalies
  - Ensure temporal consistency

  ## Reporting Guidelines
  - Use clear, business-friendly language
  - Include confidence intervals where appropriate
  - Highlight actionable insights
  - Provide recommendations with supporting evidence

  [... extensive analysis guidelines ...]
CONTEXT

# Load large dataset
sales_data = File.read('path/to/large_sales_dataset.csv')

chat = RubyLLM.chat(model: 'claude-3-5-sonnet-20241022')
chat.with_instructions(ANALYSIS_CONTEXT)
chat.with_tools(DatabaseQueryTool, FileSystemTool)

# Enable caching for all content types
chat.cache_prompts(system: true, user: true, tools: true)

# First request caches everything
prompt = <<~PROMPT
  #{sales_data}

  Analyze the sales data above and provide insights on revenue trends.
PROMPT

response = chat.ask(prompt)
puts "All content cached:"
puts "  System cache: #{response.cache_creation_input_tokens} tokens"
puts "  Tools cached: #{chat.messages.any? { |m| m.cache_creation_input_tokens&.positive? }}"

# Follow-up requests benefit from all cached content
response = chat.ask("What are the top-performing product categories?")
puts "Cache read tokens: #{response.cache_read_input_tokens}"

response = chat.ask("Query the database to get customer segmentation data")
puts "Cache read tokens: #{response.cache_read_input_tokens}"
```
## Understanding Cache Metrics

RubyLLM provides detailed metrics about cache usage in the response:

```ruby
chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')
chat.with_instructions("Large system prompt here...")
chat.cache_prompts(system: true)

response = chat.ask("Your question here")

# Check whether a cache was created (first request)
if response.cache_creation_input_tokens&.positive?
  puts "Cache created with #{response.cache_creation_input_tokens} tokens"
  puts "Regular input tokens: #{response.input_tokens - response.cache_creation_input_tokens}"
end

# Check whether the cache was used (subsequent requests)
if response.cache_read_input_tokens&.positive?
  puts "Cache read: #{response.cache_read_input_tokens} tokens"
  puts "New input tokens: #{response.input_tokens - response.cache_read_input_tokens}"
end

# Total cost calculation (example with Claude 3.5 Sonnet pricing)
cache_creation_cost = (response.cache_creation_input_tokens || 0) * 3.75 / 1_000_000 # $3.75 per 1M tokens
cache_read_cost = (response.cache_read_input_tokens || 0) * 0.30 / 1_000_000 # $0.30 per 1M tokens
regular_input_cost = (response.input_tokens - (response.cache_creation_input_tokens || 0) - (response.cache_read_input_tokens || 0)) * 3.00 / 1_000_000
output_cost = response.output_tokens * 15.00 / 1_000_000

total_cost = cache_creation_cost + cache_read_cost + regular_input_cost + output_cost
puts "Total request cost: $#{total_cost.round(6)}"
```
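The commit history also adds cache token reporting for OpenAI and Gemini ("Add support for cache token counts in gemini and openai"), providers that cache automatically rather than via opt-in. A hedged sketch of reading those metrics; the model name is a placeholder and the behavior is inferred from the commit messages, not from the diff shown above:

```ruby
# Assumption based on the commit history: OpenAI/Gemini cache prompts
# automatically, and RubyLLM surfaces their reported cache token counts
# through the same response properties.
chat = RubyLLM.chat(model: 'gpt-4o-mini') # hypothetical model choice
long_prompt = "#{File.read('path/to/large_api_documentation.md')}\n\nSummarize this."

chat.ask(long_prompt)                        # first call: nothing cached yet
response = chat.ask("And the rate limits?")  # may hit the provider's automatic cache
puts "Cached tokens reported: #{response.cache_read_input_tokens.to_i}"
```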
## Cost Optimization

Prompt caching can significantly reduce costs for applications with repeated content:

```ruby
# Example cost comparison for Claude 3.5 Sonnet
# Regular pricing: $3.00 per 1M input tokens
# Cache creation: $3.75 per 1M tokens (25% premium)
# Cache read: $0.30 per 1M tokens (90% discount)

def calculate_savings(content_tokens, num_requests)
  # Without caching: the full content is billed on every request
  regular_cost = content_tokens * num_requests * 3.00 / 1_000_000

  # With caching: pay the creation premium once, then the read rate
  cache_creation_cost = content_tokens * 3.75 / 1_000_000
  cache_read_cost = content_tokens * (num_requests - 1) * 0.30 / 1_000_000
  cached_cost = cache_creation_cost + cache_read_cost

  savings = regular_cost - cached_cost
  savings_percentage = (savings / regular_cost * 100).round(1)

  puts "Content: #{content_tokens} tokens, #{num_requests} requests"
  puts "Regular cost: $#{regular_cost.round(4)}"
  puts "Cached cost: $#{cached_cost.round(4)}"
  puts "Savings: $#{savings.round(4)} (#{savings_percentage}%)"
end

# Examples
calculate_savings(5000, 10)  # 5K tokens, 10 requests
calculate_savings(20_000, 5) # 20K tokens, 5 requests
calculate_savings(50_000, 3) # 50K tokens, 3 requests
```
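To sanity-check the first example with the rates above: 5,000 tokens over 10 requests costs 50,000 × $3.00/1M = $0.15 without caching, versus $0.01875 for cache creation plus 9 × 5,000 × $0.30/1M = $0.0135 in cache reads, about $0.032 in total, roughly a 78% saving.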
## Troubleshooting

### Cache Not Working
If caching doesn't seem to be working:

1. **Check model support**: Ensure you're using a supported model
2. **Verify provider**: Only Anthropic and Bedrock support caching
3. **Check content size**: Content below the model-specific minimum token count will not be cached
4. **Monitor metrics**: Use `cache_creation_input_tokens` and `cache_read_input_tokens`

```ruby
response = chat.ask("Your question")

# The token counts may be nil when no caching occurred, so coerce with to_i
if response.cache_creation_input_tokens.to_i.zero? && response.cache_read_input_tokens.to_i.zero?
  puts "No caching occurred. Check:"
  puts "  Model: #{chat.model.id}"
  puts "  Provider: #{chat.model.provider}"
  puts "  Cache settings: #{chat.instance_variable_get(:@cache_prompts)}"
end
```
### Unexpected Cache Behavior
Cache behavior can vary based on:

- **Content changes**: Any modification invalidates the cache
- **Cache expiration**: Caches are ephemeral and expire automatically
- **Provider limits**: Each provider has different cache policies

```ruby
# The cache is invalidated by any content change
chat.with_instructions("Original instructions")
chat.cache_prompts(system: true)
response1 = chat.ask("Question 1") # Creates cache

chat.with_instructions("Modified instructions", replace: true)
response2 = chat.ask("Question 2") # Creates a new cache (the old one is invalidated)
```
## What's Next?

Now that you understand prompt caching, explore these related topics:

* [Working with Models]({% link guides/models.md %}) - Learn about model capabilities and selection
* [Using Tools]({% link guides/tools.md %}) - Understand tool definitions that can be cached
* [Error Handling]({% link guides/error-handling.md %}) - Handle caching-related errors gracefully
* [Rails Integration]({% link guides/rails.md %}) - Use caching in Rails applications (see the sketch below)
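The commit history includes f861b63 "Support caching in rails model". A minimal sketch of what that might look like, assuming the standard `acts_as_chat` setup and assuming the persisted chat exposes the same `cache_prompts` method as the plain chat object; the delegation is an inference from the commit message, not confirmed by the diff above:

```ruby
# Hypothetical sketch: prompt caching via the Rails integration.
# Assumes acts_as_chat delegates cache_prompts to the underlying
# RubyLLM chat, per commit f861b63 "Support caching in rails model".
class Chat < ApplicationRecord
  acts_as_chat
end

chat = Chat.create!(model_id: 'claude-3-5-haiku-20241022')
chat.cache_prompts(system: true)
chat.with_instructions('...large system prompt...')
chat.ask('First question')  # creates the cache
chat.ask('Second question') # should read from the cache
```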