
prompt caching SPIKE #109


Closed · wants to merge 1 commit
2 changes: 2 additions & 0 deletions docs/_data/navigation.yml
@@ -13,6 +13,8 @@
url: /guides/tools
- title: Streaming
url: /guides/streaming
- title: Prompt Caching
Author comment: Think Cline got a little overzealous w/ documentation

url: /guides/prompt-caching
- title: Rails Integration
url: /guides/rails
- title: Image Generation
15 changes: 14 additions & 1 deletion docs/guides/index.md
@@ -13,30 +13,43 @@ This section contains detailed guides to help you make the most of RubyLLM. Each
## Available Guides

### [Getting Started]({% link guides/getting-started.md %})

Learn the basics of RubyLLM and get up and running quickly with simple examples.

### [Chat]({% link guides/chat.md %})

Explore the chat interface, which is the primary way to interact with AI models through RubyLLM.

### [Tools]({% link guides/tools.md %})

Learn how to extend AI capabilities by creating tools that let models call your Ruby code.

### [Streaming]({% link guides/streaming.md %})

Understand how to use streaming responses for real-time interactions.

### [Prompt Caching]({% link guides/prompt-caching.md %})

Learn how to use Anthropic's prompt caching feature to reduce token usage and costs.

### [Rails Integration]({% link guides/rails.md %})

See how to integrate RubyLLM with Rails applications, including ActiveRecord persistence.

### [Image Generation]({% link guides/image-generation.md %})

Learn how to generate images using DALL-E and other providers.

### [Embeddings]({% link guides/embeddings.md %})

Explore how to create vector embeddings for semantic search and other applications.

### [Error Handling]({% link guides/error-handling.md %})

Master the techniques for robust error handling in AI applications.

### [Working with Models]({% link guides/models.md %})

Learn how to discover, select, and work with different AI models across providers.

## Getting Help
@@ -45,4 +58,4 @@ If you can't find what you're looking for in these guides, consider:

1. Checking the [API Documentation]() for detailed information about specific classes and methods
2. Looking at the [GitHub repository](https://github.com/crmne/ruby_llm) for examples and the latest updates
3. Filing an issue on GitHub if you find a bug or have a feature request
3. Filing an issue on GitHub if you find a bug or have a feature request
137 changes: 137 additions & 0 deletions docs/guides/prompt-caching.md
@@ -0,0 +1,137 @@
# Prompt Caching

RubyLLM supports Anthropic's prompt caching feature, which lets you cache large, repeated portions of your prompts to reduce processing costs and latency on similar requests.

## What is Prompt Caching?

Prompt caching is a feature that allows you to mark specific parts of your prompt as cacheable. When you make a request with a cached prompt, Anthropic will:

1. Check if the prompt prefix (up to the cache breakpoint) is already cached
2. If found, use the cached version, reducing processing time and costs
3. Otherwise, process the full prompt and cache the prefix

This is especially useful for:

- Prompts with many examples
- Large amounts of context or background information
- Repetitive tasks with consistent instructions
- Long multi-turn conversations
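
Under the hood, the Anthropic provider tags the marked content block with an `ephemeral` `cache_control` entry, and everything up to that breakpoint becomes the cacheable prefix. The sketch below shows roughly what such a message payload looks like; field names follow Anthropic's Messages API, and the surrounding structure is illustrative rather than the exact request RubyLLM builds.

```ruby
# Sketch of an Anthropic message with a cache breakpoint on the large text
# block. Treat the surrounding structure as illustrative, not the exact
# payload RubyLLM sends.
{
  role: 'user',
  content: [
    {
      type: 'text',
      text: "Here's the entire text of Pride and Prejudice: [long text...]",
      cache_control: { type: 'ephemeral' } # marks the end of the cacheable prefix
    }
  ]
}
```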

## Supported Models

Prompt caching is currently supported on the following Anthropic Claude models:

- Claude 3.7 Sonnet
- Claude 3.5 Sonnet
- Claude 3.5 Haiku
- Claude 3 Haiku
- Claude 3 Opus
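
This change also adds a `supports_caching?` helper to the Anthropic capabilities module, so you can guard against unsupported models programmatically. The sketch below assumes the helper is reachable at the constant path implied by `lib/ruby_llm/providers/anthropic/capabilities.rb`; adjust if the module is exposed differently.

```ruby
# Assumed constant path based on the capabilities file location.
model_id = 'claude-3-5-sonnet'
chat = RubyLLM.chat(model: model_id)

if RubyLLM::Providers::Anthropic::Capabilities.supports_caching?(model_id)
  chat.with_instructions('You are a helpful assistant.', cache_control: true)
else
  # Fall back to a plain, uncached system prompt
  chat.with_instructions('You are a helpful assistant.')
end
```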

## How to Use Prompt Caching

To use prompt caching in RubyLLM, you can mark content as cacheable using the `cache_control` parameter:

```ruby
# Create a chat with a Claude model
chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Add a system message with cache control
chat.with_instructions("You are an AI assistant tasked with analyzing literary works.",
cache_control: true)

# Add a large document with cache control
chat.ask("Here's the entire text of Pride and Prejudice: [long text...]",
with: { cache_control: true })

# Now you can ask questions about the document without reprocessing it
chat.ask("Analyze the major themes in Pride and Prejudice.")
```

## Pricing

Prompt caching introduces a different pricing structure:

| Model | Base Input Tokens | Cache Writes | Cache Hits | Output Tokens |
| ----------------- | ----------------- | ------------- | ------------ | ------------- |
| Claude 3.7 Sonnet | $3 / MTok | $3.75 / MTok | $0.30 / MTok | $15 / MTok |
| Claude 3.5 Sonnet | $3 / MTok | $3.75 / MTok | $0.30 / MTok | $15 / MTok |
| Claude 3.5 Haiku | $0.80 / MTok | $1 / MTok | $0.08 / MTok | $4 / MTok |
| Claude 3 Haiku | $0.25 / MTok | $0.30 / MTok | $0.03 / MTok | $1.25 / MTok |
| Claude 3 Opus | $15 / MTok | $18.75 / MTok | $1.50 / MTok | $75 / MTok |

Note:

- Cache write tokens are 25% more expensive than base input tokens
- Cache read tokens are 90% cheaper than base input tokens
- Regular input and output tokens are priced at standard rates
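
To make these multipliers concrete, here is a small illustrative calculation using the Claude 3.5 Sonnet prices from the table above. The arithmetic is plain Ruby and the variable names are invented for this example; none of this is part of the RubyLLM API.

```ruby
# Illustrative cost arithmetic only; not part of the RubyLLM API.
base_input_per_mtok  = 3.0                        # $ / MTok, Claude 3.5 Sonnet
cache_write_per_mtok = base_input_per_mtok * 1.25 # => 3.75
cache_hit_per_mtok   = base_input_per_mtok * 0.10 # => 0.30

# Cache a 100k-token prefix once, then reuse it across 20 follow-up requests.
prefix_mtok = 100_000 / 1_000_000.0

cached_cost   = prefix_mtok * cache_write_per_mtok + 20 * prefix_mtok * cache_hit_per_mtok
uncached_cost = 21 * prefix_mtok * base_input_per_mtok

puts format('with caching: $%.3f, without: $%.2f', cached_cost, uncached_cost)
# => with caching: $0.975, without: $6.30
```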

## Tracking Cache Performance

When using prompt caching, you can track the cache performance using the following fields in the response:

```ruby
response = chat.ask("What are the main characters in Pride and Prejudice?")

puts "Cache creation tokens: #{response.cache_creation_input_tokens}"
puts "Cache read tokens: #{response.cache_read_input_tokens}"
puts "Regular input tokens: #{response.input_tokens}"
puts "Output tokens: #{response.output_tokens}"
```

## Cache Limitations

- The minimum cacheable prompt length is:
  - 1024 tokens for Claude 3.7 Sonnet, Claude 3.5 Sonnet, and Claude 3 Opus
  - 2048 tokens for Claude 3.5 Haiku and Claude 3 Haiku
- Shorter prompts cannot be cached, even if marked with `cache_control`
- The cache has a minimum 5-minute lifetime
- Cache hits require 100% identical prompt segments

## Best Practices

- Place static content (system instructions, context, examples) at the beginning of your prompt
- Mark the end of the reusable content for caching using the `cache_control` parameter
- Use cache breakpoints strategically to separate different cacheable prefix sections
- Regularly analyze cache hit rates and adjust your strategy as needed

## Example: Document Analysis

```ruby
# Create a chat with Claude
chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Add system instructions with cache control
chat.with_instructions("You are an AI assistant tasked with analyzing documents.",
cache_control: true)

# Add a PDF document with cache control
chat.ask("Please analyze this document:",
         # Reviewer note (Author): this API seems fine to me. We need to tag
         # the messages with "this is where to cache", so the `with` option
         # seems like a fine place for that.
         with: { pdf: "large_document.pdf", cache_control: true })

# First query - will create a cache
response1 = chat.ask("What are the main points in the executive summary?")
puts "Cache creation tokens: #{response1.cache_creation_input_tokens}"

# Second query - will use the cache
response2 = chat.ask("Who are the key stakeholders mentioned?")
puts "Cache read tokens: #{response2.cache_read_input_tokens}"
```

## Example: Multi-turn Conversation

```ruby
# Create a chat with Claude
chat = RubyLLM.chat(model: 'claude-3-5-sonnet')

# Add system instructions with cache control
chat.with_instructions("You are a helpful coding assistant. Use these coding conventions: [long list of conventions]",
cache_control: true)

# First query - will create a cache
response1 = chat.ask("How do I write a Ruby class for a bank account?")
puts "Cache creation tokens: #{response1.cache_creation_input_tokens}"

# Second query - will use the cache
response2 = chat.ask("Can you show me how to add a transfer method to that class?")
puts "Cache read tokens: #{response2.cache_read_input_tokens}"
```
12 changes: 9 additions & 3 deletions lib/ruby_llm/chat.rb
@@ -26,16 +26,22 @@ def initialize(model: nil, provider: nil)
end

def ask(message = nil, with: {}, &block)
add_message role: :user, content: Content.new(message, with)
# Extract cache_control from the with hash if present
Author comment: I'm unsure if this is how we want to handle things. This essentially adds global functionality for an Anthropic-specific concept.

cache_control = with.delete(:cache_control)

# Create a new Content object with the message and attachments
content = Content.new(message, with.merge(cache_control: cache_control))

add_message role: :user, content: content
complete(&block)
end

alias say ask

def with_instructions(instructions, replace: false)
def with_instructions(instructions, replace: false, cache_control: nil)
@messages = @messages.reject! { |msg| msg.role == :system } if replace

add_message role: :system, content: instructions
add_message role: :system, content: Content.new(instructions, cache_control: cache_control)
self
end

5 changes: 4 additions & 1 deletion lib/ruby_llm/content.rb
@@ -5,9 +5,12 @@ module RubyLLM
# Stores data in a standard internal format, letting providers
# handle their own formatting needs.
class Content
attr_reader :cache_control

def initialize(text = nil, attachments = {}) # rubocop:disable Metrics/AbcSize,Metrics/MethodLength
@parts = []
@parts << { type: 'text', text: text } unless text.nil? || text.empty?
@cache_control = attachments[:cache_control]

Array(attachments[:image]).each do |source|
@parts << attach_image(source)
@@ -29,7 +32,7 @@ def to_a
end

def format
return @parts.first[:text] if @parts.size == 1 && @parts.first[:type] == 'text'
return @parts.first[:text] if @parts.size == 1 && @parts.first[:type] == 'text' && @cache_control.nil?

to_a
end
9 changes: 7 additions & 2 deletions lib/ruby_llm/message.rb
@@ -7,7 +7,8 @@ module RubyLLM
class Message
ROLES = %i[system user assistant tool].freeze

attr_reader :role, :content, :tool_calls, :tool_call_id, :input_tokens, :output_tokens, :model_id
attr_reader :role, :content, :tool_calls, :tool_call_id, :input_tokens, :output_tokens, :model_id,
:cache_creation_input_tokens, :cache_read_input_tokens

def initialize(options = {})
@role = options[:role].to_sym
Expand All @@ -17,6 +18,8 @@ def initialize(options = {})
@output_tokens = options[:output_tokens]
@model_id = options[:model_id]
@tool_call_id = options[:tool_call_id]
@cache_creation_input_tokens = options[:cache_creation_input_tokens]
@cache_read_input_tokens = options[:cache_read_input_tokens]

ensure_valid_role
end
@@ -41,7 +44,9 @@ def to_h
tool_call_id: tool_call_id,
input_tokens: input_tokens,
output_tokens: output_tokens,
model_id: model_id
model_id: model_id,
cache_creation_input_tokens: cache_creation_input_tokens,
cache_read_input_tokens: cache_read_input_tokens
}.compact
end

23 changes: 23 additions & 0 deletions lib/ruby_llm/providers/anthropic/capabilities.rb
@@ -67,6 +67,29 @@ def supports_json_mode?(model_id)
def supports_extended_thinking?(model_id)
model_id.match?(/claude-3-7-sonnet/)
end

Author comment: I imagine there is an existing philosophy for how to use this data that I am unaware of.

# Determines if a model supports prompt caching
# @param model_id [String] the model identifier
# @return [Boolean] true if the model supports prompt caching
def supports_caching?(model_id)
model_id.match?(/claude-3(?:-[357])?(?:-(?:opus|sonnet|haiku))/)
end

# Gets the cache write price per million tokens for a given model
# @param model_id [String] the model identifier
# @return [Float] the price per million tokens for cache writes
def cache_write_price_for(model_id)
# Cache write tokens are 25% more expensive than base input tokens
get_input_price(model_id) * 1.25
end

# Gets the cache hit price per million tokens for a given model
# @param model_id [String] the model identifier
# @return [Float] the price per million tokens for cache hits
def cache_hit_price_for(model_id)
# Cache read tokens are 90% cheaper than base input tokens
get_input_price(model_id) * 0.1
end

# Determines the model family for a given model ID
# @param model_id [String] the model identifier
4 changes: 3 additions & 1 deletion lib/ruby_llm/providers/anthropic/chat.rb
@@ -72,6 +72,8 @@ def build_message(data, content, tool_use)
tool_calls: parse_tool_calls(tool_use),
input_tokens: data.dig('usage', 'input_tokens'),
output_tokens: data.dig('usage', 'output_tokens'),
cache_creation_input_tokens: data.dig('usage', 'cache_creation_input_tokens'),
cache_read_input_tokens: data.dig('usage', 'cache_read_input_tokens'),
model_id: data['model']
)
end
@@ -89,7 +91,7 @@ def format_message(msg)
def format_basic_message(msg)
{
role: convert_role(msg.role),
content: Media.format_content(msg.content)
content: Media.format_content(msg.content, msg.content.is_a?(Content) ? msg.content.cache_control : nil)
}
end

11 changes: 7 additions & 4 deletions lib/ruby_llm/providers/anthropic/media.rb
@@ -7,7 +7,7 @@ module Anthropic
module Media
module_function

def format_content(content) # rubocop:disable Metrics/MethodLength
def format_content(content, cache_control = nil) # rubocop:disable Metrics/MethodLength
return content unless content.is_a?(Array)

content.map do |part|
@@ -17,7 +17,7 @@ def format_content(content) # rubocop:disable Metrics/MethodLength
when 'pdf'
format_pdf(part)
when 'text'
format_text_block(part[:text])
format_text_block(part[:text], cache_control)
else
part
end
@@ -57,11 +57,14 @@ def format_pdf(part) # rubocop:disable Metrics/MethodLength
end
end

def format_text_block(text)
{
def format_text_block(text, cache_control = nil)
block = {
type: 'text',
text: text
}

block[:cache_control] = { type: 'ephemeral' } if cache_control
block
end
end
end
22 changes: 22 additions & 0 deletions lib/ruby_llm/providers/anthropic/streaming.rb
@@ -18,9 +18,31 @@ def build_chunk(data)
content: data.dig('delta', 'text'),
input_tokens: extract_input_tokens(data),
output_tokens: extract_output_tokens(data),
cache_creation_input_tokens: extract_cache_creation_tokens(data),
cache_read_input_tokens: extract_cache_read_tokens(data),
tool_calls: extract_tool_calls(data)
)
end

def extract_model_id(data)
data['model']
end

def extract_input_tokens(data)
data.dig('usage', 'input_tokens')
end

def extract_output_tokens(data)
data.dig('usage', 'output_tokens')
end

def extract_cache_creation_tokens(data)
data.dig('usage', 'cache_creation_input_tokens')
end

def extract_cache_read_tokens(data)
data.dig('usage', 'cache_read_input_tokens')
end

def json_delta?(data)
data['type'] == 'content_block_delta' && data.dig('delta', 'type') == 'input_json_delta'