
Add video input file support #260


Open · wants to merge 5 commits into base: main
8 changes: 5 additions & 3 deletions README.md
@@ -43,11 +43,12 @@ RubyLLM fixes all that. One beautiful API for everything. One consistent format.

```ruby
# Just ask questions
chat = RubyLLM.chat
chat = RubyLLM.chat(model: "gemini-2.0-flash")
chat.ask "What's the best way to learn Ruby?"

# Analyze images, audio, documents, and text files
# Analyze images, videos, audio, documents, and text files
chat.ask "What's in this image?", with: "ruby_conf.jpg"
chat.ask "What's happening in this video?", with: "presentation.mp4"
chat.ask "Describe this meeting", with: "meeting.wav"
chat.ask "Summarize this document", with: "contract.pdf"
chat.ask "Explain this code", with: "app.rb"
@@ -88,7 +89,8 @@ chat.with_tool(Weather).ask "What's the weather in Berlin? (52.5200, 13.4050)"
## Core Capabilities

* 💬 **Unified Chat:** Converse with models from OpenAI, Anthropic, Gemini, Bedrock, OpenRouter, DeepSeek, Ollama, or any OpenAI-compatible API using `RubyLLM.chat`.
* 👁️ **Vision:** Analyze images within chats.
* 👁️ **Vision:** Analyze images and documents within chats.
* 🎞️ **Video:** Analyze videos within chats.
* 🔊 **Audio:** Transcribe and understand audio content.
* 📄 **Document Analysis:** Extract information from PDFs, text files, and other documents.
* 🖼️ **Image Generation:** Create images with `RubyLLM.paint`.
27 changes: 26 additions & 1 deletion docs/guides/chat.md
@@ -119,7 +119,7 @@ RubyLLM manages a registry of known models and their capabilities. For detailed

## Multi-modal Conversations

Modern AI models can often process more than just text. RubyLLM provides a unified way to include images, audio, text files, and PDFs in your chat messages using the `with:` option in the `ask` method.
Modern AI models can often process more than just text. RubyLLM provides a unified way to include images, videos, audio, text files, and PDFs in your chat messages using the `with:` option in the `ask` method.

### Working with Images

@@ -144,6 +144,30 @@ puts response.content

RubyLLM handles converting the image source into the format required by the specific provider API.

### Working with Videos

You can also analyze local video files or video URLs with models that accept video input. RubyLLM automatically detects video files and handles them appropriately.

```ruby
# Ask about a local video file
chat = RubyLLM.chat(model: 'gemini-2.0-flash')
response = chat.ask "What happens in this video?", with: "path/to/demo.mp4"
puts response.content

# Ask about a video from a URL
response = chat.ask "Summarize the main events in this video.", with: "https://example.com/demo_video.mp4"
puts response.content

# Combine videos with other file types
response = chat.ask "Analyze these files for visual content.", with: ["diagram.png", "demo.mp4", "notes.txt"]
puts response.content
```

**Notes:**
- Supported video formats include .mp4, .mov, .avi, .webm, and others (provider-dependent).
- Only Google Gemini models currently support video input; check the [Available Models Guide]({% link guides/available-models.md %}) for details.
- Large video files may be subject to size or duration limits imposed by the provider.
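
If you want to guard against sending a video to a model that cannot handle it, you can check the model registry first. A minimal sketch (the model id and file path are illustrative):

```ruby
# Guard before attaching a video; model id and path are only examples.
model = RubyLLM.models.find('gemini-2.0-flash')

if model.supports_video?
  chat = RubyLLM.chat(model: model.id)
  puts chat.ask("What happens in this video?", with: "path/to/demo.mp4").content
else
  puts "#{model.id} does not accept video input"
end
```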

### Working with Audio

Provide audio file paths to audio-capable models (like `gpt-4o-audio-preview`).
@@ -224,6 +248,7 @@ response = chat.ask "What's in this image?", with: { image: "photo.jpg" }

**Supported file types:**
- **Images:** .jpg, .jpeg, .png, .gif, .webp, .bmp
- **Videos:** .mp4, .mov, .avi, .webm
- **Audio:** .mp3, .wav, .m4a, .ogg, .flac
- **Documents:** .pdf, .txt, .md, .csv, .json, .xml
- **Code:** .rb, .py, .js, .html, .css (and many others)
2 changes: 1 addition & 1 deletion docs/guides/models.md
@@ -41,7 +41,7 @@ The registry stores crucial information about each model, including:
* **`name`**: A human-friendly name.
* **`context_window`**: Max input tokens (e.g., `128_000`).
* **`max_tokens`**: Max output tokens (e.g., `16_384`).
* **`supports_vision`**: If it can process images.
* **`supports_vision`**: If it can process images and videos.
* **`supports_functions`**: If it can use [Tools]({% link guides/tools.md %}).
* **`input_price_per_million`**: Cost in USD per 1 million input tokens.
* **`output_price_per_million`**: Cost in USD per 1 million output tokens.
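
To make the registry fields listed above concrete, here is a hedged sketch of reading them for one model; the accessor names mirror the field names and the `Model::Info` example later in this diff, and the values in the comments are invented:

```ruby
# Illustrative only: accessor names assumed from the field list, values invented.
model = RubyLLM.models.find('gemini-2.0-flash')

model.name                      # => "Gemini 2.0 Flash" (human-friendly name)
model.context_window            # => max input tokens, e.g. 1_000_000
model.max_tokens                # => max output tokens
model.supports_vision?          # => true if it can process images
model.supports_functions?       # => true if it can use tools
model.input_price_per_million   # => USD per 1 million input tokens
```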
4 changes: 2 additions & 2 deletions docs/guides/rails.md
@@ -117,7 +117,7 @@ Run the migrations: `rails db:migrate`

### ActiveStorage Setup for Attachments (Optional)

If you want to use attachments (images, audio, PDFs) with your AI chats, you need to set up ActiveStorage:
If you want to use attachments (images, videos, audio, PDFs) with your AI chats, you need to set up ActiveStorage:

```bash
# Only needed if you plan to use attachments
@@ -314,7 +314,7 @@ chat_record.ask("Analyze this file", with: params[:uploaded_file])
chat_record.ask("What's in this document?", with: user.profile_document)
```

The attachment API automatically detects file types based on file extension or content type, so you don't need to specify whether something is an image, audio file, PDF, or text document - RubyLLM figures it out for you!
The attachment API automatically detects file types based on file extension or content type, so you don't need to specify whether something is an image, video, audio file, PDF, or text document - RubyLLM figures it out for you!
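
With video now in the detection list, the same call pattern should cover video uploads as well; a sketch that assumes `chat_record` and the upload parameter exist in your app:

```ruby
# Hypothetical sources; RubyLLM infers the type from the extension or content type.
chat_record.ask("What happens in this screen recording?", with: "tmp/demo_walkthrough.mp4")
chat_record.ask("Summarize the uploaded clip", with: params[:uploaded_video])
```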

## Handling Persistence Edge Cases

8 changes: 5 additions & 3 deletions docs/index.md
@@ -69,11 +69,12 @@ RubyLLM fixes all that. One beautiful API for everything. One consistent format.

```ruby
# Just ask questions
chat = RubyLLM.chat
chat = RubyLLM.chat(model: "gemini-2.0-flash")
chat.ask "What's the best way to learn Ruby?"

# Analyze images, audio, documents, and text files
# Analyze images, videos, audio, documents, and text files
chat.ask "What's in this image?", with: "ruby_conf.jpg"
chat.ask "What's happening in this video?", with: "presentation.mp4"
chat.ask "Describe this meeting", with: "meeting.wav"
chat.ask "Summarize this document", with: "contract.pdf"
chat.ask "Explain this code", with: "app.rb"
@@ -114,7 +115,8 @@ chat.with_tool(Weather).ask "What's the weather in Berlin? (52.5200, 13.4050)"
## Core Capabilities

* 💬 **Unified Chat:** Converse with models from OpenAI, Anthropic, Gemini, Bedrock, OpenRouter, DeepSeek, Ollama, or any OpenAI-compatible API using `RubyLLM.chat`.
* 👁️ **Vision:** Analyze images within chats.
* 👁️ **Vision:** Analyze images and documents within chats.
* 🎞️ **Video:** Analyze videos within chats.
* 🔊 **Audio:** Transcribe and understand audio content.
* 📄 **Document Analysis:** Extract information from PDFs, text files, and other documents.
* 🖼️ **Image Generation:** Create images with `RubyLLM.paint`.
5 changes: 5 additions & 0 deletions lib/ruby_llm/attachment.rb
@@ -67,6 +67,7 @@ def encoded

def type
return :image if image?
return :video if video?
return :audio if audio?
return :pdf if pdf?
return :text if text?
@@ -78,6 +79,10 @@ def image?
RubyLLM::MimeType.image? mime_type
end

def video?
RubyLLM::MimeType.video? mime_type
end

def audio?
RubyLLM::MimeType.audio? mime_type
end
4 changes: 4 additions & 0 deletions lib/ruby_llm/mime_type.rb
@@ -15,6 +15,10 @@ def image?(type)
type.start_with?('image/')
end

def video?(type)
type.start_with?('video/')
end

def audio?(type)
type.start_with?('audio/')
end
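
A quick illustration of the new predicate and how `Attachment#type` relies on it; the MIME strings are just examples:

```ruby
# The predicate is a simple prefix check on the MIME string.
RubyLLM::MimeType.video?('video/mp4')   # => true
RubyLLM::MimeType.video?('audio/mpeg')  # => false

# So an attachment whose mime_type is "video/mp4" resolves to :video
# in Attachment#type before the audio, pdf, and text checks run.
```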
5 changes: 5 additions & 0 deletions lib/ruby_llm/model/info.rb
@@ -9,6 +9,7 @@ module Model
# Example:
# model = RubyLLM.models.find('gpt-4')
# model.supports_vision? # => true
# model.supports_video? # => false
# model.supports_functions? # => true
# model.input_price_per_million # => 30.0
class Info
@@ -54,6 +55,10 @@ def supports_vision?
modalities.input.include?('image')
end

def supports_video?
modalities.input.include?('video')
end

def supports_functions?
function_calling?
end
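
A small usage sketch, mirroring the spec added further down, for filtering the registry by the new flag:

```ruby
# List the ids of every registered model that declares video input support.
video_models = RubyLLM.models.select(&:supports_video?)
puts video_models.map(&:id)
```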
12 changes: 11 additions & 1 deletion lib/ruby_llm/providers/gemini/capabilities.rb
@@ -61,7 +61,7 @@ def output_price_for(model_id)
context_window_for(model_id) > 128_000 ? base_price * 2 : base_price
end

# Determines if the model supports vision (image/video) inputs
# Determines if the model supports vision (image/document) inputs
# @param model_id [String] the model identifier
# @return [Boolean] true if the model supports vision inputs
def supports_vision?(model_id)
@@ -70,6 +70,13 @@
model_id.match?(/gemini|flash|pro|imagen/)
end

# Determines if the model supports video inputs
# @param model_id [String] the model identifier
# @return [Boolean] true if the model supports video inputs
def supports_video?(model_id)
model_id.match?(/gemini/)
end

# Determines if the model supports function calling
# @param model_id [String] the model identifier
# @return [Boolean] true if the model supports function calling
@@ -274,6 +281,9 @@ def modalities_for(model_id)
modalities[:input] << 'pdf'
end

# Video support
modalities[:input] << 'video' if supports_video?(model_id)

# Audio support
modalities[:input] << 'audio' if model_id.match?(/audio/)

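
The new Gemini predicate is just a regex match on the model id, so any id containing "gemini" gains the `video` input modality; a plain-Ruby illustration with example ids:

```ruby
# Illustration of the match used by supports_video?; the ids are examples.
'gemini-2.0-flash'.match?(/gemini/)     # => true  -> 'video' added to input modalities
'imagen-3.0-generate'.match?(/gemini/)  # => false -> no video modality
```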
Binary file added spec/fixtures/ruby.mp4
Binary file not shown.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

31 changes: 31 additions & 0 deletions spec/ruby_llm/chat_content_spec.rb
@@ -6,11 +6,13 @@
include_context 'with configured RubyLLM'

let(:image_path) { File.expand_path('../fixtures/ruby.png', __dir__) }
let(:video_path) { File.expand_path('../fixtures/ruby.mp4', __dir__) }
let(:audio_path) { File.expand_path('../fixtures/ruby.wav', __dir__) }
let(:pdf_path) { File.expand_path('../fixtures/sample.pdf', __dir__) }
let(:text_path) { File.expand_path('../fixtures/ruby.txt', __dir__) }
let(:xml_path) { File.expand_path('../fixtures/ruby.xml', __dir__) }
let(:image_url) { 'https://upload.wikimedia.org/wikipedia/commons/f/f1/Ruby_logo.png' }
let(:video_url) { 'https://filesamples.com/samples/video/mp4/sample_640x360.mp4' }
let(:audio_url) { 'https://commons.wikimedia.org/wiki/File:LL-Q1860_(eng)-AcpoKrane-ruby.wav' }
let(:pdf_url) { 'https://pdfobject.com/pdf/sample.pdf' }
let(:text_url) { 'https://www.ruby-lang.org/en/about/license.txt' }
@@ -95,6 +97,35 @@
end
end

describe 'video models' do # rubocop:disable RSpec/MultipleMemoizedHelpers
VIDEO_MODELS.each do |model_info|
provider = model_info[:provider]
model = model_info[:model]

it "#{provider}/#{model} can understand local videos" do # rubocop:disable RSpec/MultipleExpectations,RSpec/ExampleLength
chat = RubyLLM.chat(model: model, provider: provider)
response = chat.ask('What do you see in this video?', with: { video: video_path })

expect(response.content).to be_present
expect(response.content).not_to include('RubyLLM::Content')
expect(chat.messages.first.content).to be_a(RubyLLM::Content)
expect(chat.messages.first.content.attachments.first.filename).to eq('ruby.mp4')
expect(chat.messages.first.content.attachments.first.mime_type).to eq('video/mp4')
end

it "#{provider}/#{model} can understand remote videos without extension" do # rubocop:disable RSpec/MultipleExpectations,RSpec/ExampleLength
chat = RubyLLM.chat(model: model, provider: provider)
response = chat.ask('What do you see in this video?', with: video_url)

expect(response.content).to be_present
expect(response.content).not_to include('RubyLLM::Content')
expect(chat.messages.first.content).to be_a(RubyLLM::Content)
expect(chat.messages.first.content.attachments.first.filename).to eq('sample_640x360.mp4')
expect(chat.messages.first.content.attachments.first.mime_type).to eq('video/mp4')
end
end
end

describe 'audio models' do # rubocop:disable RSpec/MultipleMemoizedHelpers
AUDIO_MODELS.each do |model_info|
model = model_info[:model]
10 changes: 8 additions & 2 deletions spec/ruby_llm/models_spec.rb
@@ -24,18 +24,24 @@
expect(openai_chat_models.map(&:id).sort).to eq(chat_openai_models.map(&:id).sort)
end

it 'supports Enumerable methods' do # rubocop:disable RSpec/MultipleExpectations
it 'supports Enumerable methods' do
# Count models by provider
provider_counts = RubyLLM.models.group_by(&:provider)
.transform_values(&:count)

# There should be models from at least OpenAI and Anthropic
expect(provider_counts.keys).to include('openai', 'anthropic')
end

# Select only models with vision support
it 'filters by vision support' do
vision_models = RubyLLM.models.select(&:supports_vision?)
expect(vision_models).to all(have_attributes(supports_vision?: true))
end

it 'filters by video support' do
video_models = RubyLLM.models.select(&:supports_video?)
expect(video_models).to all(have_attributes(supports_video?: true))
end
end

describe 'finding models' do
4 changes: 4 additions & 0 deletions spec/spec_helper.rb
@@ -157,6 +157,10 @@
{ provider: :ollama, model: 'qwen3' }
].freeze

VIDEO_MODELS = [
{ provider: :gemini, model: 'gemini-2.0-flash' }
].freeze

AUDIO_MODELS = [
{ provider: :openai, model: 'gpt-4o-mini-audio-preview' }
].freeze