
Add video input file support #260


Open · wants to merge 5 commits into base: main
8 changes: 5 additions & 3 deletions README.md
@@ -43,11 +43,12 @@ RubyLLM fixes all that. One beautiful API for everything. One consistent format.

```ruby
# Just ask questions
chat = RubyLLM.chat
chat = RubyLLM.chat(model: "gemini-2.0-flash")
chat.ask "What's the best way to learn Ruby?"

# Analyze images, audio, documents, and text files
# Analyze images, videos, audio, documents, and text files
chat.ask "What's in this image?", with: "ruby_conf.jpg"
chat.ask "What's happening in this video?", with: "presentation.mp4"
chat.ask "Describe this meeting", with: "meeting.wav"
chat.ask "Summarize this document", with: "contract.pdf"
chat.ask "Explain this code", with: "app.rb"
@@ -88,7 +89,8 @@ chat.with_tool(Weather).ask "What's the weather in Berlin? (52.5200, 13.4050)"
## Core Capabilities

* 💬 **Unified Chat:** Converse with models from OpenAI, Anthropic, Gemini, Bedrock, OpenRouter, DeepSeek, Ollama, or any OpenAI-compatible API using `RubyLLM.chat`.
* 👁️ **Vision:** Analyze images within chats.
* 👁️ **Vision:** Analyze images and documents within chats.
* 🎞️ **Video:** Analyze videos within chats.
* 🔊 **Audio:** Transcribe and understand audio content.
* 📄 **Document Analysis:** Extract information from PDFs, text files, and other documents.
* 🖼️ **Image Generation:** Create images with `RubyLLM.paint`.
27 changes: 26 additions & 1 deletion docs/guides/chat.md
@@ -119,7 +119,7 @@ RubyLLM manages a registry of known models and their capabilities. For detailed

## Multi-modal Conversations

Modern AI models can often process more than just text. RubyLLM provides a unified way to include images, audio, text files, and PDFs in your chat messages using the `with:` option in the `ask` method.
Modern AI models can often process more than just text. RubyLLM provides a unified way to include images, videos, audio, text files, and PDFs in your chat messages using the `with:` option in the `ask` method.

### Working with Images

@@ -144,6 +144,30 @@ puts response.content

RubyLLM handles converting the image source into the format required by the specific provider API.

### Working with Videos

You can also analyze local video files or video URLs with models that accept video input. RubyLLM automatically detects video files and handles them appropriately.

```ruby
# Ask about a local video file
chat = RubyLLM.chat(model: 'gemini-2.0-flash')
response = chat.ask "What happens in this video?", with: "path/to/demo.mp4"
puts response.content

# Ask about a video from a URL
response = chat.ask "Summarize the main events in this video.", with: "https://example.com/demo_video.mp4"
puts response.content

# Combine videos with other file types
response = chat.ask "Analyze these files for visual content.", with: ["diagram.png", "demo.mp4", "notes.txt"]
puts response.content
```

**Notes:**
- Supported video formats include .mp4, .mov, .avi, .webm, and others (provider-dependent).
- Only Google Gemini models currently support video input; check the [Available Models Guide]({% link guides/available-models.md %}) for details.
- Large video files may be subject to size or duration limits imposed by the provider.
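
If you want to guard against sending a video to a model that cannot handle it, you can check the model registry first. A minimal sketch (the model id and file path are illustrative):

```ruby
# Guard before attaching a video; model id and path are only examples.
model = RubyLLM.models.find('gemini-2.0-flash')

if model.supports_video?
  chat = RubyLLM.chat(model: model.id)
  puts chat.ask("What happens in this video?", with: "path/to/demo.mp4").content
else
  puts "#{model.id} does not accept video input"
end
```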

### Working with Audio

Provide audio file paths to audio-capable models (like `gpt-4o-audio-preview`).
@@ -224,6 +248,7 @@ response = chat.ask "What's in this image?", with: { image: "photo.jpg" }

**Supported file types:**
- **Images:** .jpg, .jpeg, .png, .gif, .webp, .bmp
- **Videos:** .mp4, .mov, .avi, .webm
- **Audio:** .mp3, .wav, .m4a, .ogg, .flac
- **Documents:** .pdf, .txt, .md, .csv, .json, .xml
- **Code:** .rb, .py, .js, .html, .css (and many others)
2 changes: 1 addition & 1 deletion docs/guides/models.md
@@ -41,7 +41,7 @@ The registry stores crucial information about each model, including:
* **`name`**: A human-friendly name.
* **`context_window`**: Max input tokens (e.g., `128_000`).
* **`max_tokens`**: Max output tokens (e.g., `16_384`).
* **`supports_vision`**: If it can process images.
* **`supports_vision`**: If it can process images and videos.
* **`supports_functions`**: If it can use [Tools]({% link guides/tools.md %}).
* **`input_price_per_million`**: Cost in USD per 1 million input tokens.
* **`output_price_per_million`**: Cost in USD per 1 million output tokens.
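
To make the registry fields listed above concrete, here is a hedged sketch of reading them for one model; the accessor names mirror the field names and the `Model::Info` example later in this diff, and the values in the comments are invented:

```ruby
# Illustrative only: accessor names assumed from the field list, values invented.
model = RubyLLM.models.find('gemini-2.0-flash')

model.name                      # => "Gemini 2.0 Flash" (human-friendly name)
model.context_window            # => max input tokens, e.g. 1_000_000
model.max_tokens                # => max output tokens
model.supports_vision?          # => true if it can process images
model.supports_functions?       # => true if it can use tools
model.input_price_per_million   # => USD per 1 million input tokens
```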
4 changes: 2 additions & 2 deletions docs/guides/rails.md
@@ -117,7 +117,7 @@ Run the migrations: `rails db:migrate`

### ActiveStorage Setup for Attachments (Optional)

If you want to use attachments (images, audio, PDFs) with your AI chats, you need to set up ActiveStorage:
If you want to use attachments (images, videos, audio, PDFs) with your AI chats, you need to set up ActiveStorage:

```bash
# Only needed if you plan to use attachments
@@ -314,7 +314,7 @@ chat_record.ask("Analyze this file", with: params[:uploaded_file])
chat_record.ask("What's in this document?", with: user.profile_document)
```

The attachment API automatically detects file types based on file extension or content type, so you don't need to specify whether something is an image, audio file, PDF, or text document - RubyLLM figures it out for you!
The attachment API automatically detects file types based on file extension or content type, so you don't need to specify whether something is an image, video, audio file, PDF, or text document - RubyLLM figures it out for you!
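
With video now in the detection list, the same call pattern should cover video uploads as well; a sketch that assumes `chat_record` and the upload parameter exist in your app:

```ruby
# Hypothetical sources; RubyLLM infers the type from the extension or content type.
chat_record.ask("What happens in this screen recording?", with: "tmp/demo_walkthrough.mp4")
chat_record.ask("Summarize the uploaded clip", with: params[:uploaded_video])
```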

## Handling Persistence Edge Cases

8 changes: 5 additions & 3 deletions docs/index.md
@@ -69,11 +69,12 @@ RubyLLM fixes all that. One beautiful API for everything. One consistent format.

```ruby
# Just ask questions
chat = RubyLLM.chat
chat = RubyLLM.chat(model: "gemini-2.0-flash")
chat.ask "What's the best way to learn Ruby?"

# Analyze images, audio, documents, and text files
# Analyze images, videos, audio, documents, and text files
chat.ask "What's in this image?", with: "ruby_conf.jpg"
chat.ask "What's happening in this video?", with: "presentation.mp4"
chat.ask "Describe this meeting", with: "meeting.wav"
chat.ask "Summarize this document", with: "contract.pdf"
chat.ask "Explain this code", with: "app.rb"
@@ -114,7 +115,8 @@ chat.with_tool(Weather).ask "What's the weather in Berlin? (52.5200, 13.4050)"
## Core Capabilities

* 💬 **Unified Chat:** Converse with models from OpenAI, Anthropic, Gemini, Bedrock, OpenRouter, DeepSeek, Ollama, or any OpenAI-compatible API using `RubyLLM.chat`.
* 👁️ **Vision:** Analyze images within chats.
* 👁️ **Vision:** Analyze images and documents within chats.
* 🎞️ **Video:** Analyze videos within chats.
* 🔊 **Audio:** Transcribe and understand audio content.
* 📄 **Document Analysis:** Extract information from PDFs, text files, and other documents.
* 🖼️ **Image Generation:** Create images with `RubyLLM.paint`.
5 changes: 5 additions & 0 deletions lib/ruby_llm/attachment.rb
@@ -67,6 +67,7 @@ def encoded

def type
return :image if image?
return :video if video?
return :audio if audio?
return :pdf if pdf?
return :text if text?
@@ -78,6 +79,10 @@ def image?
RubyLLM::MimeType.image? mime_type
end

def video?
RubyLLM::MimeType.video? mime_type
end

def audio?
RubyLLM::MimeType.audio? mime_type
end
4 changes: 4 additions & 0 deletions lib/ruby_llm/mime_type.rb
@@ -15,6 +15,10 @@ def image?(type)
type.start_with?('image/')
end

def video?(type)
type.start_with?('video/')
end

def audio?(type)
type.start_with?('audio/')
end
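
A quick illustration of the new predicate and how `Attachment#type` relies on it; the MIME strings are just examples:

```ruby
# The predicate is a simple prefix check on the MIME string.
RubyLLM::MimeType.video?('video/mp4')   # => true
RubyLLM::MimeType.video?('audio/mpeg')  # => false

# So an attachment whose mime_type is "video/mp4" resolves to :video
# in Attachment#type before the audio, pdf, and text checks run.
```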
5 changes: 5 additions & 0 deletions lib/ruby_llm/model/info.rb
@@ -9,6 +9,7 @@ module Model
# Example:
# model = RubyLLM.models.find('gpt-4')
# model.supports_vision? # => true
# model.supports_video? # => false
# model.supports_functions? # => true
# model.input_price_per_million # => 30.0
class Info
@@ -54,6 +55,10 @@ def supports_vision?
modalities.input.include?('image')
end

def supports_video?
modalities.input.include?('video')
end

def supports_functions?
function_calling?
end
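
A small usage sketch, mirroring the spec added further down, for filtering the registry by the new flag:

```ruby
# List the ids of every registered model that declares video input support.
video_models = RubyLLM.models.select(&:supports_video?)
puts video_models.map(&:id)
```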
12 changes: 11 additions & 1 deletion lib/ruby_llm/providers/gemini/capabilities.rb
@@ -61,7 +61,7 @@ def output_price_for(model_id)
context_window_for(model_id) > 128_000 ? base_price * 2 : base_price
end

# Determines if the model supports vision (image/video) inputs
# Determines if the model supports vision (image/document) inputs
# @param model_id [String] the model identifier
# @return [Boolean] true if the model supports vision inputs
def supports_vision?(model_id)
@@ -70,6 +70,13 @@
model_id.match?(/gemini|flash|pro|imagen/)
end

# Determines if the model supports video inputs
# @param model_id [String] the model identifier
# @return [Boolean] true if the model supports video inputs
def supports_video?(model_id)
model_id.match?(/gemini/)
end

# Determines if the model supports function calling
# @param model_id [String] the model identifier
# @return [Boolean] true if the model supports function calling
@@ -274,6 +281,9 @@ def modalities_for(model_id)
modalities[:input] << 'pdf'
end

# Video support
modalities[:input] << 'video' if supports_video?(model_id)

# Audio support
modalities[:input] << 'audio' if model_id.match?(/audio/)

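
The new Gemini predicate is just a regex match on the model id, so any id containing "gemini" gains the `video` input modality; a plain-Ruby illustration with example ids:

```ruby
# Illustration of the match used by supports_video?; the ids are examples.
'gemini-2.0-flash'.match?(/gemini/)     # => true  -> 'video' added to input modalities
'imagen-3.0-generate'.match?(/gemini/)  # => false -> no video modality
```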
Binary file added spec/fixtures/ruby.mp4
Binary file not shown.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

31 changes: 31 additions & 0 deletions spec/ruby_llm/chat_content_spec.rb
@@ -6,11 +6,13 @@
include_context 'with configured RubyLLM'

let(:image_path) { File.expand_path('../fixtures/ruby.png', __dir__) }
let(:video_path) { File.expand_path('../fixtures/ruby.mp4', __dir__) }
let(:audio_path) { File.expand_path('../fixtures/ruby.wav', __dir__) }
let(:pdf_path) { File.expand_path('../fixtures/sample.pdf', __dir__) }
let(:text_path) { File.expand_path('../fixtures/ruby.txt', __dir__) }
let(:xml_path) { File.expand_path('../fixtures/ruby.xml', __dir__) }
let(:image_url) { 'https://upload.wikimedia.org/wikipedia/commons/f/f1/Ruby_logo.png' }
let(:video_url) { 'https://filesamples.com/samples/video/mp4/sample_640x360.mp4' }
let(:audio_url) { 'https://commons.wikimedia.org/wiki/File:LL-Q1860_(eng)-AcpoKrane-ruby.wav' }
let(:pdf_url) { 'https://pdfobject.com/pdf/sample.pdf' }
let(:text_url) { 'https://www.ruby-lang.org/en/about/license.txt' }
@@ -95,6 +97,35 @@
end
end

describe 'video models' do # rubocop:disable RSpec/MultipleMemoizedHelpers
VIDEO_MODELS.each do |model_info|
provider = model_info[:provider]
model = model_info[:model]

it "#{provider}/#{model} can understand local videos" do # rubocop:disable RSpec/MultipleExpectations,RSpec/ExampleLength
chat = RubyLLM.chat(model: model, provider: provider)
response = chat.ask('What do you see in this video?', with: { video: video_path })

expect(response.content).to be_present
expect(response.content).not_to include('RubyLLM::Content')
expect(chat.messages.first.content).to be_a(RubyLLM::Content)
expect(chat.messages.first.content.attachments.first.filename).to eq('ruby.mp4')
expect(chat.messages.first.content.attachments.first.mime_type).to eq('video/mp4')
end

it "#{provider}/#{model} can understand remote videos without extension" do # rubocop:disable RSpec/MultipleExpectations,RSpec/ExampleLength
chat = RubyLLM.chat(model: model, provider: provider)
response = chat.ask('What do you see in this video?', with: video_url)

expect(response.content).to be_present
expect(response.content).not_to include('RubyLLM::Content')
expect(chat.messages.first.content).to be_a(RubyLLM::Content)
expect(chat.messages.first.content.attachments.first.filename).to eq('sample_640x360.mp4')
expect(chat.messages.first.content.attachments.first.mime_type).to eq('video/mp4')
end
end
end

describe 'audio models' do # rubocop:disable RSpec/MultipleMemoizedHelpers
AUDIO_MODELS.each do |model_info|
model = model_info[:model]
10 changes: 8 additions & 2 deletions spec/ruby_llm/models_spec.rb
@@ -24,18 +24,24 @@
expect(openai_chat_models.map(&:id).sort).to eq(chat_openai_models.map(&:id).sort)
end

it 'supports Enumerable methods' do # rubocop:disable RSpec/MultipleExpectations
it 'supports Enumerable methods' do
# Count models by provider
provider_counts = RubyLLM.models.group_by(&:provider)
.transform_values(&:count)

# There should be models from at least OpenAI and Anthropic
expect(provider_counts.keys).to include('openai', 'anthropic')
end

# Select only models with vision support
it 'filters by vision support' do
vision_models = RubyLLM.models.select(&:supports_vision?)
expect(vision_models).to all(have_attributes(supports_vision?: true))
end

it 'filters by video support' do
video_models = RubyLLM.models.select(&:supports_video?)
expect(video_models).to all(have_attributes(supports_video?: true))
end
end

describe 'finding models' do
4 changes: 4 additions & 0 deletions spec/spec_helper.rb
@@ -157,6 +157,10 @@
{ provider: :ollama, model: 'qwen3' }
].freeze

VIDEO_MODELS = [
{ provider: :gemini, model: 'gemini-2.0-flash' }
].freeze

AUDIO_MODELS = [
{ provider: :openai, model: 'gpt-4o-mini-audio-preview' }
].freeze