Merged · Changes from 9 commits
3 changes: 2 additions & 1 deletion README.md
@@ -41,6 +41,7 @@ chat.ask "What's the best way to learn Ruby?"
```ruby
# Analyze any file type
chat.ask "What's in this image?", with: "ruby_conf.jpg"
chat.ask "What's happening in this video?", with: "video.mp4"
chat.ask "Describe this meeting", with: "meeting.wav"
chat.ask "Summarize this document", with: "contract.pdf"
chat.ask "Explain this code", with: "app.rb"
@@ -100,7 +101,7 @@ response = chat.with_schema(ProductSchema).ask "Analyze this product", with: "pr
## Features

* **Chat:** Conversational AI with `RubyLLM.chat`
* **Vision:** Analyze images and screenshots
* **Vision:** Analyze images and videos
* **Audio:** Transcribe and understand speech
* **Documents:** Extract from PDFs, CSVs, JSON, any file type
* **Image generation:** Create images with `RubyLLM.paint`
4 changes: 2 additions & 2 deletions docs/_advanced/models.md
@@ -42,7 +42,7 @@ The registry stores crucial information about each model, including:
* **`name`**: A human-friendly name.
* **`context_window`**: Max input tokens (e.g., `128_000`).
* **`max_tokens`**: Max output tokens (e.g., `16_384`).
* **`supports_vision`**: If it can process images.
* **`supports_vision`**: If it can process images and videos.
* **`supports_functions`**: If it can use [Tools]({% link _core_features/tools.md %}).
* **`input_price_per_million`**: Cost in USD per 1 million input tokens.
* **`output_price_per_million`**: Cost in USD per 1 million output tokens.
@@ -323,4 +323,4 @@ image = RubyLLM.paint(
* **Your Responsibility:** Ensure the model ID is correct for the target endpoint.
* **Warning Log:** A warning is logged indicating validation was skipped.

Use these features when the standard registry doesn't cover your specific model or endpoint needs. For standard models, rely on the registry for validation and capability awareness. See the [Chat Guide]({% link _core_features/chat.md %}) for more on using the `chat` object.
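
For standard models, capability lookups come straight off the registry. A minimal sketch (the model ID is an example; the accessor names follow the registry fields listed above):

```ruby
# Look up a model in the registry and inspect its stored capabilities
model = RubyLLM.models.find('gemini-2.5-flash')

model.context_window      # max input tokens
model.supports_vision?    # can it process images and videos?
model.supports_functions? # can it use Tools?
```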
26 changes: 26 additions & 0 deletions docs/_core_features/chat.md
@@ -148,6 +148,31 @@ response = chat.ask "Compare the user interfaces in these two screenshots.", wit
puts response.content
```

### Working with Videos

You can also analyze video files or URLs with vision-capable models. RubyLLM will automatically detect video files and handle them appropriately.

```ruby
# Ask about a local video file
chat = RubyLLM.chat(model: 'gemini-2.5-flash')
response = chat.ask "What happens in this video?", with: "path/to/demo.mp4"
puts response.content

# Ask about a video from a URL
response = chat.ask "Summarize the main events in this video.", with: "https://example.com/demo_video.mp4"
puts response.content

# Combine videos with other file types
response = chat.ask "Analyze these files for visual content.", with: ["diagram.png", "demo.mp4", "notes.txt"]
puts response.content
```

Notes:

* Supported video formats include `.mp4`, `.mov`, `.avi`, `.webm`, and others (provider-dependent).
* Only Google Gemini models currently support video input; check the [Available Models Guide]({% link guides/available-models.md %}) for details.
* Large video files may be subject to size or duration limits imposed by the provider.
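
If you want to fail fast rather than learn mid-request that a model can't take video, here is a minimal sketch using the `supports_video?` flag this PR adds to the model registry (the model ID and fallback behavior are illustrative):

```ruby
model = RubyLLM.models.find('gemini-2.5-flash')

if model.supports_video?
  chat = RubyLLM.chat(model: model.id)
  puts chat.ask("Describe this clip.", with: "demo.mp4").content
else
  warn "#{model.id} does not accept video input"
end
```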

RubyLLM automatically handles image encoding and formatting for each provider's API. Local images are read and encoded as needed, while URLs are passed directly when supported by the provider.

### Working with Audio
@@ -230,6 +255,7 @@ response = chat.ask "What's in this image?", with: { image: "photo.jpg" }

**Supported file types:**
- **Images:** .jpg, .jpeg, .png, .gif, .webp, .bmp
- **Videos:** .mp4, .mov, .avi, .webm
- **Audio:** .mp3, .wav, .m4a, .ogg, .flac
- **Documents:** .pdf, .txt, .md, .csv, .json, .xml
- **Code:** .rb, .py, .js, .html, .css (and many others)
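
When an extension is missing or ambiguous, the hash form shown above can name the type explicitly; the new video specs use `with: { video: ... }` the same way. A short sketch (the filename is a made-up example):

```ruby
# Treat the attachment as video even though the extension doesn't say so
response = chat.ask "What's happening in this clip?", with: { video: "recording.dat" }
```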
1 change: 1 addition & 0 deletions docs/index.md
@@ -102,6 +102,7 @@ chat.ask "What's the best way to learn Ruby?"
```ruby
# Analyze any file type
chat.ask "What's in this image?", with: "ruby_conf.jpg"
chat.ask "What's happening in this video?", with: "video.mp4"
chat.ask "Describe this meeting", with: "meeting.wav"
chat.ask "Summarize this document", with: "contract.pdf"
chat.ask "Explain this code", with: "app.rb"
5 changes: 5 additions & 0 deletions lib/ruby_llm/attachment.rb
@@ -76,6 +76,7 @@ def for_llm

def type
return :image if image?
return :video if video?
return :audio if audio?
return :pdf if pdf?
return :text if text?
@@ -87,6 +88,10 @@ def image?
RubyLLM::MimeType.image? mime_type
end

def video?
RubyLLM::MimeType.video? mime_type
end

def audio?
RubyLLM::MimeType.audio? mime_type
end
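
A sketch of how the new predicate slots into the resolution order above, assuming `Attachment.new` accepts a file path as its source (the constructor usage is not shown in this diff):

```ruby
# Checks run in order: image?, video?, audio?, pdf?, text?
attachment = RubyLLM::Attachment.new('clip.mp4')
attachment.type # => :video, because the detected MIME type is video/mp4
```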
4 changes: 4 additions & 0 deletions lib/ruby_llm/mime_type.rb
@@ -15,6 +15,10 @@ def image?(type)
type.start_with?('image/')
end

def video?(type)
type.start_with?('video/')
end

def audio?(type)
type.start_with?('audio/')
end
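
The new predicate is a plain prefix check on the MIME type string (and `attachment.rb` above calls it as a module method), so its behavior is easy to pin down:

```ruby
RubyLLM::MimeType.video?('video/mp4')  # => true
RubyLLM::MimeType.video?('video/webm') # => true
RubyLLM::MimeType.video?('image/png')  # => false
```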
4 changes: 4 additions & 0 deletions lib/ruby_llm/model/info.rb
@@ -56,6 +56,10 @@ def supports_vision?
modalities.input.include?('image')
end

def supports_video?
modalities.input.include?('video')
end

def supports_functions?
function_calling?
end
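
Because `supports_video?` just reads the model's input modalities, registry-wide filtering follows directly; this mirrors the new assertion added to `models_spec.rb` further down:

```ruby
# Every registered model that declares video among its input modalities
video_models = RubyLLM.models.select(&:supports_video?)
video_models.each { |m| puts m.id }
```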
5 changes: 5 additions & 0 deletions lib/ruby_llm/providers/gemini/capabilities.rb
@@ -52,6 +52,10 @@ def supports_vision?(model_id)
model_id.match?(/gemini|flash|pro|imagen/)
end

def supports_video?(model_id)
model_id.match?(/gemini/)
end

def supports_functions?(model_id)
return false if model_id.match?(/text-embedding|embedding-001|aqa|flash-lite|imagen|gemini-2\.0-flash-lite/)

@@ -217,6 +221,7 @@ def modalities_for(model_id)
modalities[:input] << 'pdf'
end

modalities[:input] << 'video' if supports_video?(model_id)
modalities[:input] << 'audio' if model_id.match?(/audio/)
modalities[:output] << 'embeddings' if model_id.match?(/embedding|gemini-embedding/)
modalities[:output] = ['image'] if model_id.match?(/imagen/)
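
To see what the regex gate implies, a sketch (invoking `Capabilities` directly like this assumes the methods are exposed at module level, which the diff doesn't show):

```ruby
caps = RubyLLM::Providers::Gemini::Capabilities

caps.supports_video?('gemini-2.5-flash') # => true  (matches /gemini/)
caps.supports_video?('imagen-3.0')       # => false, even though supports_vision? matches /imagen/
```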
Binary file added spec/fixtures/ruby.mp4

Large diffs are not rendered by default (4 files).

16 changes: 16 additions & 0 deletions spec/ruby_llm/active_record/acts_as_attachment_spec.rb
@@ -89,6 +89,22 @@ def uploaded_file(path, type)
expect(attachment.type).to eq(:image)
end

it 'handles videos' do
video_path = File.expand_path('../../fixtures/ruby.mp4', __dir__)
chat = Chat.create!(model: model)
message = chat.messages.create!(role: 'user', content: 'Video test')

message.attachments.attach(
io: File.open(video_path),
filename: 'test.mp4',
content_type: 'video/mp4'
)

llm_message = message.to_llm
attachment = llm_message.content.attachments.first
expect(attachment.type).to eq(:video)
end

it 'handles PDFs' do
chat = Chat.create!(model: model)
message = chat.messages.create!(role: 'user', content: 'PDF test')
31 changes: 31 additions & 0 deletions spec/ruby_llm/chat_content_spec.rb
@@ -6,12 +6,14 @@
include_context 'with configured RubyLLM'

let(:image_path) { File.expand_path('../fixtures/ruby.png', __dir__) }
let(:video_path) { File.expand_path('../fixtures/ruby.mp4', __dir__) }
let(:audio_path) { File.expand_path('../fixtures/ruby.wav', __dir__) }
let(:mp3_path) { File.expand_path('../fixtures/ruby.mp3', __dir__) }
let(:pdf_path) { File.expand_path('../fixtures/sample.pdf', __dir__) }
let(:text_path) { File.expand_path('../fixtures/ruby.txt', __dir__) }
let(:xml_path) { File.expand_path('../fixtures/ruby.xml', __dir__) }
let(:image_url) { 'https://upload.wikimedia.org/wikipedia/commons/f/f1/Ruby_logo.png' }
let(:video_url) { 'https://filesamples.com/samples/video/mp4/sample_640x360.mp4' }
let(:audio_url) { 'https://commons.wikimedia.org/wiki/File:LL-Q1860_(eng)-AcpoKrane-ruby.wav' }
let(:pdf_url) { 'https://pdfobject.com/pdf/sample.pdf' }
let(:text_url) { 'https://www.ruby-lang.org/en/about/license.txt' }
@@ -96,6 +98,35 @@
end
end

describe 'video models' do # rubocop:disable RSpec/MultipleMemoizedHelpers
VIDEO_MODELS.each do |model_info|
provider = model_info[:provider]
model = model_info[:model]

it "#{provider}/#{model} can understand local videos" do
chat = RubyLLM.chat(model: model, provider: provider)
response = chat.ask('What do you see in this video?', with: { video: video_path })

expect(response.content).to be_present
expect(response.content).not_to include('RubyLLM::Content')

**Review thread on this expectation:**

**Contributor:** Can we add an expectation that it recognizes the actual content of the video, like at least including the words "woman" and "beach"?

**Author (@altxtech, Sep 12, 2025):** I don't think so, for two reasons:

1. A bit out of scope. No other tests in this spec do this. If we were to make that change, it would be better to do it for all tests for consistency, probably in a separate PR.
2. My understanding is that these are more of a boundary interface test. In other words, we are testing whether the lib correctly interacts with the providers (sends valid requests) and obtains responses. We are not testing the capabilities of the models themselves.

But I'll wait for more comments. If more people think it makes sense, I can add the assertions.

**Contributor:** You're right about it not existing in this spec file. I'm a bit surprised about this, though, as it does exist in the spec for the text models:

```ruby
expect(response.content).to include('4')
expect(first.content).to include('Matz')
expect(followup.content).to include('199')
```

I do like how this makes the specs ensure that the models being used actually accomplish the user's intended purpose. But it's not a showstopper for me.

**Author (@altxtech):** You have a good point, but there is also some common wisdom that a project does not need to test the functionality of its external dependencies. This depends a little on the testing philosophy of this project; I'd prefer to wait for maintainer feedback before making this change.

**Owner:** This is a good question to ponder. There's value in adding the checks for content, as it tests whether the LLM has actually received the file. The only problem is understanding: we don't want to test that. Since I don't think we can separate the two, I'm leaning towards checking the content too. We should probably add similar content checks to the rest of the spec, but perhaps in another PR.

expect(chat.messages.first.content).to be_a(RubyLLM::Content)
expect(chat.messages.first.content.attachments.first.filename).to eq('ruby.mp4')
expect(chat.messages.first.content.attachments.first.mime_type).to eq('video/mp4')
end

it "#{provider}/#{model} can understand remote videos without extension" do
chat = RubyLLM.chat(model: model, provider: provider)
response = chat.ask('What do you see in this video?', with: video_url)

expect(response.content).to be_present
expect(response.content).not_to include('RubyLLM::Content')
expect(chat.messages.first.content).to be_a(RubyLLM::Content)
expect(chat.messages.first.content.attachments.first.filename).to eq('sample_640x360.mp4')
expect(chat.messages.first.content.attachments.first.mime_type).to eq('video/mp4')
end
end
end

describe 'audio models' do # rubocop:disable RSpec/MultipleMemoizedHelpers
AUDIO_MODELS.each do |model_info|
model = model_info[:model]
8 changes: 7 additions & 1 deletion spec/ruby_llm/models_spec.rb
@@ -36,11 +36,17 @@

# There should be models from at least OpenAI and Anthropic
expect(provider_counts.keys).to include('openai', 'anthropic')
end

# Select only models with vision support
it 'filters by vision support' do
vision_models = RubyLLM.models.select(&:supports_vision?)
expect(vision_models).to all(have_attributes(supports_vision?: true))
end

it 'filters by video support' do
video_models = RubyLLM.models.select(&:supports_video?)
expect(video_models).to all(have_attributes(supports_video?: true))
end
end

describe 'finding models' do
5 changes: 5 additions & 0 deletions spec/support/models_to_test.rb
@@ -34,6 +34,11 @@
{ provider: :vertexai, model: 'gemini-2.5-flash' }
].freeze

VIDEO_MODELS = [
{ provider: :gemini, model: 'gemini-2.0-flash' },
{ provider: :gemini, model: 'gemini-2.5-flash' }
].freeze

AUDIO_MODELS = [
{ provider: :openai, model: 'gpt-4o-mini-audio-preview' },
{ provider: :gemini, model: 'gemini-2.5-flash' }