Image to image with gemini-2.0-flash-preview-image-generation #248

Open · wants to merge 46 commits into base: main

Commits (46)

f3136e3
Support modalities for gemini-2.0-flash-preview-image-generation
tpaulshippy Jun 14, 2025
6e3128c
Extract images from chat response
tpaulshippy Jun 14, 2025
28cf942
Rubocop
tpaulshippy Jun 14, 2025
357ea8a
Set modalities from capabilities
tpaulshippy Jun 14, 2025
0fed244
Merge branch 'main' into image-to-image
tpaulshippy Jul 20, 2025
9a1eeb8
Attach output image to message content
tpaulshippy Jul 20, 2025
1f60caa
Update comment
tpaulshippy Jul 20, 2025
98097d4
Refine image in conversation
tpaulshippy Jul 20, 2025
1b58d43
Merge branch 'main' into image-to-image
tpaulshippy Jul 24, 2025
d14d1e9
Remove duplicate SEO tags from docs
crmne Jul 28, 2025
460108c
Updated models
crmne Jul 28, 2025
730a8c8
fix: add missing blank lines for improved readability in generator an…
crmne Jul 28, 2025
1c50176
fix: Rails integration with_context now works without global config
crmne Jul 30, 2025
2afc6b2
Anthropic: Fix system prompt (use plain text instead of serialized JS…
MichaelHoste Jul 30, 2025
af0ead4
Provide access to raw response object from Faraday (#304)
tpaulshippy Jul 30, 2025
98dabdb
Add Chat#on_tool_call callback (#299)
bryan-ash Jul 30, 2025
cbb4276
Added proper handling of streaming error responses across both Farada…
dansingerman Jul 30, 2025
8626a77
Add message ordering guidance to Rails docs (#288)
crmne Jul 30, 2025
b6095a5
Bump version to 1.4.0 and update VCR cassettes
crmne Jul 30, 2025
8b0809b
Update model pricing and capabilities in JSON configuration
crmne Jul 30, 2025
598b584
Fix Action Cable capitalization in Rails guide
crmne Jul 31, 2025
20ae7a5
Update README and docs with comprehensive feature list
crmne Jul 31, 2025
7595a4e
Update Rails guide with instant message display pattern
crmne Jul 31, 2025
8f0ba07
Add Perplexity provider support
crmne Jul 31, 2025
7842a0b
Move available models guide to top-level navigation
crmne Jul 31, 2025
fe9d9d9
Fix broken links to available-models guide after relocation
crmne Jul 31, 2025
b6f9c13
Add Mistral AI provider support
crmne Jul 31, 2025
81f0a8c
Update specs to disable additional RuboCop checks for multi-turn conv…
crmne Jul 31, 2025
c5e059a
docs: add mistral provider
crmne Jul 31, 2025
9ae1018
reorder providers alphabetically
crmne Jul 31, 2025
2336483
Bust cache of gem version badge in README
crmne Jul 31, 2025
bf8c096
Fix Rails generator migration order and PostgreSQL detection
crmne Jul 31, 2025
2ff42aa
Removed unnecessary rubocop disable comments after last commit
crmne Jul 31, 2025
5a19a41
Fix Mistral models created_at timestamps
crmne Jul 31, 2025
a405e9e
Version bump to 1.5.0
crmne Jul 31, 2025
07444e5
Fix model capabilities format and imagen output modality
crmne Aug 1, 2025
a6dcd40
Automatically generate appraisal gemfiles
crmne Aug 1, 2025
b3b4684
Update JRuby version in CI matrix to jruby-10.0.1.0
crmne Aug 1, 2025
98f0cd1
Bump version to 1.5.1
crmne Aug 1, 2025
db1d563
Bust cache for gem badge in README
crmne Aug 1, 2025
43afe2f
Bust cache again for gem badge
crmne Aug 1, 2025
4f7a163
Wire up on_tool_call when using acts_as_chat rails integration (#318)
agarcher Aug 1, 2025
f291744
Resolve rubocop offenses
tpaulshippy Aug 3, 2025
76c7714
Merge branch 'main' into image-to-image
tpaulshippy Aug 3, 2025
84a939f
Update guides
tpaulshippy Aug 3, 2025
5c2c5d2
Merge branch 'main' into image-to-image
tpaulshippy Aug 7, 2025

Files changed

30 changes: 29 additions & 1 deletion docs/guides/chat.md
@@ -125,7 +125,7 @@ Modern AI models can often process more than just text. RubyLLM provides a unifi

### Working with Images

Provide image paths or URLs to vision-capable models (like `gpt-4o`, `claude-3-opus`, `gemini-1.5-pro`).
Provide image paths or URLs to vision-capable models (like `gpt-4o`, `claude-3-opus`, `gemini-1.5-pro`) for analysis and understanding. Some specialized models can also generate and edit images.

```ruby
# Ensure you select a vision-capable model
@@ -146,6 +146,34 @@ puts response.content

RubyLLM handles converting the image source into the format required by the specific provider API.

### Image Generation with Chat

While most vision models analyze images, some specialized models can generate and edit images through the chat interface. This approach is ideal for image editing workflows and iterative refinement:

```ruby
# Use a model capable of image generation
chat = RubyLLM.chat(model: 'gemini-2.0-flash-preview-image-generation')

# Edit an existing image
response = chat.ask('make this look more futuristic', with: 'current_design.png')

# Access generated images from attachments
if response.content.attachments.any?
generated_image = response.content.attachments.first.image
puts "Generated image: #{generated_image.mime_type}"

# Save the generated image
generated_image.save('futuristic_design.png')
end

# Continue refining in the same conversation
response = chat.ask('add some neon lighting effects')
refined_image = response.content.attachments.first.image
refined_image.save('futuristic_with_neon.png')
```

For simple text-to-image generation without existing images, see the [Image Generation Guide]({% link guides/image-generation.md %}).

### Working with Audio

Provide audio file paths to audio-capable models (like `gpt-4o-audio-preview`).
84 changes: 80 additions & 4 deletions docs/guides/image-generation.md
@@ -24,6 +24,8 @@ Turn your wildest imagination into reality! 🎨 Create professional artwork, pr
After reading this guide, you will know:

* How to generate images from text prompts.
* How to edit and modify existing images.
* How to refine images through multi-turn conversations.
* How to select different image generation models.
* How to specify image sizes (for supported models).
* How to access and save generated image data (URL or Base64).
@@ -98,6 +100,75 @@ end

Refer to the [Working with Models Guide]({% link guides/models.md %}) and the [Available Models Guide]({% link available-models.md %}) to find image models.

## Image Editing & Modification

Beyond generating images from text prompts, you can also edit and modify existing images using capable models like `gemini-2.0-flash-preview-image-generation`. This approach uses the chat interface rather than the `paint` method.

### Basic Image Editing

Use the chat interface with image generation models to edit existing images:

```ruby
# Start a chat with an image generation model
chat = RubyLLM.chat(model: 'gemini-2.0-flash-preview-image-generation')

# Edit an existing image
response = chat.ask('put this in a ring', with: 'path/to/ruby.png')

# Access the generated image from the response
image = response.content.attachments.first.image

# Check image properties
puts "Generated image: #{image.mime_type}"
puts "Base64 encoded: #{image.base64?}"
puts "Data size: ~#{image.data.length} bytes" if image.base64?

# Save the edited image
saved_path = image.save('ruby_with_ring.png')
puts "Saved to: #{saved_path}"
```

### Multi-turn Image Refinement

One of the powerful features of using the chat interface is the ability to refine generated images through conversation:

```ruby
chat = RubyLLM.chat(model: 'gemini-2.0-flash-preview-image-generation')

# First edit - add a ring to the ruby image
chat.ask('put this in a ring', with: 'path/to/ruby.png')

# Refine the result in the same conversation
response = chat.ask('change the background to blue')

# The model will modify the previously generated image
refined_image = response.content.attachments.first.image
refined_image.save('ruby_ring_blue_background.png')

# Continue refining
response = chat.ask('make the ring more ornate and golden')
final_image = response.content.attachments.first.image
final_image.save('ruby_ornate_golden_ring.png')
```

### Chat vs Paint Methods

RubyLLM provides two approaches for image generation, contrasted in the sketch after the lists below:

- **`RubyLLM.paint`**: Best for simple text-to-image generation from scratch
- **`RubyLLM.chat` with image models**: Best for image editing, refinement, and complex workflows

Use the chat interface for:
- Editing existing images
- Multi-turn image refinement and iteration
- Complex image generation workflows
- When you need conversation context and memory

Use the paint method for:
- Simple text-to-image generation
- One-off image creation
- When you don't need conversation context
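
A minimal sketch contrasting the two approaches; the prompts and file names below are illustrative:

```ruby
# One-off text-to-image: paint returns a RubyLLM::Image directly.
owl = RubyLLM.paint('A steampunk mechanical owl')
owl.save('owl.png')

# Editing and iterating: chat keeps conversation context between turns.
chat = RubyLLM.chat(model: 'gemini-2.0-flash-preview-image-generation')
chat.ask('put this in a ring', with: 'path/to/ruby.png')
response = chat.ask('now make the band look antique and hammered')

response.content.attachments.first.image.save('ruby_ring_antique.png')
```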

## Image Sizes

Some models, like DALL-E 3, allow you to specify the desired image dimensions via the `size:` argument.
@@ -124,7 +195,7 @@ Not all models support size customization. If a size is specified for a model th

## Working with Generated Images

The `RubyLLM::Image` object provides access to the generated image data and metadata.
The `RubyLLM::Image` object provides access to the generated image data and metadata, whether the image was created using `RubyLLM.paint` or retrieved from a chat response.

### Accessing Image Data

@@ -138,10 +209,15 @@ The `RubyLLM::Image` object provides access to the generated image data and meta
The `save` method works regardless of whether the image was delivered via URL or Base64. It fetches the data if necessary and writes it to the specified file path.

```ruby
# Generate an image (works for DALL-E or Imagen)
# Generate an image using paint method (works for DALL-E or Imagen)
image = RubyLLM.paint("A steampunk mechanical owl")

# Save the image to a local file
# Or get an image from a chat response
# chat = RubyLLM.chat(model: 'gemini-2.0-flash-preview-image-generation')
# response = chat.ask("Create a steampunk mechanical owl")
# image = response.content.attachments.first.image

# Save the image to a local file (works the same for both methods)
begin
saved_path = image.save("steampunk_owl.png")
puts "Image saved to #{saved_path}"
@@ -280,6 +356,6 @@ Image generation can take several seconds (typically 5-20 seconds depending on t

## Next Steps

* [Chatting with AI Models]({% link guides/chat.md %}): Learn about conversational AI.
* [Chatting with AI Models]({% link guides/chat.md %}): Learn about conversational AI and using chat for advanced image workflows.
* [Embeddings]({% link guides/embeddings.md %}): Explore text vector representations.
* [Error Handling]({% link guides/error-handling.md %}): Master handling API errors.
5 changes: 5 additions & 0 deletions lib/ruby_llm/content.rb
@@ -19,6 +19,11 @@ def add_attachment(source, filename: nil)
self
end

def attach(attachment)
@attachments << attachment
self
end

def format
if @text && @attachments.empty?
@text
19 changes: 19 additions & 0 deletions lib/ruby_llm/image_attachment.rb
@@ -0,0 +1,19 @@
# frozen_string_literal: true

module RubyLLM
# A class representing a file attachment that is an image generated by an LLM.
class ImageAttachment < Attachment
attr_reader :image, :content

def initialize(data:, mime_type:, model_id:)
super(nil, filename: nil)
@image = Image.new(data:, mime_type:, model_id:)
@content = Base64.strict_decode64(data)
@mime_type = mime_type
end

def image?
true
end
end
end
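
For orientation, a hedged usage sketch tying this class to the `Content#attach` method added above; the image file path, text, and model id are placeholders, not taken from the PR:

```ruby
require 'base64'

# Build an attachment from Base64-encoded image bytes (placeholder file).
attachment = RubyLLM::ImageAttachment.new(
  data: Base64.strict_encode64(File.binread('edited.png')),
  mime_type: 'image/png',
  model_id: 'gemini-2.0-flash-preview-image-generation'
)

# Attach it to a Content object, as the Gemini streaming change below does.
content = RubyLLM::Content.new('Here is the edited image')
content.attach(attachment)

attachment.image?           # => true
attachment.image.mime_type  # => "image/png"
attachment.image.save('edited_copy.png')
```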
3 changes: 3 additions & 0 deletions lib/ruby_llm/providers/gemini/capabilities.rb
@@ -280,6 +280,9 @@ def modalities_for(model_id)
# Embedding output
modalities[:output] << 'embeddings' if model_id.match?(/embedding|gemini-embedding/)

# Image output
modalities[:output] << 'image' if model_id.match?(/image-generation/)

# Image output for imagen models
modalities[:output] = ['image'] if model_id.match?(/imagen/)

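A hedged illustration of the effect of this change; the module path is inferred from the file location, and the hash contents other than the `image` entries are assumptions:

```ruby
# Illustrative return values; everything except 'image' is an assumption.
caps = RubyLLM::Providers::Gemini::Capabilities

caps.modalities_for('gemini-2.0-flash-preview-image-generation')
# => { input: [...], output: ['text', 'image'] }

caps.modalities_for('imagen-3.0-generate-002') # example Imagen model id
# => { input: [...], output: ['image'] }
```
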
3 changes: 2 additions & 1 deletion lib/ruby_llm/providers/gemini/chat.rb
@@ -16,7 +16,8 @@ def render_payload(messages, tools:, temperature:, model:, stream: false, schema
payload = {
contents: format_messages(messages),
generationConfig: {
temperature: temperature
temperature: temperature,
responseModalities: capabilities.modalities_for(model)[:output]
}
}

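A hedged sketch of the request payload this produces for an image-capable model; the conversation content and temperature are placeholders, and only `responseModalities` mirrors the change above:

```ruby
# Illustrative payload; contents actually come from format_messages(messages).
payload = {
  contents: [
    { role: 'user', parts: [{ text: 'put this in a ring' }] } # image parts omitted
  ],
  generationConfig: {
    temperature: 0.7,                       # placeholder value
    responseModalities: ['text', 'image']   # capabilities.modalities_for(model)[:output]
  }
}
```
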
16 changes: 15 additions & 1 deletion lib/ruby_llm/providers/gemini/streaming.rb
@@ -34,7 +34,21 @@ def extract_content(data)
return nil unless parts

text_parts = parts.select { |p| p['text'] }
text_parts.map { |p| p['text'] }.join if text_parts.any?
image_parts = parts.select { |p| p['inlineData'] }

content = RubyLLM::Content.new(text_parts.map { |p| p['text'] }.join)

image_parts.map do |p|
content.attach(
ImageAttachment.new(
data: p['inlineData']['data'],
mime_type: p['inlineData']['mimeType'],
model_id: data['modelVersion']
)
)
end

content
end

def extract_input_tokens(data)
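
For reference, a hypothetical streaming chunk of the shape `extract_content` parses; the keys under `parts` and `modelVersion` come from the code above, while the outer `candidates`/`content` nesting follows the Gemini API response format and all values are placeholders:

```ruby
# Hypothetical Gemini streaming chunk (placeholder values).
data = {
  'modelVersion' => 'gemini-2.0-flash-preview-image-generation',
  'candidates' => [
    {
      'content' => {
        'parts' => [
          { 'text' => 'Here is your ruby set in a ring.' },
          { 'inlineData' => { 'mimeType' => 'image/png', 'data' => '<base64-encoded PNG>' } }
        ]
      }
    }
  ]
}
# extract_content(data) would return a RubyLLM::Content whose text is the joined
# text parts and whose attachments include a single ImageAttachment.
```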

Two large diffs are not rendered by default.

66 changes: 66 additions & 0 deletions spec/ruby_llm/image_to_image_spec.rb
@@ -0,0 +1,66 @@
# frozen_string_literal: true

require 'spec_helper'
require 'tempfile'

def save_and_verify_image(image)
# Create a temp file to save to
temp_file = Tempfile.new(['image', '.png'])
temp_path = temp_file.path
temp_file.close

begin
saved_path = image.save(temp_path)
expect(saved_path).to eq(temp_path)
expect(File.exist?(temp_path)).to be true

file_size = File.size(temp_path)
expect(file_size).to be > 1000 # Any real image should be larger than 1KB
ensure
# Clean up
File.delete(temp_path)
end
end

RSpec.describe RubyLLM::Image do
include_context 'with configured RubyLLM'

describe 'basic functionality' do
it 'gemini/gemini-2.0-flash-preview-image-generation can paint images' do
chat = RubyLLM.chat(model: 'gemini-2.0-flash-preview-image-generation')
response = chat.ask('put this in a ring', with: 'spec/fixtures/ruby.png')

expect(response.content.text).to include('ruby')

expect(response.content.attachments).to be_an(Array)
expect(response.content.attachments).not_to be_empty

image = response.content.attachments.first.image

expect(image.base64?).to be(true)
expect(image.data).to be_present
expect(image.mime_type).to include('image')

save_and_verify_image image
end

it 'gemini/gemini-2.0-flash-preview-image-generation can refine images in a conversation' do
chat = RubyLLM.chat(model: 'gemini-2.0-flash-preview-image-generation')
chat.ask('put this in a ring', with: 'spec/fixtures/ruby.png')
response = chat.ask('change the background to blue')

expect(response.content.text).to include('ruby')

expect(response.content.attachments).to be_an(Array)
expect(response.content.attachments).not_to be_empty

image = response.content.attachments.first.image

expect(image.base64?).to be(true)
expect(image.data).to be_present
expect(image.mime_type).to include('image')

save_and_verify_image image
end
end
end