Conversation

@ycma8 commented Sep 15, 2025

What does this PR do?

Added support for Qwen-VL's image_grid_thw in the DPO trainer.

Fixes #4071
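For context, a Qwen2.5-VL-style processor returns a variable number of flattened visual patches per image together with an image_grid_thw tensor recording each image's temporal/height/width grid, and the trainer has to forward both to the model. A minimal sketch of the processor output, using the same tiny test checkpoint as the repro further down (shapes depend on the image size):

import numpy as np
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration")

image = Image.fromarray(np.random.randint(0, 255, (92, 33, 3), dtype=np.uint8))
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=[prompt], images=[image], return_tensors="pt")
# pixel_values: (num_patches, patch_dim) -- flattened patches, no batch dimension
# image_grid_thw: (num_images, 3) -- one [t, h, w] row per image; t * h * w == num_patches
print(inputs["pixel_values"].shape)
print(inputs["image_grid_thw"])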

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@qgallouedec (Member) commented

It seems not to work yet, not sure why:

import numpy as np
from PIL import Image
from datasets import Dataset, features
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForImageTextToText, AutoProcessor

# fmt: off
dataset_dict = {
    "prompt": [
        [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image in great detail."}]}],
        [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Is this bus in the USA?"}]}],
        [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Give a thorough description of the image."}]}],
        [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Who are the people in the image?"}]}],
        [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is written?"}]}],
    ],
    "chosen": [
        [{"role": "assistant", "content": [{"type": "text", "text": "The image features a modern, multi-colored train."}]}],
        [{"role": "assistant", "content": [{"type": "text", "text": "Yes, it can be assumed that this bus is in the USA."}]}],
        [{"role": "assistant", "content": [{"type": "text", "text": "The image features a forest path."}]}],
        [{"role": "assistant", "content": [{"type": "text", "text": "There are two individuals, possibly girls or women."}]}],
        [{"role": "assistant", "content": [{"type": "text", "text": '"ccpb".'}]}],
    ],
    "rejected": [
        [{"role": "assistant", "content": [{"type": "text", "text": "The image features a modern, colorful train."}]}],
        [{"role": "assistant", "content": [{"type": "text", "text": "No, it's not in the USA."}]}],
        [{"role": "assistant", "content": [{"type": "text", "text": "The image features a forest path surrounded by trees."}]}],
        [{"role": "assistant", "content": [{"type": "text", "text": "In the image, there are two individuals."}]}],
        [{"role": "assistant", "content": [{"type": "text", "text": '"ccpb".'}]}],
    ],
    "images": [
        [Image.fromarray(np.random.randint(0, 255, (92, 33, 3), dtype=np.uint8))],
        [Image.fromarray(np.random.randint(0, 255, (64, 48, 3), dtype=np.uint8))],
        [Image.fromarray(np.random.randint(0, 255, (80, 152, 3), dtype=np.uint8))],
        [Image.fromarray(np.random.randint(0, 255, (57, 24, 3), dtype=np.uint8))],
        [Image.fromarray(np.random.randint(0, 255, (102, 48, 3), dtype=np.uint8))],
    ],
}
# fmt: on
dataset = Dataset.from_dict(dataset_dict)
dataset = dataset.cast_column("images", features.Sequence(features.Image()))

# Instantiate the model and processor
model_id = "trl-internal-testing/tiny-Qwen2_5_VLForConditionalGeneration"
model = AutoModelForImageTextToText.from_pretrained(model_id)
ref_model = AutoModelForImageTextToText.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

training_args = DPOConfig(
    output_dir="t",
    per_device_train_batch_size=2,
    remove_unused_columns=False,
    learning_rate=0.01,  # increase learning rate to speed up test
    max_prompt_length=None,  # don't truncate to avoid issues with patch tokens
    max_length=None,
    report_to="none",
)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    processing_class=processor,
    train_dataset=dataset,
    eval_dataset=dataset,
)

# Save the initial weights, so we can check if they have changed after training
previous_trainable_params = {n: param.clone() for n, param in trainer.model.named_parameters()}

trainer.train()

The script above fails.

@ycma8 (Author) commented Sep 25, 2025

> It seems not to work yet, not sure why: [repro script above]

Got it, I’ll check that out

@ycma8 (Author) commented Sep 26, 2025

In Qwen2.5-VL-style models with dynamic visual tokens, pixel_values can have a different length for each sample. Padding them along the batch dimension causes two problems: (1) it breaks the spatial correspondence of the image features (the grid/layout), and (2) it creates a mismatch between the number of visual features and the number of image placeholder tokens. So when image_grid_thw is present, pixel_values should not be padded; they should be kept as the per-sample concatenation of patches, with image_grid_thw describing how the patches map back to each image. For the same reason, pixel_values must not be taken from processed_features["pixel_values"][0] alone: these tensors carry no batch dimension, so indexing [0] keeps only the first row of patches instead of the full tensor.
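As a rough illustration (a sketch only, not the trainer's actual collator; the function name is made up), such vision inputs can be combined by concatenating along the patch dimension instead of padding along a batch dimension:

import torch

def collate_vision_inputs(examples):
    # Illustrative sketch: combine per-sample Qwen2.5-VL vision features without padding.
    # Each example is assumed to hold pixel_values of shape (num_patches_i, patch_dim)
    # and image_grid_thw of shape (num_images_i, 3), as emitted by the processor.
    batch = {}
    if "image_grid_thw" in examples[0]:
        # Dynamic visual tokens: concatenate patches across samples instead of padding,
        # so the grid layout and the patch/placeholder-token counts stay consistent.
        batch["pixel_values"] = torch.cat([ex["pixel_values"] for ex in examples], dim=0)
        batch["image_grid_thw"] = torch.cat([ex["image_grid_thw"] for ex in examples], dim=0)
    elif "pixel_values" in examples[0]:
        # Fixed-size vision features (e.g., LLaVA-style models) can simply be stacked.
        batch["pixel_values"] = torch.stack([ex["pixel_values"] for ex in examples], dim=0)
    return batch

The model can then recover each image's patch span from image_grid_thw, since the product t * h * w gives the number of pixel_values rows that belong to that image.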

@ycma8 (Author) commented Sep 26, 2025

@qgallouedec when you have a moment, could you take a look at this?
