@pedrohenriqueamartins

What does this PR do?

Add video support for Qwen3VLModel, enabling training with video inputs alongside images.

Changes

  • Remove blocking assertions for video inputs
  • Handle pixel_values_videos and video_grid_thw parameters
  • Concatenate image and video vision data when both are present
  • Split vision embeddings by video_start_index for proper masking
  • Combine image_mask | video_mask for deepstack processing
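
The image/video merging steps listed above can be sketched roughly as follows. This is a minimal illustration in plain Python lists rather than tensors, not the code from this PR: the helper names (`merge_vision_inputs`, `split_vision_embeds`, `combined_vision_mask`) and the token IDs are hypothetical, and the sketch assumes images are placed before videos and that the vision encoder returns one embedding per input patch.

```python
from typing import List, Optional, Sequence, Tuple

Patch = List[float]


def merge_vision_inputs(
    image_patches: Optional[List[Patch]],
    video_patches: Optional[List[Patch]],
) -> Tuple[List[Patch], int]:
    """Concatenate image patches followed by video patches.

    Returns the merged list and video_start_index, the offset at which
    video patches begin (assumption: images always come first).
    """
    image_patches = image_patches or []
    video_patches = video_patches or []
    video_start_index = len(image_patches)
    return image_patches + video_patches, video_start_index


def split_vision_embeds(
    vision_embeds: List[Patch], video_start_index: int
) -> Tuple[List[Patch], List[Patch]]:
    """Split encoder output back into image and video embeddings."""
    return vision_embeds[:video_start_index], vision_embeds[video_start_index:]


def combined_vision_mask(
    input_ids: Sequence[int], image_token_id: int, video_token_id: int
) -> List[bool]:
    """Element-wise image_mask | video_mask over the token sequence."""
    image_mask = [tok == image_token_id for tok in input_ids]
    video_mask = [tok == video_token_id for tok in input_ids]
    return [i or v for i, v in zip(image_mask, video_mask)]


# Hypothetical placeholder token IDs, for illustration only.
IMAGE_TOKEN_ID = 100
VIDEO_TOKEN_ID = 101

merged, start = merge_vision_inputs([[1.0], [2.0]], [[3.0]])
image_embeds, video_embeds = split_vision_embeds(merged, start)
vision_mask = combined_vision_mask(
    [5, IMAGE_TOKEN_ID, VIDEO_TOKEN_ID, 5], IMAGE_TOKEN_ID, VIDEO_TOKEN_ID
)
```

Keeping a single concatenated sequence means the vision encoder runs once per batch, while `video_start_index` lets each modality's embeddings be scattered into its own placeholder positions afterwards.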

Testing

  • Successfully trained models with video data on NeMo-RL using the Megatron backend.

@copy-pr-bot

copy-pr-bot bot commented Jan 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33
Contributor

@pedrohenriqueamartins: thanks for the contribution. We are working on something similar and might cherry-pick your change directly. We will also need to verify it in the mbridge SFT pipeline before this can be merged.
