
Question: Multi-frame VL training OOM with Megatron + GRPO (Qwen3-VL-8B) #7326

@LinYuOu

Description

Hi all, sorry to bother you.

I’m currently experimenting with multi-frame visual input training (max_frames=180) using Megatron + GRPO, with the Qwen3-VL-8B model.

Hardware setup

  • 8 × 80GB GPUs
  • bf16 training

Issue

  • GPU memory runs out very quickly
  • Training is stable only up to ~32 frames
  • Increasing max_frames beyond that leads to immediate OOM (often before backward)
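
For context, here is a rough back-of-the-envelope estimate of the visual token count at my MAX_PIXELS setting. The per-token pixel budget is an assumption borrowed from the Qwen2-VL convention (about 28x28 pixels per visual token after spatial merging); Qwen3-VL may use a different patch/merge size, so treat the numbers as order-of-magnitude only.

# Back-of-the-envelope only; PIXELS_PER_TOKEN=784 (28x28) is an assumption
# taken from the Qwen2-VL patch/merge convention, not a verified Qwen3-VL figure.
MAX_PIXELS=501760
PIXELS_PER_TOKEN=784
TOKENS_PER_FRAME=$((MAX_PIXELS / PIXELS_PER_TOKEN))              # ~640
echo "32 frames  -> ~$(( 32 * TOKENS_PER_FRAME)) visual tokens"  # ~20k
echo "180 frames -> ~$((180 * TOKENS_PER_FRAME)) visual tokens"  # ~115k

If that estimate is even roughly right, 180 frames alone is several times my --max_length of 16384, which would be consistent with memory blowing up well before backward.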

Questions

In multi-frame VL (video / multi-image) training, how are CP / SP / TP / PP typically combined in practice to scale to high frame counts?
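For concreteness, the kind of combination I have in mind (untested, values are guesses) is to trade PP for CP so the long visual token sequence gets sharded across ranks, keeping the rest of the script below unchanged:

# Untested sketch, not a known-good recipe: e.g. on 8 GPUs this gives
# TP=2 x CP=4 (PP=1), so each rank holds roughly 1/4 of the sequence.
--context_parallel_size 4 \
--tensor_model_parallel_size 2 \
--pipeline_model_parallel_size 1 \
--sequence_parallel true \

In particular, I'm not sure whether CP here also shards the vision-encoder part of the sequence, or only the language-model tokens.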

Train scripts

LOG_FILE=log_megatron_grpo.txt
# CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
MAX_PIXELS=501760 \
NPROC_PER_NODE=${GPUS_PER_NODE:-4} \
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
megatron rlhf \
    --rlhf_type grpo \
    --model Qwen3-VL-8B-instruct \
    --load_safetensors true \
    --save_safetensors true \
    --save_interval 100 \
    --context_parallel_size 1 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2 \
    --dataset Alert_Dataset \
    --max_epochs 1 \
    --global_batch_size 32 \
    --micro_batch_size 1 \
    --steps_per_generation 4 \
    --num_generations 16 \
    --reward_funcs Alert_Reward \
    --external_plugins plugin.py \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.3 \
    --vllm_tensor_parallel_size 4 \
    --vllm_max_model_len 18384 \
    --max_length 16384 \
    --max_completion_length 124 \
    --train_type lora \
    --lora_rank 32 \
    --lora_alpha 128 \
    --lr 5e-5 \
    --bf16 true \
    --beta 0.00 \
    --dynamic_sample false \
    --overlong_filter true \
    --loss_type grpo \
    --sleep_level 2 \
    --offload_model true \
    --offload_optimizer true \
    --log_interval 1 \
    --recompute_granularity selective \
    --finetune \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim \
    --no_save_rng \
    --attention_backend flash \
    --temperature 1.0 \
    --padding_free true \
    --sequence_parallel true \
    --log_completions true \
    2>&1 | tee ${LOG_FILE}

Any insights, configs, or references would be greatly appreciated. Thanks!
