
Question: Multi-frame VL training OOM with Megatron + GRPO (Qwen3-VL-8B) #7326

@LinYuOu

Description

Hi all, sorry to bother you.

I’m currently experimenting with multi-frame visual input training (max_frames=180) using Megatron + GRPO, with the Qwen3-VL-8B model.

Hardware setup

  • 8 × 80GB GPUs
  • bf16 training

Issue

  • GPU memory runs out very quickly
  • Training is stable only up to ~32 frames
  • Increasing max_frames beyond that leads to immediate OOM (often before backward)
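
For context, here is a rough back-of-the-envelope estimate of the visual token count at my MAX_PIXELS setting. The per-token pixel budget is an assumption borrowed from the Qwen2-VL convention (about 28x28 pixels per visual token after spatial merging); Qwen3-VL may use a different patch/merge size, so treat the numbers as order-of-magnitude only.

# Back-of-the-envelope only; PIXELS_PER_TOKEN=784 (28x28) is an assumption
# taken from the Qwen2-VL patch/merge convention, not a verified Qwen3-VL figure.
MAX_PIXELS=501760
PIXELS_PER_TOKEN=784
TOKENS_PER_FRAME=$((MAX_PIXELS / PIXELS_PER_TOKEN))              # ~640
echo "32 frames  -> ~$(( 32 * TOKENS_PER_FRAME)) visual tokens"  # ~20k
echo "180 frames -> ~$((180 * TOKENS_PER_FRAME)) visual tokens"  # ~115k

If that estimate is even roughly right, 180 frames alone is several times my --max_length of 16384, which would be consistent with memory blowing up well before backward.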

Questions

In multi-frame VL (video / multi-image) training, how are CP / SP / TP / PP typically combined in practice to scale to high frame counts?
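For concreteness, the kind of combination I have in mind (untested, values are guesses) is to trade PP for CP so the long visual token sequence gets sharded across ranks, keeping the rest of the script below unchanged:

# Untested sketch, not a known-good recipe: e.g. on 8 GPUs this gives
# TP=2 x CP=4 (PP=1), so each rank holds roughly 1/4 of the sequence.
--context_parallel_size 4 \
--tensor_model_parallel_size 2 \
--pipeline_model_parallel_size 1 \
--sequence_parallel true \

In particular, I'm not sure whether CP here also shards the vision-encoder part of the sequence, or only the language-model tokens.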

Train scripts

LOG_FILE=log_megatron_grpo.txt
# CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
MAX_PIXELS=501760 \
NPROC_PER_NODE=${GPUS_PER_NODE:-4} \
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
megatron rlhf \
    --rlhf_type grpo \
    --model Qwen3-VL-8B-instruct \
    --load_safetensors true \
    --save_safetensors true \
    --save_interval 100 \
    --context_parallel_size 1 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2 \
    --dataset Alert_Dataset \
    --max_epochs 1 \
    --global_batch_size 32 \
    --micro_batch_size 1 \
    --steps_per_generation 4 \
    --num_generations 16 \
    --reward_funcs Alert_Reward \
    --external_plugins plugin.py \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.3 \
    --vllm_tensor_parallel_size 4 \
    --vllm_max_model_len 18384 \
    --max_length 16384 \
    --max_completion_length 124 \
    --train_type lora \
    --lora_rank 32 \
    --lora_alpha 128 \
    --lr 5e-5 \
    --bf16 true \
    --beta 0.00 \
    --dynamic_sample false \
    --overlong_filter true \
    --loss_type grpo \
    --sleep_level 2 \
    --offload_model true \
    --offload_optimizer true \
    --log_interval 1 \
    --recompute_granularity selective \
    --finetune \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim \
    --no_save_rng \
    --attention_backend flash \
    --temperature 1.0 \
    --padding_free true \
    --sequence_parallel true \
    --log_completions true \
    2>&1 | tee ${LOG_FILE}

Any insights, configs, or references would be greatly appreciated. Thanks!
