VeRL-Omni is a general RL training framework focused on multimodal generative models, built on top of verl.
It originated from the multimodal generation RL effort in verl and now has a dedicated home where it can evolve in a more focused way.
Multimodal generative RL training differs from text-only LLM RL not only in model structure, but also in I/O patterns, compute characteristics, and runtime bottlenecks. As this space grows, it deserves a dedicated training repository that can evolve quickly around its own constraints.
VeRL-Omni targets RL post-training for three families of generative models:
- Diffusion generative models for image, video, and audio — e.g., Qwen-Image, Wan2.2.
- Unified multimodal understanding + generation models — e.g., BAGEL, HunyuanImage-3.0.
- Omni-modality models that jointly handle text, image, audio, and video — e.g., Qwen3-Omni.
Key features:
- Specialized rollout via vLLM-Omni for high-throughput diffusion and multimodal generation.
- Flexible reward pipelines spanning rule-based rewards, model-based rewards, and multimodal reward computation (see the sketch after this list).
- Modular training backends that plug into existing parallelism (FSDP, USP) and other optimizations rather than rebuilding the stack from scratch.
- End-to-end examples and benchmarks validating co-located sync and fully-async RL on the model families above.
- High training throughput: on our reference Qwen-Image FlowGRPO setup, VeRL-Omni achieves ~25% higher end-to-end throughput than the diffusers-based flow_grpo implementation, driven by vLLM-Omni rollout, FSDP training, and overlapped (asynchronous) reward computation.
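To make the reward-pipeline and overlapped-reward points above concrete, here is a minimal sketch of how rule-based and model-based rewards could be combined and dispatched asynchronously so that scoring one batch overlaps with the rollout of the next. The names (rule_based_reward, model_based_reward, AsyncRewardPool) are illustrative assumptions, not VeRL-Omni's actual API; see the documentation for the real interfaces.

```python
# Minimal sketch: combining rule-based and model-based rewards and overlapping
# their computation with rollout. Names here are illustrative assumptions,
# not VeRL-Omni's actual API.
from concurrent.futures import ThreadPoolExecutor


def rule_based_reward(prompt, image):
    """Cheap rule-based check, e.g. penalize missing or malformed generations."""
    return 0.0 if image is None else 1.0


def model_based_reward(prompt, image):
    """Placeholder for a learned scorer (e.g. an aesthetic or text-image
    alignment model) applied to the generated image."""
    return 0.5  # replace with an actual reward-model call


class AsyncRewardPool:
    """Dispatch reward computation to a thread pool so that scoring the
    previous batch overlaps with the rollout of the next one."""

    def __init__(self, max_workers=8):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, prompts, images):
        # Futures are returned immediately; the trainer gathers them after
        # launching the next rollout, hiding reward latency behind generation.
        return [
            self._pool.submit(
                lambda p, im: rule_based_reward(p, im) + model_based_reward(p, im),
                p,
                im,
            )
            for p, im in zip(prompts, images)
        ]


# Usage: scores = [f.result() for f in pool.submit(prompts, images)]
```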
Visit our documentation to learn more.
| Model | Category | Modality | Algorithm | Status |
|---|---|---|---|---|
| Qwen-Image | Diffusion generator | Text → Image | FlowGRPO | ✅ |
| Wan2.2 | Diffusion generator | Text → Video | DanceGRPO | WIP |
| BAGEL | Unified understanding + generation | Text + Image | FlowGRPO | WIP |
| HunyuanImage-3.0 | Unified understanding + generation | Text + Image | MixGRPO/SRPO | Planned |
| Qwen3-Omni-Thinker | Omni-modality | Text / Image / Video / Audio | GRPO | WIP |
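The GRPO-style algorithms listed above share a group-relative advantage step: each prompt is rolled out several times, and every sample's reward is normalized against its own group, so no learned critic is needed. A minimal sketch of that step (not tied to VeRL-Omni's internal tensor layout):

```python
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages from per-rollout scalar rewards.

    rewards: tensor of shape (num_prompts, group_size), one reward per sample.
    Each reward is normalized by the mean and std of its own prompt group,
    removing the need for a learned value function (critic).
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```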
Future work is tracked here:
Contributions are welcome.
See the contribution guide.
VeRL-Omni builds on the engineering foundations developed in verl and is closely aligned with multimodal inference systems such as vLLM-Omni.
If you find the project helpful, please cite:
@misc{verlomni_github,
title = {{VeRL-Omni: Easy, Fast, and Stable RL Training for Diffusion and Omni-Modality Models}},
author = {Yongxiang Huang and Cheung Kawai and Jingan Zhou and Yingshu Chen and {openYuanrong Team} and Xibin Wu},
year = {2026},
howpublished = {\url{https://github.com/verl-project/verl-omni}},
urldate = {2026-04-28}
}