[CVPR 2026] VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
Figure 1. Comparison between VDE and standard 50-step sampling across Flux, Qwen-Image, and Wan2.1. VDE achieves comparable visual quality with dramatically reduced runtime (up to 3.01× speedup).
Although Rectified Flow (RF) models have achieved remarkable performance in visual generation, their practical deployment is hindered by slow inference. Previous training-free acceleration methods typically follow a caching-and-reusing paradigm, which neglects the growing mismatch between static cached values and evolving inputs.
We propose Velocity Decomposition and Estimation (VDE), a novel method that shifts the paradigm from caching-and-reusing to decomposing-and-estimating.
- VDE decomposes the model's velocity output into components parallel and orthogonal to the input.
- It exploits the temporal predictability of the components' coefficients and the consistency of the orthogonal direction.
- VDE periodically anchors the model's state and precisely estimates subsequent outputs analytically in an inherently input-adaptive manner.
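The three points above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the names `decompose_velocity`, `estimate_velocity`, and `sample_with_anchors` are hypothetical, the fixed `anchor_every` interval is an assumed schedule, and the coefficients are simply held constant between anchors, whereas VDE exploits their temporal predictability.

```python
import numpy as np


def decompose_velocity(x, v, eps=1e-12):
    """Split v into parts parallel and orthogonal to x.

    Returns (alpha, beta, u) with v ~= alpha * x + beta * u,
    where u has unit norm and is orthogonal to x.
    """
    x_flat, v_flat = x.ravel(), v.ravel()
    # Coefficient of the component of v that lies along x.
    alpha = float(v_flat @ x_flat) / (float(x_flat @ x_flat) + eps)
    residual = v - alpha * x
    # Magnitude and unit direction of the orthogonal component.
    beta = float(np.linalg.norm(residual))
    u = residual / (beta + eps)
    return alpha, beta, u


def estimate_velocity(x_now, alpha, beta, u):
    """Assumed estimator: anchor coefficients and direction, current input.

    The parallel term follows x_now, which is what makes the
    estimate input-adaptive rather than a static cached value.
    """
    return alpha * x_now + beta * u


def sample_with_anchors(model, x, timesteps, anchor_every=3):
    """Euler sampler that runs `model` only on anchor steps (assumed schedule)."""
    model_calls = 0
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        if i % anchor_every == 0:
            v = model(x, t)  # full forward pass at an anchor step
            alpha, beta, u = decompose_velocity(x, v)
            model_calls += 1
        else:
            v = estimate_velocity(x, alpha, beta, u)  # cheap analytic estimate
        x = x + (t_next - t) * v  # Euler update along the flow
    return x, model_calls
```

With 50 sampling steps and `anchor_every=3`, this toy loop calls the model 17 times; every other step costs only a few elementwise operations.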
VDE achieves 2.04× to 3.22× acceleration with minimal loss in visual quality. In image generation, it improves on the best cache-based baseline by 19.5% in SSIM and 30.3% in PSNR while reducing LPIPS by 55.4%.
- [2026/06/07] 🗓️ VDE will be presented at CVPR 2026: Sun, Jun 7, 2026, 3:30 PM – 5:30 PM MDT, ExHall A 162.
- [2026/05/31] 📄 VDE is available on CVF Open Access.
- [2026/05/30] 🚀 The code for VDE is officially released! Supports image and video generation/editing.
- [2026/05/22] 📄 VDE is available on arXiv.
- [2026/02/21] 🎉 VDE has been accepted to CVPR 2026!
VDE is highly versatile and supports a wide range of state-of-the-art Rectified Flow models across modalities:
🎨 Image Generation
🎥 Video Generation
🧊 3D Generation
Baseline Latency (T=50): 8.20 s
| Method | Speedup ↑ | Latency ↓ | Steps ↓ | SSIM ↑ | PSNR ↑ | LPIPS ↓ | CLIP ↑ | ImageReward ↑ |
|---|---|---|---|---|---|---|---|---|
| VDE-fast | 3.01× | 2.72 s | 16 | 0.8267 | 23.19 | 0.1997 | 0.3109 | 0.969 |
| VDE-medium | 2.70× | 3.04 s | 18 | 0.8499 | 24.02 | 0.1679 | 0.3102 | 0.973 |
| VDE-slow | 2.21× | 3.70 s | 22 | 0.8877 | 25.81 | 0.1243 | 0.3095 | 0.978 |
Baseline Latency (T=50): 12.53 s
| Method | Speedup ↑ | Latency ↓ | Steps ↓ | SSIM ↑ | PSNR ↑ | LPIPS ↓ | CLIP ↑ | ImageReward ↑ |
|---|---|---|---|---|---|---|---|---|
| VDE-fast | 2.70× | 4.64 s | 18 | 0.8967 | 25.46 | 0.1096 | 0.3163 | 1.287 |
| VDE-slow | 2.04× | 6.14 s | 24 | 0.9362 | 28.58 | 0.0691 | 0.3159 | 1.295 |
Baseline Latency (T=50, 81 frames, 832×480): 175.35 s
| Method | Speedup ↑ | Latency ↓ | Steps ↓ | SSIM ↑ | PSNR ↑ | LPIPS ↓ | VBench (%) ↑ |
|---|---|---|---|---|---|---|---|
| VDE-fast | 2.50× | 70.11 s | 20 | 0.8658 | 24.69 | 0.0754 | 80.43 |
| VDE-slow | 2.08× | 84.18 s | 24 | 0.8902 | 25.92 | 0.0554 | 80.32 |
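The Speedup column in the three tables above appears to be the ratio of the 50-step baseline latency to each method's latency. The short check below recomputes that ratio from the reported latencies; the `speedup` helper and the row labels are illustrative names, not part of the codebase.

```python
# (baseline latency in s, method latency in s, reported speedup), copied from the tables above
rows = {
    "table 1, VDE-fast": (8.20, 2.72, 3.01),
    "table 1, VDE-medium": (8.20, 3.04, 2.70),
    "table 1, VDE-slow": (8.20, 3.70, 2.21),
    "table 2, VDE-fast": (12.53, 4.64, 2.70),
    "table 2, VDE-slow": (12.53, 6.14, 2.04),
    "table 3, VDE-fast": (175.35, 70.11, 2.50),
    "table 3, VDE-slow": (175.35, 84.18, 2.08),
}


def speedup(baseline_s, method_s):
    """Speedup as baseline latency divided by method latency."""
    return baseline_s / method_s


for name, (baseline_s, method_s, reported) in rows.items():
    ratio = speedup(baseline_s, method_s)
    print(f"{name}: computed {ratio:.3f}x, reported {reported:.2f}x")
```

Each recomputed ratio agrees with the reported value to within 0.01, which is consistent with the latencies having been rounded to two decimals.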
- Release the core VDE algorithm and paper.
- Support Text-to-Image (FLUX, Qwen, Z-Image, HiDream).
- Support Text-to-Video (Wan2.1, HunyuanVideo, Open-Sora).
- Release ComfyUI Custom Nodes.
- Upstream PR to Hugging Face diffusers.
This project builds upon several excellent open-source projects, including Diffusers, FLUX, Qwen-Image, Z-Image, Wan2.1, and HunyuanVideo. We sincerely thank the authors for their contributions to the community.
This project is licensed under the Apache License 2.0.
If you find VDE useful for your research or applications, please consider giving us a star ⭐ and citing our paper:
@inproceedings{tan2026vde,
title={VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation},
author={Tan, Junwen and Liang, Jinglin and Chen, Hongyuan and Huang, Shuangping},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={37918--37928},
year={2026}
}