Mariam Hassan* · Bastien Van Delft* · Wuyang Li · Alexandre Alahi
VITA @ EPFL
*Equal contribution
State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. We introduce Factorized Video Generation (FVG), a pipeline that decouples T2V generation into three specialized stages:
- Reasoning (LLM rewrites prompt to an initial-scene description),
- Composition (T2I generates a high-quality anchor frame),
- Temporal Synthesis (video model animates the anchored scene).
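The three stages above can be sketched as a simple sequential pipeline. This is a minimal illustration, not the released implementation: the function name and stage interfaces are assumptions, with each stage passed in as a plain callable that would wrap the actual LLM, T2I, and video models.

```python
# Illustrative sketch of the FVG pipeline (names and interfaces are
# placeholders, not the paper's released API).

def factorized_video_generation(prompt, rewrite, compose, animate):
    """Run the three stages: Reasoning -> Composition -> Temporal Synthesis."""
    # Stage 1 (Reasoning): an LLM rewrites the prompt into a
    # description of the scene's initial state.
    initial_scene = rewrite(prompt)

    # Stage 2 (Composition): a T2I model renders a high-quality
    # anchor frame from the initial-scene description.
    anchor_frame = compose(initial_scene)

    # Stage 3 (Temporal Synthesis): a video model animates the
    # anchored scene, conditioned on the original prompt.
    return animate(anchor_frame, prompt)
```

In practice each callable would invoke a model (e.g. an LLM API for `rewrite`, a diffusion T2I model for `compose`, an image-to-video model for `animate`); the decoupling lets each stage be swapped or upgraded independently.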
FVG improves performance on T2V benchmarks and enables faster sampling via visual anchoring.
Code is coming soon. We will release training, inference, and evaluation scripts.
@misc{hassan2025factorized,
  title         = {Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models},
  author        = {Hassan, Mariam and Van Delft, Bastien and Li, Wuyang and Alahi, Alexandre},
  year          = {2025},
  eprint        = {2512.16371},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2512.16371}
}