
vita-epfl/AVG


✨ Anchored Video Generation (AVG)

Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

Mariam Hassan* · Bastien Van Delft* · Wuyang Li · Alexandre Alahi
VITA @ EPFL
*Equal contribution


📌 Overview

Demo video: grid_side_by_side.mp4

State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. We introduce Anchored Video Generation (AVG), a pipeline that decouples T2V generation into three specialized stages:

  1. Reasoning: an LLM rewrites the prompt into a description of the initial scene.
  2. Composition: a text-to-image (T2I) model generates a high-quality anchor frame from that description.
  3. Temporal Synthesis: a video model animates the anchored scene.
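
Since the code is not yet released, the following is only a minimal sketch of how the three stages above could be chained. It is not the authors' implementation: `AnchoredPipeline`, its three callables, and the stub models are hypothetical names introduced for illustration, and passing the original prompt to the video stage is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AnchoredPipeline:
    """Hypothetical wrapper chaining the three stages; not the released API."""

    rewrite_prompt: Callable[[str], str]      # Stage 1: LLM -> initial-scene description
    generate_anchor: Callable[[str], object]  # Stage 2: T2I model -> anchor frame
    animate: Callable[[object, str], list]    # Stage 3: video model -> list of frames

    def __call__(self, prompt: str) -> list:
        scene = self.rewrite_prompt(prompt)    # describe only the first frame
        anchor = self.generate_anchor(scene)   # compose the scene once, as an image
        # Assumption: the video model is conditioned on the anchor frame
        # and on the original prompt, which carries the temporal instruction.
        return self.animate(anchor, prompt)


# Stub stages standing in for real models, so the sketch runs end to end.
def stub_llm(prompt: str) -> str:
    return f"initial scene: {prompt}"


def stub_t2i(scene: str) -> dict:
    return {"frame": 0, "scene": scene}


def stub_video(anchor: dict, prompt: str, num_frames: int = 4) -> list:
    # Copy the anchor for every frame, changing only the frame index.
    return [dict(anchor, frame=i) for i in range(num_frames)]


pipeline = AnchoredPipeline(stub_llm, stub_t2i, stub_video)
video = pipeline("a cat jumps onto a table")
```

Keeping each stage behind a plain callable means any LLM, T2I model, or image-conditioned video model could be swapped in without touching the others, which is the decoupling the overview describes.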

AVG improves performance on T2V benchmarks and enables faster sampling via visual anchoring.

🚧 Code

Code is coming soon. We will release training, inference, and evaluation scripts.

Citation

@misc{hassan2025factorized,
  title        = {Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models},
  author       = {Hassan, Mariam and Van Delft, Bastien and Li, Wuyang and Alahi, Alexandre},
  year         = {2025},
  eprint       = {2512.16371},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  doi          = {10.48550/arXiv.2512.16371}
}
