This repository provides the dataset and code proposed in the paper, along with prediction examples generated by AirScape.
We present an 11k embodied aerial agent video dataset with corresponding motion-intention annotations, aligning the inputs and outputs of the aerial world model. The dataset is diverse across multiple dimensions, including actions, areas, scenes, and tasks, as illustrated by the examples below. It is available for download on Hugging Face: https://huggingface.co/datasets/EmbodiedCity/AirScape-Dataset.
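A minimal sketch of pulling the dataset from the Hugging Face Hub with `huggingface_hub`; the local directory name is illustrative.

```python
# Minimal sketch: download the AirScape dataset from the Hugging Face Hub.
# Requires `pip install huggingface_hub`; the target directory is illustrative.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="EmbodiedCity/AirScape-Dataset",
    repo_type="dataset",
    local_dir="./AirScape-Dataset",  # illustrative local path
)
print(f"Dataset downloaded to: {local_dir}")
```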
AirScape takes the current observation and a motion intention as input and outputs future sequences of embodied observations (videos). Below are examples of videos generated on the test set.
To turn video generation foundation models into controllable world models that adhere to physical spatio-temporal constraints, AirScape adopts a two-phase training framework:
- Phase 1: Learning Intention Controllability
  - The foundation model is fine-tuned on the proposed 11k video-intention paired dataset, giving it an initial understanding of aerial motion intentions and the corresponding generation capability.
- Phase 2: Learning Spatio-Temporal Constraints
  - A self-play training mechanism is introduced: the Phase 1 model generates synthetic data, and a spatio-temporal discriminator performs rejection sampling on the generated videos, evaluating four critical attributes: intention alignment, temporal continuity, dynamic degree, and spatial rationality (a sketch of this step follows the list).
  - SFT is then conducted on the retained high-quality synthetic data so that the generated videos adhere to physical spatio-temporal constraints and unrealistic outcomes are suppressed.
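A minimal sketch of the Phase 2 rejection-sampling step described above; the scoring function, threshold, and data layout are illustrative placeholders for the spatio-temporal discriminator, not the repository's implementation.

```python
# Illustrative sketch of Phase-2 rejection sampling (not the repository's implementation).
# The scorer below is a placeholder for the spatio-temporal discriminator's four attribute checks.
from typing import Dict, List


def score_video(video, intention: str) -> Dict[str, float]:
    """Placeholder: in AirScape these scores come from the spatio-temporal discriminator."""
    return {
        "intention_alignment": 0.90,   # does the motion match the stated intention?
        "temporal_continuity": 0.80,   # are consecutive frames coherent?
        "dynamic_degree": 0.70,        # is there sufficient camera/scene motion?
        "spatial_rationality": 0.85,   # is the 3D layout physically plausible?
    }


def rejection_sample(candidates: List[dict], threshold: float = 0.75) -> List[dict]:
    """Keep only synthetic (intention, video) pairs whose lowest attribute score passes the bar."""
    accepted = []
    for sample in candidates:
        scores = score_video(sample["video"], sample["intention"])
        if min(scores.values()) >= threshold:  # every attribute must pass (threshold is illustrative)
            accepted.append(sample)
    return accepted
```

The accepted pairs then serve as the SFT data for the second training phase.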
Open `phase1` for more details of Phase 1 training.
Open `phase2` for more details of Phase 2 training.
If you find our work helpful, please cite our paper and star 🌟 this repository.
```bibtex
@inproceedings{zhao2025airscape,
  author    = {Baining Zhao and Rongze Tang and Mingyuan Jia and Ziyou Wang and Fanhang Man and Xin Zhang and Yu Shang and Weichen Zhang and Wei Wu and Chen Gao and Xinlei Chen and Yong Li},
  title     = {AirScape: An Aerial Generative World Model with Motion Controllability},
  booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)},
  year      = {2025},
  month     = {October},
  pages     = {1--10},
  location  = {Dublin, Ireland},
  publisher = {ACM},
  address   = {New York, NY, USA},
  doi       = {10.1145/3746027.3758180},
  url       = {https://doi.org/10.1145/3746027.3758180}
}
```