
[ACM MM'25] Code for the paper 'AirScape: An Aerial Generative World Model with Motion Controllability'


EmbodiedCity/AirScape.code


[ACM MM'25] AirScape: An Aerial Generative World Model with Motion Controllability

   

This repository provides the dataset and code from the paper, along with prediction examples generated by AirScape.

📑 Data for Aerial World Model

We present an 11k embodied aerial agent video dataset along with corresponding annotations of motion intention, aligning the inputs and outputs of the aerial world model. The proposed dataset exhibits diversity across multiple dimensions, including actions, areas, scenes, and tasks, as illustrated by the examples below. The dataset is available for download at the HuggingFace link: https://huggingface.co/datasets/EmbodiedCity/AirScape-Dataset.

Action

Video Motion Intention
Translation The drone moves rightward while capturing a video of cars moving along the bridge, keeping the bridge centered in its field of view without obvious gimbal adjustment.
Rotation The drone rotates to the right and maintains a steady altitude and camera angle.
Compound Movement The drone, while maintaining its altitude, continuously flies forward towards the parking lot and adjusts its gimbal downward to an overhead view, capturing a top-down view of the parked cars, and eventually stabilizes above the parking lot.

Area

Video Motion Intention
Roadside The drone moved forward steadily while maintaining altitude. Meanwhile, the drone's gimbal has been adjusted downwards to a 45 degree oblique view, capturing a descending viewpoint of a busy urban road surrounded by buildings and vehicles, and ended its final position directly over the road.
Tourist Attraction The drone follows the red heart-shaped hot air balloon, gradually rotating leftward, maintaining focus on the balloon without camera gimbal adjustments.
Seaside The drone flies forward while keeping the four buildings in the field of view, with no significant altitude change or camera movement.

Scene

Video Motion Intention
Night The drone flies forward while maintaining its focus on the bridge and the cars below, keeping the camera gimbal stable and centered on the scene.
Daytime The drone flies forward while keeping both the Maersk container ship and the tugboat in its field of view, with the camera gimbal slightly adjusting to track them continuously.
Snowy The drone flies upwards and slightly backward while rotating left, keeping a snowmobile in its field of view as it moves around the edge of a forested area on a snow-covered landscape.

Task

Video Motion Intention
Navigation The drone moved forward steadily and turned right slightly, maintaining its altitude, with the camera gimbal slightly tilted downward, capturing a street view with parked cars, trees, buildings, and pedestrians, before coming to a stop over a commercial area.
Tracking The drone follows a boat moving forward along the river, maintaining a steady distance while adjusting its position slightly to the left and aligning the camera to keep the boat centered in the field of view.
Detection The drone flies forward while detecting traffic flow on the road.
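
As a minimal sketch, the dataset linked above could be fetched with the `huggingface_hub` client. The repository ID is taken from the URL in this section; the helper name `download_airscape_dataset` and the default target directory are illustrative assumptions, and no particular file layout inside the dataset is assumed.

```python
from pathlib import Path

# Repository ID taken from the dataset URL given in this README.
REPO_ID = "EmbodiedCity/AirScape-Dataset"


def download_airscape_dataset(local_dir: str = "airscape_dataset") -> Path:
    """Download a snapshot of the dataset repository and return its local path.

    `download_airscape_dataset` and `local_dir` are hypothetical names used
    for illustration; they are not part of the AirScape code base.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the URL is under /datasets/, so this is a dataset repo
        local_dir=local_dir,
    )
    return Path(path)


# Example usage (requires network access and `pip install huggingface_hub`):
#   root = download_airscape_dataset()
#   print(sorted(p.name for p in root.iterdir()))  # inspect the layout first
```

Listing the downloaded directory before writing a loader avoids guessing at how videos and motion-intention annotations are organized.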

🔍 Prediction Outcomes of AirScape

AirScape takes the current observation and a motion intention as input and outputs a sequence of future embodied observations (a video). Below are examples of videos generated on the test set.

Example Prediction Motion Intention
1 The drone moved forward with its camera pointed straight ahead, capturing a stationary view of high-rise buildings, a landscaped garden, and a pond.
2 The drone rotated counterclockwise in place, with its camera gimbal angled downward, and concluded its flight above a courtyard featuring a circular fountain, swimming pools, and surrounding greenery.
3 The drone hovered in place while gradually rotating to the left, ended up facing a broader view of the buildings and the street below.
4 The drone tilts up its camera, moves slightly forward while maintaining a steady view of a fountain plaza and surrounding area, then holds position for an overhead perspective of the scene.
5 No obvious tracking of a target, the drone moving forward along a road while maintaining the gimbal angle, with its final position being farther down the illuminated street.
6 A group of pedestrians moving from left to right along a walkway, while the drone rotates rightward slowly and its camera gimbal adjusts slightly to follow their motion, keeping them centered and visible in the frame.
7 The drone ascends while capturing a night-time view of a road with vehicles moving forward (away from the drone) and brightly lit buildings in the distance, without obvious tracking or significant camera gimbal movements.
8 The drone flies forward, keeping the skyline of the city centered in its field of view.
9 The drone tracks a cargo ship moving forward along the river, while flying to the right and rotating to the left, maintaining the ship in the center of the field of view.
10 The drone flies to the right, maintaining the current altitude and keeping the gimbal angle level.
11 The drone flies to the left while rotating to the right, rotating clockwise about 45 degrees around the pagoda in the field of view, while keeping the pagoda and surrounding structures centered in the frame.
12 The drone flies to the left while rotating to the right, rotating clockwise slowly around the pagoda in the field of view, while keeping the statue centered in its field of view.
13 The drone is moving forward, adjusting the pan tilt angle downwards to track the movement of two agricultural vehicles proceeding forward in tandem, maintaining them centered in its field of view.
14 The drone follows a combine harvester moving forward through a field, keeping the harvester in the center of its field of view while maintaining a steady altitude and camera angle.
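
The input/output contract described at the start of this section (current observation plus motion intention in, future frames out) can be sketched as a small interface. All names here (`WorldModelInput`, `AerialWorldModel`, `RepeatLastFrame`) are hypothetical and are not taken from the AirScape code; frames are represented as nested lists purely to keep the sketch dependency-free.

```python
from dataclasses import dataclass
from typing import List, Protocol

# Placeholder for one image; a real frame would be an array or tensor.
Frame = List[List[int]]


@dataclass
class WorldModelInput:
    observation: List[Frame]  # current observation (one or more frames)
    motion_intention: str     # natural-language motion intention


class AerialWorldModel(Protocol):
    """Anything that maps an observation and an intention to future frames."""

    def predict(self, inputs: WorldModelInput, num_frames: int) -> List[Frame]:
        ...


class RepeatLastFrame:
    """Trivial stand-in that ignores the intention and repeats the last frame.

    It only demonstrates the shape of the interface, not any real prediction.
    """

    def predict(self, inputs: WorldModelInput, num_frames: int) -> List[Frame]:
        last = inputs.observation[-1]
        return [last for _ in range(num_frames)]


model: AerialWorldModel = RepeatLastFrame()
frame = [[0, 0], [0, 255]]
future = model.predict(
    WorldModelInput(observation=[frame], motion_intention="The drone flies forward."),
    num_frames=3,
)
```

A real model would condition the generated frames on `motion_intention`; the stand-in exists only so the interface can be exercised end to end.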

🤖 AirScape Two-Phase Training

To derive controllable world models that adhere to physical spatio-temporal constraints from video generation foundation models, AirScape proposes a two-phase training framework:

  1. Phase 1: Learning Intention Controllability

    • The foundation model is fine-tuned on the proposed 11k video-intention paired dataset, giving it an initial understanding of aerial motion intentions and the ability to generate the corresponding videos.
  2. Phase 2: Learning Spatio-Temporal Constraints

    • A self-play training mechanism is introduced: the model from Phase 1 generates synthetic videos, and a spatio-temporal discriminator performs rejection sampling on them. The discriminator evaluates four critical attributes: intention alignment, temporal continuity, dynamic degree, and spatial rationality.
      Supervised fine-tuning (SFT) is then conducted on the retained high-quality synthetic data so that the generated videos adhere to physical spatio-temporal constraints, suppressing unrealistic outcomes.
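
The rejection-sampling step in Phase 2 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the discriminator returns one score in [0, 1] per attribute and that a sample is kept only when every score clears a single threshold; the names `DiscriminatorScores`, `rejection_sample`, and `score_fn`, and the threshold value, are hypothetical and do not describe the actual AirScape discriminator.

```python
from dataclasses import astuple, dataclass
from typing import Callable, Iterable, List, Tuple, TypeVar

Video = TypeVar("Video")


@dataclass(frozen=True)
class DiscriminatorScores:
    """The four attributes named above; the [0, 1] score range is an assumption."""

    intention_alignment: float
    temporal_continuity: float
    dynamic_degree: float
    spatial_rationality: float

    def passes(self, threshold: float) -> bool:
        # Assumed acceptance rule: every attribute must reach the threshold.
        return all(score >= threshold for score in astuple(self))


def rejection_sample(
    candidates: Iterable[Tuple[str, Video]],
    score_fn: Callable[[str, Video], DiscriminatorScores],
    threshold: float = 0.5,
) -> List[Tuple[str, Video]]:
    """Keep only the (intention, video) pairs that the discriminator accepts."""
    accepted = []
    for intention, video in candidates:
        if score_fn(intention, video).passes(threshold):
            accepted.append((intention, video))
    return accepted


# Toy discriminator: "v2" fails on temporal continuity, "v1" passes everything.
def toy_score_fn(intention: str, video: str) -> DiscriminatorScores:
    continuity = 0.1 if video == "v2" else 0.9
    return DiscriminatorScores(0.9, continuity, 0.9, 0.9)


kept = rejection_sample(
    [("fly forward", "v1"), ("rotate left", "v2")], toy_score_fn
)
```

In this sketch the accepted pairs would form the synthetic set used for the subsequent SFT round; how the real pipeline aggregates the four attributes is not specified here.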

Open phase1 to see more details of phase-1 training.

Open phase2 to see more details of phase-2 training.

Method

Citation

If you find our work helpful, please cite our paper and star 🌟 our repository.

@inproceedings{zhao2025airscape,
  author    = {Baining Zhao and Rongze Tang and Mingyuan Jia and Ziyou Wang and Fanhang Man and Xin Zhang and Yu Shang and Weichen Zhang and Wei Wu and Chen Gao and Xinlei Chen and Yong Li},
  title     = {AirScape: An Aerial Generative World Model with Motion Controllability},
  booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)},
  year      = {2025},
  month     = {October},
  pages     = {1--10},
  publisher = {ACM},
  address   = {New York, NY, USA},
  location  = {Dublin, Ireland},
  doi       = {10.1145/3746027.3758180},
  url       = {https://doi.org/10.1145/3746027.3758180}
}
