The benchmark is designed to evaluate whether Multimodal Large Language Models (MLLMs) can process multi-UAV collaborative visual data for question answering, covering perception, reasoning, and decision-making in perception degradation scenarios.
- Paper: https://arxiv.org/pdf/2511.11025
- Project: https://embodiedcity.github.io/AirCopBench/
- Dataset: https://huggingface.co/datasets/EasonFan/AirCopBench/tree/main
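To fetch the dataset locally, here is a minimal sketch using the `huggingface_hub` client (the repo id comes from the dataset link above; assumes `pip install huggingface_hub`):

```python
# Minimal sketch: download the AirCopBench dataset snapshot from Hugging Face.
# Repo id taken from the dataset link above; requires huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="EasonFan/AirCopBench", repo_type="dataset")
print(local_dir)  # local path of the downloaded snapshot
```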
- 🎉 Our paper has been accepted by AAAI 2026!
- ✅ All datasets, code, and supplementary material released
- ✅ Unified question generation pipeline for the 14 subtasks
- ✅ One-click integration script for interactive VQA generation
- 💡 Central Contribution: The first comprehensive benchmark focusing on multi-UAV collaborative embodied perception and reasoning, including over 2.9k multi-view images and over 14.6k Visual Question Answering (VQA) pairs.
- 🌍 Varied Data Sources: simulator data (with 3, 5, and 6 observing UAVs), real-world data (with 2 observing UAVs), and derived data for noisy and data-loss scenarios.
- 📚 Rich Task Definition: 4 main task dimensions and 14 subtasks covering collaborative perception, understanding, and reasoning.
- 🌫️ Varied Perception Degradations: occlusion, shadow, lighting imbalance, long distance, out of FoV, noise, data loss, and motion blur.
- 🎯 Diverse Target Types: vehicles, drones, pedestrians, and bicycles.
- 🧩 Multiple Modalities: RGB images, text, and point clouds.
AirCopBench encompasses four core task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision. These are further divided into 14 subtasks, enabling a more granular assessment of collaborative perception and reasoning capabilities.
- Scene Understanding (SU)
  - scene_description: interpret and understand scenes from images
  - scene_comparison: compare different scenes
  - observing_posture: analyze camera posture
- Object Understanding (OU)
  - object_recognition: identify objects in images
  - object_counting: count the number of objects
  - object_grounding: locate objects in images
  - object_matching: match objects across different views
- Perception Assessment (PA)
  - quality_assessment: evaluate image quality
  - usability_assessment: assess usefulness for the target perception task
  - causal_assessment: reason about the causes of perception degradation
- Collaborative Decision (CD)
  - when_to_collaborate: identify when collaboration is essential for the current UAV (temporal decision)
  - what_to_collaborate: determine what information should be shared between UAVs (content selection)
  - who_to_collaborate: assess which UAVs are best suited for collaboration (agent selection)
  - why_to_collaborate: explain why information exchange among UAVs is required (reasoning for collaboration)
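For concreteness, here is a sketch of one plausible shape for a single multiple-choice VQA record under this taxonomy. All field names and values are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical VQA record for the task taxonomy above. Every field name
# and value here is an illustrative assumption, not the actual schema.
example_record = {
    "dataset": "Sim3",                 # Sim3 / Sim5 / Sim6 / Real2
    "task": "CD",                      # SU / OU / PA / CD
    "subtask": "who_to_collaborate",
    "images": ["uav1.png", "uav2.png", "uav3.png"],  # simultaneous multi-view images
    "question": "Which UAV's view best complements UAV-1 for tracking the red vehicle?",
    "options": {"A": "UAV-1", "B": "UAV-2", "C": "UAV-3", "D": "None"},
    "answer": "B",
}
```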
The AirCopBench generation pipeline consists of four main steps: Data Collection, Data Annotation, Question Generation, and Quality Control. This systematic approach ensures the validity and high quality of the generated dataset.
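To make the dataflow concrete, here is a schematic sketch of these four steps; the stub bodies are placeholders, not the repository's code:

```python
# Schematic of the four-step generation pipeline; the stub bodies are
# illustrative placeholders, not the repository's implementation.
def collect(sources):            # 1. Data Collection (simulator, real-world, derived)
    return [f"{src}/img_{i}.png" for src in sources for i in range(3)]

def annotate(images):            # 2. Data Annotation (scene/object metadata)
    return [{"image": path, "objects": []} for path in images]

def generate_questions(annos):   # 3. Question Generation (templates for 14 subtasks)
    return [{"image": a["image"], "question": "...", "answer": "A"} for a in annos]

def quality_control(pairs):      # 4. Quality Control (drop invalid pairs)
    return [p for p in pairs if p["answer"]]

vqa_pairs = quality_control(generate_questions(annotate(collect(["Sim3", "Real2"]))))
print(len(vqa_pairs), "VQA pairs")
```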
```bash
# Main project dependencies
pip install -r requirements.txt

# Simulator collection dependencies
cd Data_Collection/Simulator_Collection/EmbodiedCity_Collection
pip install -r requirements.txt
```

```bash
export OPENAI_API_KEY=your_api_key
python integrated_vqa.py
```

- Follow the prompts to select dataset, task, and subtask.
- Results are saved as JSON files in the current directory.
Set detailed API configuration in the scripts if needed:

```python
import openai

openai.api_key = "your_api_key"
openai.api_base = "https://api.openai.com/v1"  # Optional: custom API endpoint
openai.timeout = 60  # Request timeout in seconds
```

Batch process multiple datasets:

```bash
python integrated_vqa.py --batch --datasets Sim3,Sim5,Sim6,Real2 --tasks CD,OU,PA,SU
```

Specify the output format and path:

```bash
python integrated_vqa.py --output-format json --output-dir ./results
```

You can also run any original task script directly (e.g. `Sim3_CD.py`, `Real2_OU.py`):

```bash
cd VQA_Generation/VQA_Sim3
python Sim3_CD.py
```

- All scripts use relative paths based on `datasets/`; no need to edit paths.
- Results are saved as JSON files in the script directory.
```
AirCopBench/
├── Data_Collection/                        # Data collection module
│   ├── Derived_Collection/                 # Image post-processing tools
│   │   ├── apply_noise_to_image.py         # Image noise addition
│   │   └── export_to_excel.py              # JSON-to-Excel export
│   └── Simulator_Collection/               # Simulator data collection
│       └── EmbodiedCity_Collection/        # EmbodiedCity simulator collection
│           ├── main.py                     # Main collection script
│           ├── config.py                   # Configuration parsing
│           ├── uav_manager.py              # UAV management
│           ├── motion.py                   # Motion pattern definitions
│           ├── recorder.py                 # Video recording
│           ├── manual_trajectory_recorder.py  # Manual trajectory recording
│           ├── print_point.py              # Point viewing tool
│           ├── scenarios/                  # Scenario configuration files
│           ├── requirements.txt            # Dependencies
│           └── utils.py                    # Utility functions
├── Data_Annotation/                        # Data annotation module
│   ├── Real2_Sample.json                   # Real2 dataset annotation example
│   ├── Sim3_Sample.json                    # Sim3 dataset annotation example
│   ├── Sim5_Sample.json                    # Sim5 dataset annotation example
│   └── Sim6_Sample.json                    # Sim6 dataset annotation example
├── VQA_Generation/                         # VQA generation module
│   ├── integrated_vqa.py                   # Integrated VQA generation script
│   ├── VQA_Sim3/                           # Sim3 VQA generation
│   ├── VQA_Sim5/                           # Sim5 VQA generation
│   ├── VQA_Sim6/                           # Sim6 VQA generation
│   └── VQA_Real2/                          # Real2 VQA generation
├── AirCopBench_evaluation/                 # Evaluation code for AirCopBench
│   └── evaluation.py                       # Evaluation code example using GPT-4o
├── AirCopBench_sft/                        # Configurations for SFT on AirCopBench
│   ├── llava13b_vqa_sft.yaml               # Configuration for fine-tuning LLaVA-NeXT-13B
│   └── qwen2_5vl_lora_sft.yaml             # Configuration for fine-tuning Qwen2.5-VL / Qwen2-VL
├── requirements.txt                        # Main project dependencies
└── README.md                               # Project documentation
```
AirCopBench comprises 2,920 simultaneous multi-view images with various challenging perception degradations, collected from the real world and simulators, as shown in Fig. (b) and (c). The dataset includes 14,610 questions, covering both basic tasks such as scene and object understanding and advanced tasks such as perception assessment and collaborative decision-making, as detailed in Fig. (a).
Apply noise to images to generate derived degraded data:

```bash
cd Data_Collection/Derived_Collection
python apply_noise_to_image.py
```
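For illustration, here is a minimal sketch of the kind of Gaussian-noise degradation such a script applies. It assumes numpy and Pillow, uses placeholder file names, and is not the repository's implementation:

```python
# Illustrative Gaussian-noise degradation (not the repository's script).
# Assumes numpy and Pillow are installed; file names are placeholders.
import numpy as np
from PIL import Image

def add_gaussian_noise(path: str, sigma: float = 25.0) -> Image.Image:
    """Return the image with zero-mean Gaussian noise of std `sigma` added."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    noisy = img + np.random.normal(0.0, sigma, img.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

add_gaussian_noise("uav_view.png").save("uav_view_noisy.png")
```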
Export annotation JSON files to Excel:

```bash
cd Data_Collection/Derived_Collection
python export_to_excel.py
```

For supervised fine-tuning (SFT) on AirCopBench, please refer to LLaMA-Factory; the configurations in `AirCopBench_sft/` can typically be launched with its `llamafactory-cli train <config>.yaml` command.
We provide example code that uses GPT-4o to run evaluation on our benchmark.
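For orientation, here is a minimal sketch of one way such an evaluation loop can work with the OpenAI Python client (v1+): send each question with its images to GPT-4o and compare the predicted option letter against the ground truth. The record fields and the input file name are assumptions for illustration; see `evaluation.py` for the actual implementation.

```python
# Minimal evaluation-loop sketch (assumed record fields and file name;
# see AirCopBench_evaluation/evaluation.py for the real implementation).
# Requires `pip install openai` (v1+) and OPENAI_API_KEY in the environment.
import base64
import json

from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Return the image file as a base64 string for the data-URL payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

correct = total = 0
for rec in json.load(open("vqa_pairs.json")):          # assumed file name
    prompt = rec["question"] + "\n" + "\n".join(
        f"{k}. {v}" for k, v in rec["options"].items()
    ) + "\nAnswer with the option letter only."
    content = [{"type": "text", "text": prompt}]
    for img in rec["images"]:                          # one entry per UAV view
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(img)}"},
        })
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    pred = resp.choices[0].message.content.strip()[:1]  # first char = chosen option
    correct += pred == rec["answer"]
    total += 1

print(f"Accuracy: {correct}/{total} = {correct / total:.3f}")
```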
```bash
python AirCopBench_evaluation/evaluation.py  # remember to set the API key and dataset path in the code
```

Thanks to all contributors and the open-source community for inspiration and support.
If you use this project in your research, please cite the following paper:
```bibtex
@article{zha2025aircopbench,
  title={AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning},
  author={Zha, Jirong and Fan, Yuxuan and Zhang, Tianyu and Chen, Geng and Chen, Yingfeng and Gao, Chen and Chen, Xinlei},
  journal={arXiv preprint arXiv:2511.11025},
  year={2025}
}
```


