This is the official repository for the paper:
MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions

Citation:
@article{huang2025mvisu,
  title={MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions},
  author={Huang, Zeyu and Wang, Juyuan and Chen, Longfeng and Xiao, Boyi and Cai, Leng and Zeng, Yawen and Xu, Jin},
  journal={arXiv preprint arXiv:2508.09057},
  year={2025}
}
MVISU-Bench is a bilingual benchmark for evaluating mobile agents on real-world tasks, covering multi-app, vague, interactive, single-app, and unethical instructions. This repository contains carefully curated datasets in both English and Chinese to support research and development in cross-lingual mobile agent evaluation.
The repository also includes LLaMA-Factory-Aider, a customized version of LLaMA Factory specifically adapted for training and fine-tuning mobile agent models. This toolkit provides comprehensive support for model training, evaluation, and deployment.
- Bilingual Support: Parallel datasets in English and Chinese
- High Quality: Expert-annotated data with rigorous quality control
- Comprehensive: Covers a wide range of real-world mobile-app scenarios
- Easy to Use: Simple JSON format for easy integration
- Real-World Focus: Specifically designed for mobile agent evaluation
- Diverse Scenarios: Includes multi-app, vague, interactive, single-app, and unethical instruction cases
MVISU-Bench/
├── data/
│   ├── en.json              # English dataset
│   └── zh.json              # Chinese dataset
├── LLaMA-Factory-Aider/     # LLaMA Factory Aider toolkit
│   ├── data/
│   ├── evaluation/
│   ├── src/
│   ├── tests/
│   ├── assets/
│   ├── examples/
│   ├── scripts/
│   ├── docker/
│   └── ... (other configuration files)
├── LICENSE
└── README.md
graph TD
A[MVISU-Bench] --> B[data]
A --> C[LLaMA-Factory-Aider]
A --> D[LICENSE]
A --> E[README.md]
B --> B1[en.json]
B --> B2[zh.json]
C --> C1[data]
C --> C2[evaluation]
C --> C3[src]
C --> C4[tests]
C --> C5[assets]
C --> C6[examples]
C --> C7[scripts]
C --> C8[docker]
C --> C9[other configuration files]
style A fill:#f9f,stroke:#333,stroke-width:2px
style B fill:#bbf,stroke:#333,stroke-width:2px
style C fill:#bbf,stroke:#333,stroke-width:2px
style D fill:#bfb,stroke:#333,stroke-width:2px
style E fill:#bfb,stroke:#333,stroke-width:2px
Each entry in the dataset is a JSON object with the following fields:
{
  "ID": 1,
  "TaskType": "Single-App",
  "APP": ["Google"],
  "APPType": ["General Tool"],
  "Instruction": "Search on Google to tell me how French fries should be cooked."
}
- ID: Unique identifier for the instruction
- TaskType: Instruction category (e.g., Multi-App, Vague, Interactive, Single-App, Unethical)
- APP: List of involved app names (can be empty for vague instructions)
- APPType: List of app categories (aligned with APP, can be empty)
- Instruction: The user instruction text
For vague instructions, APP and APPType may be empty arrays.
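For quick inspection, the snippet below is a minimal Python sketch that loads the English split and prints one record; it assumes each file is a JSON array of objects in the format above.

import json

# Load the English split (the Chinese split lives at data/zh.json).
with open("data/en.json", encoding="utf-8") as f:
    records = json.load(f)

print(f"Loaded {len(records)} instructions")

# Inspect the fields of the first record.
sample = records[0]
for field in ("ID", "TaskType", "APP", "APPType", "Instruction"):
    print(f"{field}: {sample.get(field)}")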
- Total Instructions: 404 (Chinese: 206, English: 198)
- Instruction Category Distribution:
- Multi-App: CN 62 (30.10%), EN 56 (28.28%)
- Vague: CN 36 (17.48%), EN 36 (18.18%)
- Interactive: CN 32 (15.53%), EN 36 (18.18%)
- Single-App: CN 40 (19.42%), EN 35 (17.68%)
- Unethical: CN 36 (17.48%), EN 35 (17.68%)
- Application Category Distribution:
- System Tool: CN 11 (16.18%), EN 10 (14.49%)
- Lifestyle: CN 28 (41.18%), EN 25 (36.23%)
- Social Media: CN 6 (8.82%), EN 9 (13.04%)
- Shopping: CN 4 (5.88%), EN 5 (7.25%)
- General Tool: CN 19 (27.94%), EN 20 (28.99%)
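The per-category counts above can be re-derived from the files themselves; the sketch below tallies TaskType for each split, again assuming each file is a JSON array of records as described earlier.

import json
from collections import Counter

for split in ("data/zh.json", "data/en.json"):
    with open(split, encoding="utf-8") as f:
        records = json.load(f)
    counts = Counter(record["TaskType"] for record in records)
    total = sum(counts.values())
    print(split)
    for task_type, n in counts.most_common():
        print(f"  {task_type}: {n} ({n / total:.2%})")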
The MVISU-Bench dataset is available on HuggingFace: MVISU-Bench Dataset on HuggingFace
You can find the Qwen2.5_vl_3B_Aider model weights on HuggingFace: Qwen2.5_vl_3B_Aider on HuggingFace
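As an illustration, the released weights can be loaded like any other Qwen2.5-VL checkpoint with a recent version of transformers; the repository ID below is a placeholder, not the actual identifier, so replace it with the model page linked above.

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Placeholder ID -- replace with the Qwen2.5_vl_3B_Aider repository linked above.
model_id = "your-org/Qwen2.5_vl_3B_Aider"

processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",   # choose bf16/fp16 automatically where supported
    device_map="auto",    # requires accelerate; places weights on available GPUs
)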
This project is based on the open-source project LLaMA Factory, with customizations specifically for mobile agent training and evaluation.
# git clone repo
cd LLaMA-Factory-Aider
conda create -n your_env_name python==3.10.16
conda activate your_env_name
pip install -e ".[torch,metrics]" --no-build-isolation
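After installation, a quick sanity check is to call the CLI from Python (or directly from the shell); this assumes the llamafactory-cli entry point installed by the pip command above is on your PATH.

import subprocess

# Print the installed LLaMA Factory version to confirm the environment is set up.
subprocess.run(["llamafactory-cli", "version"], check=True)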
In this project, we provide four types of training data demos, located at LLaMA-Factory-Aider/data/mllm_demo_data. You can collect your own data based on your specific needs to perform personalized fine-tuning of Aider. For details on data formats, data registration, and more, please refer to the official LLaMA Factory README.
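For orientation, one multimodal training sample in LLaMA Factory's ShareGPT-style format looks roughly like the record built below; the instruction text, response, and image path are made up for illustration, so refer to the demo files and the LLaMA Factory README for the exact schema Aider expects.

import json

# Hypothetical single sample: a list of chat messages plus the screenshot(s) they reference.
sample = {
    "messages": [
        {"role": "user", "content": "<image>Open the Settings app and enable dark mode."},
        {"role": "assistant", "content": "I opened Settings and toggled Dark mode on."},
    ],
    "images": ["mllm_demo_data/example_screenshot.jpg"],
}

# Training files are JSON arrays of such samples, registered in data/dataset_info.json.
print(json.dumps([sample], ensure_ascii=False, indent=2))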
The training script is as follows:
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \
--do_train True \
--model_name_or_path your_qwen2.5_vl_path \
--preprocessing_num_workers 16 \
--finetuning_type lora \
--template qwen2_vl \
--flash_attn auto \
--dataset_dir data \
--dataset demo_mllm_qwen2.5vl \
--cutoff_len 2048 \
--learning_rate 5e-05 \
--num_train_epochs 3.0 \
--max_samples 100000 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 1.0 \
--logging_steps 5 \
--save_steps 3000 \
--warmup_steps 0 \
--optim adamw_torch \
--packing False \
--report_to none \
--output_dir save_dir \
--bf16 True \
--plot_loss True \
--ddp_timeout 180000000 \
--include_num_input_tokens_seen True \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0 \
--lora_target all
The merging script (which merges the LoRA adapter into the base model) is as follows:
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export \
--model_name_or_path your_qwen2.5_vl_path \
--adapter_name_or_path your_qwen2.5_vl_sft_path \
--template qwen2_vl \
--finetuning_type lora \
--export_dir output_dir \
--export_size 2 \
--export_device cpu \
--export_legacy_format False
- Mobile Agent Optimization: Customized training configurations for mobile agent tasks
- Multi-Modal Support: Enhanced support for visual-language models
- Flexible Training: Support for various training methods including LoRA fine-tuning
- Evaluation Tools: Built-in tools for model evaluation and benchmarking
- Easy Deployment: Streamlined process for model deployment and testing
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions! Please feel free to submit a Pull Request.
For any questions or suggestions, please open an issue in this repository.