ATP-Bench is a benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to act as agents that generate interleaved text-and-image content. It focuses on "Agentic Tool Planning," where the model autonomously decides when, where, and which tools (e.g., text-to-image generation, image search) to invoke in order to produce a coherent multimodal response.
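For illustration only, an interleaved tool plan might look like the sketch below; the tool names and schema here are hypothetical, since the actual tool-call format is defined by `inference/prompt.txt`.

```python
# Hypothetical example of an interleaved tool plan; the real schema is
# defined by inference/prompt.txt and may differ.
example_plan = [
    {"type": "text", "content": "Here is an overview of the Eiffel Tower."},
    {"type": "tool_call", "tool": "image_search",      # hypothetical tool name
     "arguments": {"query": "Eiffel Tower at night"}},
    {"type": "text", "content": "At night, the tower is illuminated as shown above."},
    {"type": "tool_call", "tool": "text_to_image",     # hypothetical tool name
     "arguments": {"prompt": "watercolor painting of the Eiffel Tower"}},
]
```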
```
.
├── data/
│   ├── data.jsonl            # Core benchmark dataset
│   └── images.tar.gz         # Image resources (managed by Git LFS)
├── inference/
│   ├── inference.py          # Main inference script (GPT/Gemini)
│   ├── inference_intern.py   # Script for InternLM series
│   ├── inference_llama.py    # Script for Llama series
│   └── prompt.txt            # System prompt for Agentic Tool Planning
├── eval/
│   ├── MAM_judge.py          # Multi-Agent MLLM-as-a-Judge (MAM) system
│   └── eval.py               # Result processing & leaderboard generation
└── src/
    └── utils.py              # Utility functions (API calls, image encoding)
```
The `data/data.jsonl` file contains 7,702 QA pairs. Key fields include:

- `trace_id`: Unique identifier for each sample.
- `query`: Original user instruction.
- `image_query`: User-uploaded image query.
- `doc`: Relevant context/documents.
- `images`: Paths/URLs to document images.
- `gt`: Ground truth (for reference).
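For a quick sanity check, the dataset can be read line by line; a minimal sketch using the field names listed above (exact value types, e.g. whether `images` is a list or a single path, may vary):

```python
import json

# Minimal sketch: iterate over the benchmark samples and inspect a few fields.
with open("data/data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        print(sample["trace_id"], sample["query"][:80])
```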
Clone the repository and install the dependencies:
```bash
git lfs pull
pip install -r requirements.txt
```

Extract the image resources:

```bash
tar -xzvf data/images.tar.gz -C data/
```

If you are using OpenAI or Gemini models, set your API keys as environment variables:

```bash
export OPENAI_API_KEY='your-api-key'
export GOOGLE_API_KEY='your-api-key'
```

Make sure you adapt `request_gpt` and `request_gemini` in `src/utils.py` to your own API setup.
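As a rough sketch of what an adapted `request_gpt` might look like, assuming the official OpenAI Python SDK (the actual function signature in `src/utils.py` may differ):

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def request_gpt(messages, model="gpt-4o"):
    """Sketch of a chat-completion wrapper; adapt it to match the signature
    expected by the inference scripts in this repository."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```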
The evaluation follows a three-step pipeline: Inference -> MAM-Judge -> Statistics Calculation.
Run the inference script to generate model outputs. The script supports multi-threading and resumption of interrupted runs (already-processed samples are skipped).
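A common way to implement such a multi-threaded, resumable driver looks roughly like the following; this is a sketch only, and the repository's scripts may organize it differently:

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor

def run_all(samples, output_path, infer_fn, workers=10):
    """Sketch: skip samples whose trace_id is already in the output file,
    then process the remainder in parallel and append results as JSONL.
    `infer_fn` is assumed to take one sample dict and return a result dict."""
    done = set()
    if os.path.exists(output_path):
        with open(output_path, "r", encoding="utf-8") as f:
            done = {json.loads(line)["trace_id"] for line in f}
    todo = [s for s in samples if s["trace_id"] not in done]

    with open(output_path, "a", encoding="utf-8") as out, \
         ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(infer_fn, todo):
            out.write(json.dumps(result, ensure_ascii=False) + "\n")
```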
Example invocations:

```bash
# For API-based inference
python inference/inference.py --base_dir PATH_TO_THE_REPO --model gpt-4o --workers 10 --output_dir results/

# For local models
python inference/inference_llama.py --base_dir PATH_TO_THE_REPO --model_path PATH_TO_YOUR_MODEL/Llama-3.2-11B
python inference/inference_intern.py --base_dir PATH_TO_THE_REPO --model_path PATH_TO_YOUR_MODEL/InternVL3_5-14B
```

We use `MAM_judge.py`, a reference-free evaluation system in which multiple expert agents collaborate to assess:
- Tool-call precision: Did the model call the right tool at the right time?
- Missed opportunities: Should the model have used a tool but didn't?
- Overall quality: The coherence and factuality of the interleaved response.
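Conceptually, the judging stage can be thought of as several independent judge calls whose per-dimension scores are aggregated; the sketch below is purely illustrative and does not reflect the exact agents, prompts, or aggregation logic in `MAM_judge.py`.

```python
from statistics import mean

# Purely illustrative: combine per-dimension scores from several judge agents.
def aggregate_verdicts(verdicts):
    """verdicts: list of dicts such as
    {"tool_call_precision": 0.8, "missed_opportunities": 0.9, "overall_quality": 0.7}"""
    dimensions = verdicts[0].keys()
    return {dim: mean(v[dim] for v in verdicts) for dim in dimensions}
```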
Run the judge on a model's outputs:

```bash
python eval/MAM_judge.py --base_dir PATH_TO_THE_REPO --input_path results/gpt-4o.jsonl
```

Finally, generate the leaderboard and fine-grained results (by category and intent):
```bash
python eval/eval.py
```

Make sure you update the path to your results folder in the script before running it.
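If you want to slice the judged results yourself, a simple pandas groupby works; the file name and field names below (`category`, `intent`, `score`) are illustrative and should be matched to the actual output of `eval/MAM_judge.py`.

```python
import pandas as pd

# Illustrative only: adjust the path and field names to the judge's real output.
df = pd.read_json("results/gpt-4o.judged.jsonl", lines=True)
print(df.groupby("category")["score"].mean())
print(df.groupby("intent")["score"].mean())
```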
For any questions regarding the benchmark or code, please open an issue or contact the authors.