ATP-Bench is a benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to act as agents that generate interleaved text-and-image content. It focuses on "Agentic Tool Planning," where the model autonomously decides when, where, and which tools (e.g., text-to-image generation, image search) to invoke in order to produce a coherent multimodal response.
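For illustration only, an interleaved tool plan might look like the sketch below; the tool names and schema here are hypothetical, since the actual tool-call format is defined by `inference/prompt.txt`.

```python
# Hypothetical example of an interleaved tool plan; the real schema is
# defined by inference/prompt.txt and may differ.
example_plan = [
    {"type": "text", "content": "Here is an overview of the Eiffel Tower."},
    {"type": "tool_call", "tool": "image_search",      # hypothetical tool name
     "arguments": {"query": "Eiffel Tower at night"}},
    {"type": "text", "content": "At night, the tower is illuminated as shown above."},
    {"type": "tool_call", "tool": "text_to_image",     # hypothetical tool name
     "arguments": {"prompt": "watercolor painting of the Eiffel Tower"}},
]
```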
```
.
├── data/
│   ├── data.jsonl            # Core benchmark dataset
│   └── images.tar.gz         # Image resources (managed by Git LFS)
├── inference/
│   ├── inference.py          # Main inference script (GPT/Gemini)
│   ├── inference_intern.py   # Script for InternLM series
│   ├── inference_llama.py    # Script for Llama series
│   └── prompt.txt            # System prompt for Agentic Tool Planning
├── eval/
│   ├── MAM_judge.py          # Multi-Agent MLLM-as-a-Judge (MAM) system
│   └── eval.py               # Result processing & leaderboard generation
└── src/
    └── utils.py              # Utility functions (API calls, image encoding)
```
The `data/data.jsonl` file contains 7,702 QA pairs. Key fields include:

- `trace_id`: Unique identifier for each sample.
- `query`: Original user instruction.
- `image_query`: User-uploaded image query.
- `doc`: Relevant context/documents.
- `images`: Paths/URLs to document images.
- `gt`: Ground truth (for reference).
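For a quick sanity check, the dataset can be read line by line; a minimal sketch using the field names listed above (exact value types, e.g. whether `images` is a list or a single path, may vary):

```python
import json

# Minimal sketch: iterate over the benchmark samples and inspect a few fields.
with open("data/data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        print(sample["trace_id"], sample["query"][:80])
```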
Clone the repository and install the dependencies:
```bash
git lfs pull
pip install -r requirements.txt
```

Extract the image resources:

```bash
tar -xzvf data/images.tar.gz -C data/
```

If you are using OpenAI or Gemini models, set your API keys as environment variables:

```bash
export OPENAI_API_KEY='your-api-key'
export GOOGLE_API_KEY='your-api-key'
```

Make sure you adapt `request_gpt` and `request_gemini` in `src/utils.py` to your own API setup.
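As a rough sketch of what an adapted `request_gpt` might look like, assuming the official OpenAI Python SDK (the actual function signature in `src/utils.py` may differ):

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def request_gpt(messages, model="gpt-4o"):
    """Sketch of a chat-completion wrapper; adapt it to match the signature
    expected by the inference scripts in this repository."""
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```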
The evaluation follows a three-step pipeline: Inference -> MAM-Judge -> Statistics Calculation.
Run the inference script to generate model outputs. The script supports multi-threading and resumption of interrupted runs (already-processed samples are skipped).
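A common way to implement such a multi-threaded, resumable driver looks roughly like the following; this is a sketch only, and the repository's scripts may organize it differently:

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor

def run_all(samples, output_path, infer_fn, workers=10):
    """Sketch: skip samples whose trace_id is already in the output file,
    then process the remainder in parallel and append results as JSONL.
    `infer_fn` is assumed to take one sample dict and return a result dict."""
    done = set()
    if os.path.exists(output_path):
        with open(output_path, "r", encoding="utf-8") as f:
            done = {json.loads(line)["trace_id"] for line in f}
    todo = [s for s in samples if s["trace_id"] not in done]

    with open(output_path, "a", encoding="utf-8") as out, \
         ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(infer_fn, todo):
            out.write(json.dumps(result, ensure_ascii=False) + "\n")
```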
Example invocations:

```bash
# For API-based inference
python inference/inference.py --base_dir PATH_TO_THE_REPO --model gpt-4o --workers 10 --output_dir results/

# For local models
python inference/inference_llama.py --base_dir PATH_TO_THE_REPO --model_path PATH_TO_YOUR_MODEL/Llama-3.2-11B
python inference/inference_intern.py --base_dir PATH_TO_THE_REPO --model_path PATH_TO_YOUR_MODEL/InternVL3_5-14B
```

We use `MAM_judge.py`, a reference-free evaluation system in which multiple expert agents collaborate to assess:
- Tool-call precision: Did the model call the right tool at the right time?
- Missed opportunities: Should the model have used a tool but didn't?
- Overall quality: The coherence and factuality of the interleaved response.
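Conceptually, the judging stage can be thought of as several independent judge calls whose per-dimension scores are aggregated; the sketch below is purely illustrative and does not reflect the exact agents, prompts, or aggregation logic in `MAM_judge.py`.

```python
from statistics import mean

# Purely illustrative: combine per-dimension scores from several judge agents.
def aggregate_verdicts(verdicts):
    """verdicts: list of dicts such as
    {"tool_call_precision": 0.8, "missed_opportunities": 0.9, "overall_quality": 0.7}"""
    dimensions = verdicts[0].keys()
    return {dim: mean(v[dim] for v in verdicts) for dim in dimensions}
```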
Run the judge on a model's outputs:

```bash
python eval/MAM_judge.py --base_dir PATH_TO_THE_REPO --input_path results/gpt-4o.jsonl
```

Finally, generate the leaderboard and fine-grained results (by category and intent):
```bash
python eval/eval.py
```

Make sure you update the path to your results folder in the script before running it.
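If you want to slice the judged results yourself, a simple pandas groupby works; the file name and field names below (`category`, `intent`, `score`) are illustrative and should be matched to the actual output of `eval/MAM_judge.py`.

```python
import pandas as pd

# Illustrative only: adjust the path and field names to the judge's real output.
df = pd.read_json("results/gpt-4o.judged.jsonl", lines=True)
print(df.groupby("category")["score"].mean())
print(df.groupby("intent")["score"].mean())
```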
For any questions regarding the benchmark or code, please open an issue or contact the authors.