ATP-Bench: Evaluating Agentic Tool Planning for Interleaved Generation

ATP-Bench is a benchmark designed to evaluate the capability of Multimodal Large Language Models (MLLMs) to act as agents that generate interleaved text-and-image content. It focuses on "Agentic Tool Planning," where the model autonomously decides when, where, and which tools (e.g., text-to-image, image search) to invoke in order to produce a coherent multimodal response.

📂 Project Structure

.
├── data/
│   ├── data.jsonl          # Core benchmark dataset
│   └── images.tar.gz       # Image resources (Managed by Git LFS)
├── inference/
│   ├── inference.py        # Main inference script (GPT/Gemini)
│   ├── inference_intern.py # Script for InternLM series
│   ├── inference_llama.py  # Script for Llama series
│   └── prompt.txt          # System prompt for Agentic Tool Planning
├── eval/
│   ├── MAM_judge.py        # Multi-Agent MLLM-as-a-Judge (MAM) system
│   └── eval.py             # Result processing & Leaderboard generation
└── src/
    └── utils.py            # Utility functions (API calls, Image encoding)

📊 Dataset Details

The data/data.jsonl file contains 7,702 QA pairs. Key fields include:

  • trace_id: Unique identifier for each sample.
  • query: Original user instruction.
  • image_query: User-uploaded image accompanying the query.
  • doc: Relevant context/documents.
  • images: Paths/URLs to document images.
  • gt: Ground-truth answer (for reference).
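
Below is a minimal loading sketch using Python's standard library; the field access mirrors the list above, but any printed values are illustrative and depend on your local copy of the data:

import json

# Read every benchmark sample from the JSONL file (one JSON object per line).
with open("data/data.jsonl", "r", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

print(len(samples))  # expected: 7702

# Each record is a dict keyed by the fields listed above.
sample = samples[0]
print(sample["trace_id"], sample["query"])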

🛠️ Getting Started

1. Environment Setup

After cloning the repository, pull the Git LFS assets and install the dependencies:

git lfs pull
pip install -r requirements.txt

2. Data Preparation

Extract the image resources:

tar -xzvf data/images.tar.gz -C data/

3. API Configuration

If using OpenAI or Gemini models, set your API keys as environment variables:

export OPENAI_API_KEY='your-api-key'
export GOOGLE_API_KEY='your-api-key'

Make sure to adapt request_gpt and request_gemini in src/utils.py to your own API implementation.
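
As a rough starting point, a minimal request_gpt could look like the sketch below, assuming the official openai Python client; the actual signature and return format expected by inference/inference.py may differ, so adapt accordingly:

import os
from openai import OpenAI

# Hypothetical implementation; match the signature actually used in src/utils.py.
def request_gpt(messages, model="gpt-4o"):
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model=model,
        messages=messages,  # e.g. [{"role": "user", "content": "..."}]
    )
    return response.choices[0].message.content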

🚀 Running Evaluation Pipeline

The evaluation follows a three-step pipeline: Inference -> MAM-Judge -> Statistics Calculation.

Step 1: Inference

Run the inference script to generate model outputs. The script supports multi-threading and resuming from interrupted runs.

# For API-based inference
python inference/inference.py --base_dir PATH_TO_THE_REPO --model gpt-4o --workers 10 --output_dir results/

# For local models
python inference/inference_llama.py --base_dir PATH_TO_THE_REPO --model_path PATH_TO_YOUR_MODEL/Llama-3.2-11B
python inference/inference_intern.py --base_dir PATH_TO_THE_REPO --model_path PATH_TO_YOUR_MODEL/InternVL3_5-14B

Step 2: MAM-Judge (Multi-Agent Evaluation)

We use MAM_judge.py, a reference-free evaluation system. Multiple Expert Agents collaborate to assess:

  • Tool-call precision: Did the model call the right tool at the right time?
  • Missed opportunities: Should the model have used a tool but didn't?
  • Overall quality: The coherence and factuality of the interleaved response.

python eval/MAM_judge.py --base_dir PATH_TO_THE_REPO --input_path results/gpt-4o.jsonl

Step 3: Calculate Scores

Generate the final leaderboard and fine-grained results (by category and intent):

python eval/eval.py

Make sure to update the path to your results folder in the script before running it.
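
As a hypothetical illustration, the change may be as small as pointing a directory constant at your own outputs (the actual variable name and location in eval/eval.py may differ):

# Hypothetical: near the top of eval/eval.py, point the results directory
# at the folder containing the JSONL files produced by MAM_judge.py.
RESULTS_DIR = "results/"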

📧 Contact

For any questions regarding the benchmark or code, please open an issue or contact the authors.
