
Conversation

Winston-Yuan

Add VLRMBench Support

This PR integrates VLRMBench into VLMEvalKit for evaluating vision-language models on reasoning error detection tasks.

About VLRMBench

VLRMBench is introduced in our paper "VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models".

What it does: VLRMBench evaluates whether models can identify errors in multi-step visual reasoning processes. It includes 12 sub-datasets covering different error types such as:

  • Attribute/existence hallucinations
  • Detail/location errors
  • Step correctness verification
  • Foresight reasoning
  • Multi-solution identification

Dataset size: 10K+ samples across 12 tasks with visual reasoning chains and error annotations.

Changes Made

New Files

  • vlmeval/dataset/vlrmbench.py - VLRMBench dataset implementation with automatic HuggingFace download
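
For reference, a minimal sketch of the shape such a dataset class typically takes in VLMEvalKit (the TYPE value, column names, and method body below are illustrative assumptions; the actual implementation in vlrmbench.py may differ):

# Illustrative sketch only -- not the code added in this PR. It assumes
# VLMEvalKit's ImageBaseDataset interface; the 'question' column is an assumption.
from vlmeval.dataset.image_base import ImageBaseDataset

class VLRMBenchBase(ImageBaseDataset):
    TYPE = 'VQA'

    def build_prompt(self, line):
        # Turn one record (image + multi-step reasoning chain) into the
        # multimodal message list that VLMEvalKit models consume.
        tgt_path = self.dump_image(line)
        msgs = [dict(type='image', value=p) for p in tgt_path]
        msgs.append(dict(type='text', value=line['question']))  # assumed column name
        return msgs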

Modified Files

  • vlmeval/dataset/__init__.py - Register 12 VLRMBench dataset classes
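
Registration in __init__.py typically amounts to importing the new classes and adding them to the dataset class list that the builder searches. A hedged sketch of that pattern (the class names and list variable are illustrative, not necessarily what this PR uses):

# Sketch of the registration pattern in vlmeval/dataset/__init__.py;
# the imported class names and the list variable are illustrative.
from .vlrmbench import VLRMBenchBase, VLRMBenchMultiSolution

IMAGE_DATASET = [
    # ... existing dataset classes ...
    VLRMBenchBase,
    VLRMBenchMultiSolution,
]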

Usage

# Evaluate on a single sub-dataset
python run.py --data VLRMBench_attribute_hallucination --model Qwen2-VL-7B-Instruct

# Evaluate on multiple sub-datasets
python run.py --data VLRMBench_foresight VLRMBench_step_correctness --model Qwen2-VL-7B-Instruct

Supported Datasets

All 12 sub-datasets are available:

VLRMBench_attribute_hallucination, VLRMBench_detail_error, 
VLRMBench_step_correctness, VLRMBench_existence_hallucination,
VLRMBench_image_ref_error, VLRMBench_location_error,
VLRMBench_most_confidence, VLRMBench_redundant_det,
VLRMBench_foresight, VLRMBench_multi_solution,
VLRMBench_error_correction, VLRMBench_error_reason_analysis
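
All twelve subsets can also be instantiated programmatically by name. The sketch below assumes VLMEvalKit's build_dataset helper; run.py (see Usage above) remains the supported entry point for actual evaluation:

# Sketch: build every VLRMBench subset by name (triggers the HuggingFace
# download on first use). Assumes VLMEvalKit's build_dataset helper.
from vlmeval.dataset import build_dataset

SUBSETS = [
    'VLRMBench_attribute_hallucination', 'VLRMBench_detail_error',
    'VLRMBench_step_correctness', 'VLRMBench_existence_hallucination',
    'VLRMBench_image_ref_error', 'VLRMBench_location_error',
    'VLRMBench_most_confidence', 'VLRMBench_redundant_det',
    'VLRMBench_foresight', 'VLRMBench_multi_solution',
    'VLRMBench_error_correction', 'VLRMBench_error_reason_analysis',
]

for name in SUBSETS:
    dataset = build_dataset(name)
    print(name, len(dataset))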

Citation

@article{ruan2025vlrmbench,
  title={VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models},
  author={Ruan, Jiacheng and Yuan, Wenzhen and Gao, Xian and Guo, Ye and Zhang, Daoxin and Xu, Zhe and Hu, Yao and Liu, Ting and Fu, Yuzhuo},
  journal={arXiv preprint arXiv:2503.07478},
  year={2025}
}
Notes

  • Data automatically downloads from HuggingFace (Winston-Yuan/VLRMBench)

@FangXinyu-0913
Collaborator

Hi @Winston-Yuan, I ran inference on one subset, but no evaluation was conducted. It looks like the VLRMBenchBase class does not define an evaluate function, so evaluation is skipped. Could you please review and update your code?

My command:
torchrun --nproc-per-node=2 run.py --model Qwen2.5-VL-7B-Instruct --data VLRMBench_attribute_hallucination --verbose
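
For reference, a minimal sketch of what an evaluate hook for the step-based subsets could look like, assuming exact-match scoring and that the inference output file contains 'prediction' and 'answer' columns (the actual column names and scoring rules are up to the PR author):

# Sketch of a VLRMBenchBase.evaluate method -- not the PR's code. Assumes
# 'prediction' and 'answer' columns in the inference output file.
from vlmeval.smp import load, dump

def evaluate(self, eval_file, **judge_kwargs):
    data = load(eval_file)
    # Exact-match scoring of predicted labels against ground truth.
    hit = (data['prediction'].astype(str).str.strip().str.lower()
           == data['answer'].astype(str).str.strip().str.lower())
    data['hit'] = hit.astype(int)
    dump(data, eval_file.replace('.xlsx', '_score.xlsx'))
    return {'accuracy': float(hit.mean())}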

FangXinyu-0913 self-assigned this on Oct 10, 2025
@Winston-Yuan
Author

Hello, I have contacted you via ailab's Feishu. Would it be convenient to discuss over Feishu?

@SYuan03
Collaborator

SYuan03 commented Oct 16, 2025

Hi @Winston-Yuan , I noticed that the latest code doesn’t seem to include the sub-item evaluations mentioned in the paper. Could you please add support for those additional subsets? That would help us align our evaluations more quickly and merge the PR sooner.

@Winston-Yuan
Author

Our current code supports the eight Step-based task types ('VLRMBench') and Multi-solution Judgment ('VLRMBench_MultiSolution'). Since Criticism-based tasks require invoking additional evaluation models for assessment, we recommend using the code from JCruan519/VLRMBench for those evaluations.

@Winston-Yuan
Author

Is it necessary for us to add a README to explain this in the code?

