
Conversation

Winston-Yuan

Add VLRMBench Support

This PR integrates VLRMBench into VLMEvalKit for evaluating vision-language models on reasoning error detection tasks.

About VLRMBench

VLRMBench is introduced in our paper "VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models".

What it does: VLRMBench evaluates whether models can identify errors in multi-step visual reasoning processes. It includes 12 sub-datasets covering different error types such as:

  • Attribute/existence hallucinations
  • Detail/location errors
  • Step correctness verification
  • Foresight reasoning
  • Multi-solution identification

Dataset size: 10K+ samples across 12 tasks with visual reasoning chains and error annotations.

Changes Made

New Files

  • vlmeval/dataset/vlrmbench.py - VLRMBench dataset implementation with automatic HuggingFace download
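
For reference, a minimal sketch of the shape such a dataset class typically takes in VLMEvalKit (the TYPE value, column names, and method body below are illustrative assumptions; the actual implementation in vlrmbench.py may differ):

# Illustrative sketch only -- not the code added in this PR. It assumes
# VLMEvalKit's ImageBaseDataset interface; the 'question' column is an assumption.
from vlmeval.dataset.image_base import ImageBaseDataset

class VLRMBenchBase(ImageBaseDataset):
    TYPE = 'VQA'

    def build_prompt(self, line):
        # Turn one record (image + multi-step reasoning chain) into the
        # multimodal message list that VLMEvalKit models consume.
        tgt_path = self.dump_image(line)
        msgs = [dict(type='image', value=p) for p in tgt_path]
        msgs.append(dict(type='text', value=line['question']))  # assumed column name
        return msgs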

Modified Files

  • vlmeval/dataset/__init__.py - Register 12 VLRMBench dataset classes
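
Registration in __init__.py typically amounts to importing the new classes and adding them to the dataset class list that the builder searches. A hedged sketch of that pattern (the class names and list variable are illustrative, not necessarily what this PR uses):

# Sketch of the registration pattern in vlmeval/dataset/__init__.py;
# the imported class names and the list variable are illustrative.
from .vlrmbench import VLRMBenchBase, VLRMBenchMultiSolution

IMAGE_DATASET = [
    # ... existing dataset classes ...
    VLRMBenchBase,
    VLRMBenchMultiSolution,
]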

Usage

# Evaluate on a single sub-dataset
python run.py --data VLRMBench_attribute_hallucination --model Qwen2-VL-7B-Instruct

# Evaluate on multiple sub-datasets
python run.py --data VLRMBench_foresight VLRMBench_step_correctness --model Qwen2-VL-7B-Instruct

Supported Datasets

All 12 sub-datasets are available:

VLRMBench_attribute_hallucination, VLRMBench_detail_error, 
VLRMBench_step_correctness, VLRMBench_existence_hallucination,
VLRMBench_image_ref_error, VLRMBench_location_error,
VLRMBench_most_confidence, VLRMBench_redundant_det,
VLRMBench_foresight, VLRMBench_multi_solution,
VLRMBench_error_correction, VLRMBench_error_reason_analysis
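
All twelve subsets can also be instantiated programmatically by name. The sketch below assumes VLMEvalKit's build_dataset helper; run.py (see Usage above) remains the supported entry point for actual evaluation:

# Sketch: build every VLRMBench subset by name (triggers the HuggingFace
# download on first use). Assumes VLMEvalKit's build_dataset helper.
from vlmeval.dataset import build_dataset

SUBSETS = [
    'VLRMBench_attribute_hallucination', 'VLRMBench_detail_error',
    'VLRMBench_step_correctness', 'VLRMBench_existence_hallucination',
    'VLRMBench_image_ref_error', 'VLRMBench_location_error',
    'VLRMBench_most_confidence', 'VLRMBench_redundant_det',
    'VLRMBench_foresight', 'VLRMBench_multi_solution',
    'VLRMBench_error_correction', 'VLRMBench_error_reason_analysis',
]

for name in SUBSETS:
    dataset = build_dataset(name)
    print(name, len(dataset))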

Citation

@article{ruan2025vlrmbench,
  title={VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models},
  author={Ruan, Jiacheng and Yuan, Wenzhen and Gao, Xian and Guo, Ye and Zhang, Daoxin and Xu, Zhe and Hu, Yao and Liu, Ting and Fu, Yuzhuo},
  journal={arXiv preprint arXiv:2503.07478},
  year={2025}
}
Notes

  • Data automatically downloads from HuggingFace (Winston-Yuan/VLRMBench)

@FangXinyu-0913
Collaborator

Hi @Winston-Yuan, I ran inference on one subset, but no evaluation was conducted. It looks like the VLRMBenchBase class does not define an evaluate function, so evaluation is skipped. Could you please review and update your code?

My command:
torchrun --nproc-per-node=2 run.py --model Qwen2.5-VL-7B-Instruct --data VLRMBench_attribute_hallucination --verbose
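
For reference, a minimal sketch of what an evaluate hook for the step-based subsets could look like, assuming exact-match scoring and that the inference output file contains 'prediction' and 'answer' columns (the actual column names and scoring rules are up to the PR author):

# Sketch of a VLRMBenchBase.evaluate method -- not the PR's code. Assumes
# 'prediction' and 'answer' columns in the inference output file.
from vlmeval.smp import load, dump

def evaluate(self, eval_file, **judge_kwargs):
    data = load(eval_file)
    # Exact-match scoring of predicted labels against ground truth.
    hit = (data['prediction'].astype(str).str.strip().str.lower()
           == data['answer'].astype(str).str.strip().str.lower())
    data['hit'] = hit.astype(int)
    dump(data, eval_file.replace('.xlsx', '_score.xlsx'))
    return {'accuracy': float(hit.mean())}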

FangXinyu-0913 self-assigned this on Oct 10, 2025
@Winston-Yuan
Author

Hello, I have contacted you via ailab's Feishu. Would it be convenient to discuss over Feishu?

@SYuan03
Collaborator

SYuan03 commented Oct 16, 2025

Hi @Winston-Yuan , I noticed that the latest code doesn’t seem to include the sub-item evaluations mentioned in the paper. Could you please add support for those additional subsets? That would help us align our evaluations more quickly and merge the PR sooner.

@Winston-Yuan
Author

Our current code supports the eight Step-based task types ('VLRMBench') and Multi-solution Judgment ('VLRMBench_MultiSolution'). Since Criticism-based tasks require invoking additional evaluation models for assessment, we recommend using the code from JCruan519/VLRMBench for those evaluations.

@Winston-Yuan
Author

Is it necessary for us to add a README to explain this in the code?

