Add VLRMBench Support #1259
Conversation
Hi @Winston-Yuan, I ran infer with one subset, but it seems no evaluation was conducted. And it seems that My command:
Hello, I have contacted you through the ailab Feishu. Would it be convenient to discuss this over Feishu?
Hi @Winston-Yuan, I noticed that the latest code doesn't seem to include the sub-item evaluations mentioned in the paper. Could you please add support for those additional subsets? That would help us align our evaluations more quickly and merge the PR sooner.
Our current code supports the eight types of Step-based tasks ('VLRMBench') and Multi-solution Judgment ('VLRMBench_MultiSolution'). Because Criticism-based tasks require calling additional evaluation models for assessment, we recommend using the code from JCruan519/VLRMBench to evaluate them.
Should we add a README to the code to explain this?
Add VLRMBench Support
This PR integrates VLRMBench into VLMEvalKit for evaluating vision-language models on reasoning error detection tasks.
About VLRMBench
VLRMBench is introduced in our paper "VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models".
What it does: VLRMBench evaluates whether models can identify errors in multi-step visual reasoning processes. It includes 12 sub-datasets covering different error types.
Dataset size: 10K+ samples across 12 tasks with visual reasoning chains and error annotations.
Changes Made
New Files
vlmeval/dataset/vlrmbench.py
- VLRMBench dataset implementation with automatic HuggingFace download
Modified Files
vlmeval/dataset/__init__.py
- Register 12 VLRMBench dataset classes
Usage
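A minimal sketch of how an evaluation run might be launched, assuming VLMEvalKit's usual `run.py` entry point with its `--data` and `--model` flags. The two dataset names are the ones quoted in the discussion above; `YOUR_MODEL_NAME` and the helper `build_command` are hypothetical placeholders, not part of this PR.

```python
import shlex

# Dataset names as quoted in the PR discussion (assumed to be the registered names).
DATASETS = ("VLRMBench", "VLRMBench_MultiSolution")


def build_command(model, datasets=DATASETS):
    """Return the argv list for a VLMEvalKit run over the given datasets.

    `model` is a placeholder; substitute a model name that VLMEvalKit supports.
    """
    return ["python", "run.py", "--data", *datasets, "--model", model]


if __name__ == "__main__":
    # Print the command so it can be copied into a shell from the repository root.
    print(shlex.join(build_command("YOUR_MODEL_NAME")))
```

The command is only assembled and printed here rather than executed, so the sketch can be checked without the repository or model weights available.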
Supported Datasets
All 12 sub-datasets are available:
Citation
Winston-Yuan/VLRMBench