# Ref-Adv

Official code for "Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks" (ICLR 2026).
## News

- [2026/01] Ref-Adv accepted to ICLR 2026!
- [2026/01] Evaluation code and model predictions released.
## Introduction

Ref-Adv is a referring expression comprehension (REC) benchmark designed to probe the visual reasoning capabilities of multimodal large language models (MLLMs). Standard REC benchmarks contain shortcuts that let models succeed without genuine visual reasoning. Ref-Adv addresses this by pairing complex referring expressions with hard visual distractors: expressions average 11.5 words, images contain 4.01 distractors on average (every case has at least 2), and 21.25% of expressions use negation.

Ref-Adv-s is the publicly released subset, containing 1,142 cases together with evaluation code and model predictions. The dataset is available on HuggingFace.
## Installation

```bash
git clone https://github.com/dddraxxx/Ref-Adv.git
cd Ref-Adv
pip install -r requirements.txt
```

All model predictions are included in `outputs/qwen/`. You can run `report.py` directly on these files to reproduce the results table without re-running inference.
## Inference

We serve the Qwen VLM series using their official repositories with vLLM. Generation parameters (temperature, top_p, etc.) are specified in `configs/qwen.yaml`.

Start an OpenAI-compatible server first (the served model must match the run's `model_full_name`), then run:

```bash
python run.py \
  --config configs/qwen.yaml \
  --run-name qwen35a35b_direct
```

Output: `outputs/qwen/<run_name>_predictions.jsonl`
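Predictions only count as hits if a bounding box can be parsed from the model's raw output (unparseable responses show up in the report's parse-fail count). The repo's actual parsing lives in `run.py`; as a rough illustration only, here is a hypothetical parser for an `[x1, y1, x2, y2]` box in free-form text — the regex and fallback behavior are assumptions, not the repo's implementation:

```python
import re

# Four comma-separated numbers inside square brackets, e.g. "[120, 45, 310, 280]".
_BOX_RE = re.compile(
    r"\[\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*,"
    r"\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\]"
)

def parse_bbox_xyxy(text: str):
    """Extract the first [x1, y1, x2, y2] box from model output.

    Returns None when no well-formed box is found; such a response
    would be counted as a parse failure.
    """
    m = _BOX_RE.search(text)
    if m is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in m.groups())
    # Reject degenerate boxes (zero or negative width/height).
    if x2 <= x1 or y2 <= y1:
        return None
    return [x1, y1, x2, y2]

print(parse_bbox_xyxy("The target is at [120, 45, 310, 280]."))  # [120.0, 45.0, 310.0, 280.0]
print(parse_bbox_xyxy("I cannot find the object."))              # None
```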
## Evaluation

```bash
python report.py \
  --glob 'outputs/qwen/*_predictions.jsonl' \
  --output-md eval_table.md
```

The report includes Acc@0.5, Acc@0.75, Acc@0.9, parse-fail count, and distractor-bin breakdowns (2-3, 4-6, >=7).
<details>
<summary>Click to expand prediction JSONL fields</summary>

Each line contains:

| Field | Description |
|---|---|
| `row_idx` | Dataset row index |
| `file_name` | Image filename |
| `normal_caption` | Referring expression |
| `image_source` | COCO or OpenImages |
| `human_authored` | Whether the caption is human-written |
| `use_negation` | Whether the caption uses negation |
| `distractor_count` | Number of distractors in the image |
| `gt_bbox_xyxy` | Ground-truth bounding box (absolute xyxy) |
| `pred_box_xyxy_first` | Predicted bounding box |
| `first_iou` | IoU between prediction and ground truth |
| `first_hit` | Whether IoU >= 0.5 |
| `parse_error` | Whether bbox parsing failed |
| `retry_followup_used` | Whether a follow-up retry was used |
| `model_full_name` | Model identifier |
| `prompt_id` | `direct` or `cot` |
| `pred_box_expected_format` | `abs_xyxy` or `norm_1000_xyxy` |

</details>
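To make the coordinate fields concrete, here is a minimal sketch of how a `first_iou` value could be derived from `gt_bbox_xyxy` and a predicted box, including conversion from `norm_1000_xyxy` back to absolute pixels. It assumes `norm_1000_xyxy` means coordinates scaled to a 0-1000 grid (a common Qwen-VL convention); check the repo's evaluation code for the authoritative logic:

```python
def norm1000_to_abs(box, width, height):
    """Convert a norm_1000_xyxy box (0-1000 grid) to absolute pixel xyxy."""
    x1, y1, x2, y2 = box
    return [x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000]

def iou_xyxy(a, b):
    """Intersection-over-union of two absolute xyxy boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

gt = [100, 100, 300, 300]                                   # absolute pixels
pred = norm1000_to_abs([125, 125, 375, 375], 800, 800)      # -> [100, 100, 300, 300]
print(iou_xyxy(gt, pred))  # 1.0, so first_hit would be True at IoU >= 0.5
```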
## Results

See the full results table (all 46 configurations) at ref-adv.github.io/#results.

Best model per Qwen family on Ref-Adv-s (temperature=0.0):
| Model | CoT | Acc@0.5 | Acc@0.75 | Acc@0.9 |
|---|---|---|---|---|
| Human Expert (High)* | -- | 90.3 | -- | -- |
| Human Expert (Medium)* | -- | 80.6 | -- | -- |
| Qwen2.5-VL-72B | | 54.0 | 40.1 | 18.0 |
| Qwen2.5-VL-72B | ✓ | 52.4 | 39.0 | 18.3 |
| Qwen3-VL-235B-A22B-Thinking | ✓ | 67.1 | 53.6 | 31.8 |
| Qwen3-VL-32B-Thinking | ✓ | 65.6 | 52.8 | 31.6 |
| Qwen3-VL-8B-Thinking | ✓ | 59.5 | 48.2 | 27.3 |
| Qwen3-VL-4B-Thinking | ✓ | 57.6 | 45.5 | 27.8 |
| Qwen3.5-397B-A17B-FP8 | ✓ | 68.0 | 55.6 | 34.2 |
| Qwen3.5-122B-A10B | ✓ | 67.2 | 55.0 | 35.1 |
| Qwen3.5-27B | ✓ | 67.3 | 54.9 | 32.7 |
\* Human expert results are evaluated on a randomly selected subset.
## Citation

```bibtex
@article{dong2026refadv,
  title   = {Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks},
  author  = {Qihua Dong and Kuo Yang and Lin Ju and Handong Zhao and Yitian Zhang and Yizhou Wang and Huimin Zeng and Jianglin Lu and Yun Fu},
  year    = {2026},
  journal = {arXiv preprint arXiv:2602.23898}
}
```

## License

Code is released under the Apache 2.0 License. The dataset is available on HuggingFace under its own license.
