# Ref-Adv

Official code for "Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks" (ICLR 2026).
## News

- [2026/01] Ref-Adv accepted to ICLR 2026!
- [2026/01] Evaluation code and model predictions released.
## Introduction

Ref-Adv is a referring expression comprehension (REC) benchmark designed to probe the visual reasoning capabilities of multimodal large language models (MLLMs). Standard REC benchmarks contain shortcuts that let models succeed without genuine visual reasoning. Ref-Adv addresses this by pairing complex referring expressions with hard visual distractors: expressions average 11.5 words, images contain 4.01 distractors on average (every case has at least 2), and 21.25% of expressions use negation.

Ref-Adv-s is the publicly released subset, containing 1,142 cases together with evaluation code and model predictions. The dataset is available on HuggingFace.
## Installation

```bash
git clone https://github.com/dddraxxx/Ref-Adv.git
cd Ref-Adv
pip install -r requirements.txt
```

All model predictions are included in `outputs/qwen/`. You can run `report.py` directly on these files to reproduce the results table without re-running inference.
## Inference

We serve the Qwen VLM series using their official repositories with vLLM. Generation parameters (temperature, top_p, etc.) are specified in `configs/qwen.yaml`.

Start an OpenAI-compatible server first (the served model must match the run's `model_full_name`), then run:

```bash
python run.py \
  --config configs/qwen.yaml \
  --run-name qwen35a35b_direct
```

Output: `outputs/qwen/<run_name>_predictions.jsonl`
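Predictions only count as hits if a bounding box can be parsed from the model's raw output (unparseable responses show up in the report's parse-fail count). The repo's actual parsing lives in `run.py`; as a rough illustration only, here is a hypothetical parser for an `[x1, y1, x2, y2]` box in free-form text — the regex and fallback behavior are assumptions, not the repo's implementation:

```python
import re

# Four comma-separated numbers inside square brackets, e.g. "[120, 45, 310, 280]".
_BOX_RE = re.compile(
    r"\[\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*,"
    r"\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\]"
)

def parse_bbox_xyxy(text: str):
    """Extract the first [x1, y1, x2, y2] box from model output.

    Returns None when no well-formed box is found; such a response
    would be counted as a parse failure.
    """
    m = _BOX_RE.search(text)
    if m is None:
        return None
    x1, y1, x2, y2 = (float(g) for g in m.groups())
    # Reject degenerate boxes (zero or negative width/height).
    if x2 <= x1 or y2 <= y1:
        return None
    return [x1, y1, x2, y2]

print(parse_bbox_xyxy("The target is at [120, 45, 310, 280]."))  # [120.0, 45.0, 310.0, 280.0]
print(parse_bbox_xyxy("I cannot find the object."))              # None
```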
## Evaluation

```bash
python report.py \
  --glob 'outputs/qwen/*_predictions.jsonl' \
  --output-md eval_table.md
```

The report includes Acc@0.5, Acc@0.75, Acc@0.9, parse-fail count, and distractor-bin breakdowns (2-3, 4-6, >=7).
<details>
<summary>Click to expand prediction JSONL fields</summary>

Each line contains:

| Field | Description |
|---|---|
| `row_idx` | Dataset row index |
| `file_name` | Image filename |
| `normal_caption` | Referring expression |
| `image_source` | COCO or OpenImages |
| `human_authored` | Whether the caption is human-written |
| `use_negation` | Whether the caption uses negation |
| `distractor_count` | Number of distractors in the image |
| `gt_bbox_xyxy` | Ground-truth bounding box (absolute xyxy) |
| `pred_box_xyxy_first` | Predicted bounding box |
| `first_iou` | IoU between prediction and ground truth |
| `first_hit` | Whether IoU >= 0.5 |
| `parse_error` | Whether bbox parsing failed |
| `retry_followup_used` | Whether a follow-up retry was used |
| `model_full_name` | Model identifier |
| `prompt_id` | `direct` or `cot` |
| `pred_box_expected_format` | `abs_xyxy` or `norm_1000_xyxy` |

</details>
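To make the coordinate fields concrete, here is a minimal sketch of how a `first_iou` value could be derived from `gt_bbox_xyxy` and a predicted box, including conversion from `norm_1000_xyxy` back to absolute pixels. It assumes `norm_1000_xyxy` means coordinates scaled to a 0-1000 grid (a common Qwen-VL convention); check the repo's evaluation code for the authoritative logic:

```python
def norm1000_to_abs(box, width, height):
    """Convert a norm_1000_xyxy box (0-1000 grid) to absolute pixel xyxy."""
    x1, y1, x2, y2 = box
    return [x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000]

def iou_xyxy(a, b):
    """Intersection-over-union of two absolute xyxy boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

gt = [100, 100, 300, 300]                                   # absolute pixels
pred = norm1000_to_abs([125, 125, 375, 375], 800, 800)      # -> [100, 100, 300, 300]
print(iou_xyxy(gt, pred))  # 1.0, so first_hit would be True at IoU >= 0.5
```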
## Results

See the full results table (all 46 configurations) at ref-adv.github.io/#results.

Best model per Qwen family on Ref-Adv-s (temperature=0.0):
| Model | CoT | Acc@0.5 | Acc@0.75 | Acc@0.9 |
|---|---|---|---|---|
| Human Expert (High)* | -- | 90.3 | -- | -- |
| Human Expert (Medium)* | -- | 80.6 | -- | -- |
| Qwen2.5-VL-72B | | 54.0 | 40.1 | 18.0 |
| Qwen2.5-VL-72B | ✓ | 52.4 | 39.0 | 18.3 |
| Qwen3-VL-235B-A22B-Thinking | ✓ | 67.1 | 53.6 | 31.8 |
| Qwen3-VL-32B-Thinking | ✓ | 65.6 | 52.8 | 31.6 |
| Qwen3-VL-8B-Thinking | ✓ | 59.5 | 48.2 | 27.3 |
| Qwen3-VL-4B-Thinking | ✓ | 57.6 | 45.5 | 27.8 |
| Qwen3.5-397B-A17B-FP8 | ✓ | 68.0 | 55.6 | 34.2 |
| Qwen3.5-122B-A10B | ✓ | 67.2 | 55.0 | 35.1 |
| Qwen3.5-27B | ✓ | 67.3 | 54.9 | 32.7 |
\* Human expert results are evaluated on a randomly selected subset.
## Citation

```bibtex
@article{dong2026refadv,
  title   = {Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks},
  author  = {Qihua Dong and Kuo Yang and Lin Ju and Handong Zhao and Yitian Zhang and Yizhou Wang and Huimin Zeng and Jianglin Lu and Yun Fu},
  year    = {2026},
  journal = {arXiv preprint arXiv:2602.23898}
}
```

## License

Code is released under the Apache 2.0 License. The dataset is available on HuggingFace under its own license.
