Official repository for the paper "SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning".
🌟 For more details, please refer to the project page with dataset exploration and leaderboard: SeePhys Webpage.
[🌐 Webpage] [📖 Paper] [🤗 Huggingface Dataset] [🏆 Leaderboard] [🎉 Challenge]
- 💥 News
- 👀 About SeePhys
- 🏆 Leaderboard
- 💪 Contributing to our Leaderboard
- 🚀 Evaluation with VLMEvalKit
- 📐 Dataset Examples
- 📈 Citation
- 🤝 Contributors
- [2025.09.19] 🚀 SeePhys is accepted by NeurIPS 2025! See you in San Diego :)
- [2025.09.16] 🏆 Technical report of the 1st place solution to our ICML 2025 SeePhys challenge is released! Learn more at the Technical Report.
- [2025.08.11] 🔥 GPT-5 (high) now achieves a new SOTA with 63.2% accuracy, surpassing Gemini 2.5 Pro by 8.3%!
- [2025.08.11] 🔥 We have released the performance of human experts, who surpass Gemini 2.5 Pro by 32.4%!
- [2025.07.11] 💥 Skywork-R1V3 outperforms Qwen2.5-VL-72B, scoring 32.0% on SeePhys! Learn more at the Skywork-R1V3 blog.
- [2025.07.11] 🏆 The competition results of 2nd AI for Math Workshop at ICML 2025 have been announced, with the champion achieving the highest accuracy of 60.56%!
- [2025.07.07] 🚀 We release the full set with ground truth at [🤗 Huggingface Dataset]!
- [2025.05.27] 🔥 The arXiv paper is online!
- [2025.05.24] 🚀 We release the test set without ground truth at [🤗 Huggingface Dataset], and the evaluation code!
- [2025.05.24] 🔥 We release the evaluation code using VLMEvalKit!
- [2025.05.21] 🎉 Our SeePhys is officially open for challenges at the 2nd AI for Math Workshop at ICML 2025!
SeePhys is a full-spectrum multimodal benchmark for evaluating physics reasoning across different knowledge levels.
It comprises 2,000 rigorously validated questions covering a comprehensive range of knowledge levels, from middle school to PhD qualifying exams. These questions span 7 major fields of classical and modern physics. To assess the extent to which models rely on visual information for reasoning, we curate two subsets with different levels of visual-information enrichment, and we additionally compile a supplementary vision-only copy of the 2,000 questions in which the entire problem statement is rendered as an image. Through meticulous selection of 21 diagram types by domain experts, each problem challenges frontier MLLMs to integrate domain knowledge with visual understanding of physics diagrams (e.g., Feynman diagrams for particle interactions and circuit diagrams for electromagnetism).
The figure below showcases Vision-Essential and Vision-Optional examples.
Our experiments reveal that MLLMs face significant challenges on Vision-Essential problems, whereas for Vision-Optional problems, images can still improve a model's problem-solving performance even when they provide only supplementary information.
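To get a feel for the data, the sketch below loads the dataset from Hugging Face. The repository id, split names, and field names are assumptions based on the release described above; please check the dataset card for the actual schema.

```python
# Minimal exploration sketch -- NOT part of the official evaluation pipeline.
# The dataset id "SeePhys/SeePhys" is an assumption; adjust it to the id shown
# on the Hugging Face page linked above.
from datasets import load_dataset

data = load_dataset("SeePhys/SeePhys")
print(data)                     # available splits and column names

first_split = next(iter(data.values()))
sample = first_split[0]
print(sample.keys())            # inspect the actual fields before relying on them
```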
With **SeePhys**, we conduct extensive experiments to evaluate 28 leading LLMs and MLLMs, such as o4-mini and Gemini-2.5-Pro. The results reveal that, even with extensive chain-of-thought reasoning, none of the models evaluated in the paper surpasses 55% accuracy. Our analysis also shows that even non-essential diagrams can enhance physics reasoning performance when presented to MLLMs.

Accuracy scores of LLMs (columns from Mid to PhD correspond to knowledge levels from middle school to PhD qualifying exams; Total is the overall accuracy):
| # | LLMs | Mid | High | BO | AO | UG | SUG | MA | PhD | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Human Expert🥇 | 100.0 | 94.4 | 92.3 | 71.7 | 92.9 | 94.7 | 100.0 | 83.0 | 86.5 |
| 2 | DeepSeek-R1🥈 | 54.9 | 46.9 | 47.7 | 31.9 | 49.9 | 34.2 | 49.0 | 41.2 | 42.2 |
| 3 | DeepSeek-V3🥉 | 53.9 | 42.6 | 36.4 | 22.8 | 45.4 | 29.7 | 35.9 | 37.5 | 36.0 |
| 4 | Qwen3-235B-A22B | 47.1 | 33.7 | 31.8 | 20.4 | 41.2 | 25.1 | 31.7 | 30.7 | 31.1 |
| 5 | QwQ-32B | 47.1 | 42.2 | 44.9 | 15.5 | 40.0 | 20.1 | 32.4 | 24.0 | 29.7 |
| 6 | R1-Distilled-Llama-70B | 48.0 | 41.4 | 34.6 | 14.2 | 31.5 | 16.0 | 28.9 | 25.9 | 26.9 |
| 7 | Llama-4-Scout-17B | 48.0 | 36.5 | 31.8 | 11.3 | 28.5 | 14.2 | 28.3 | 26.1 | 24.8 |
| 8 | Qwen2.5-72B | 41.2 | 40.2 | 25.2 | 8.2 | 26.8 | 12.8 | 18.6 | 17.8 | 21.1 |
| 9 | Gemma3-27B | 21.6 | 36.5 | 30.8 | 5.1 | 23.1 | 9.1 | 15.2 | 11.9 | 16.9 |
| 10 | Llama-3.1-8B | 26.5 | 15.7 | 17.8 | 3.9 | 7.6 | 3.7 | 10.3 | 8.4 | 9.2 |
Accuracy scores of MLLMs:
| # | MLLMs | Mid | High | BO | AO | UG | SUG | MA | PhD | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Human Expert🥇 | 100.0 | 94.4 | 92.3 | 71.7 | 92.9 | 94.7 | 100.0 | 83.0 | 86.5 |
| 2 | GPT-5 (high)🥈 | 75.5 | 70.7 | 70.1 | 55.5 | 65.9 | 63.9 | 60.7 | 60.4 | 63.2 |
| 3 | Gemini-2.5-Pro🥉 | 69.6 | 66.7 | 64.5 | 46.7 | 64.2 | 50.2 | 53.8 | 44.2 | 54.9 |
| 4 | o4-mini | 66.7 | 61.8 | 56.1 | 41.8 | 53.8 | 45.7 | 51.0 | 53.4 | 51.9 |
| 5 | o1 | 60.8 | 56.6 | 50.5 | 32.5 | 54.4 | 40.6 | 52.4 | 40.4 | 45.6 |
| 6 | Doubao-1.5-pro | 70.6 | 58.2 | 49.5 | 29.2 | 56.6 | 34.7 | 40.7 | 37.5 | 43.9 |
| 7 | o3-mini | 47.1 | 46.2 | 39.3 | 28.3 | 47.0 | 36.1 | 48.3 | 42.3 | 40.3 |
| 8 | GPT-4.1 | 51.0 | 52.6 | 41.1 | 17.0 | 39.7 | 31.1 | 42.1 | 35.6 | 35.3 |
| 9 | Claude-3.7-Sonnet | 52.9 | 51.8 | 43.0 | 16.7 | 41.4 | 26.5 | 33.8 | 32.4 | 34.6 |
| 10 | Qwen2.5-VL-72B-Inst | 61.8 | 42.2 | 29.0 | 10.4 | 29.9 | 14.6 | 18.6 | 19.4 | 24.2 |
| 11 | QVQ-72b-preview | 38.2 | 36.5 | 30.8 | 11.3 | 25.9 | 14.2 | 26.2 | 20.2 | 22.5 |
| 12 | GPT-4o | 37.3 | 39.0 | 34.6 | 7.5 | 23.4 | 15.5 | 24.1 | 21.8 | 21.9 |
| 13 | Llama-3.2-90B-Vision | 21.6 | 25.7 | 22.4 | 3.9 | 9.3 | 10.0 | 12.4 | 8.9 | 11.7 |
| 14 | Qwen2.5-VL-7B-Inst | 39.2 | 25.3 | 21.5 | 4.2 | 8.7 | 5.9 | 10.3 | 7.3 | 11.6 |
| 15 | Qwen2.5-VL-3B-Inst | 30.4 | 21.3 | 13.1 | 2.9 | 10.4 | 7.3 | 6.2 | 6.2 | 9.8 |
| 16 | Qwen2-VL-7B-Inst | 24.5 | 17.3 | 14.0 | 4.4 | 8.5 | 4.6 | 10.3 | 7.0 | 9.2 |
| 17 | LLaVA-NeXT-7B | 14.5 | 12.7 | 11.2 | 5.5 | 13.2 | 8.2 | 11.0 | 9.4 | 8.7 |
| 18 | Llama3.2-11B-Vision | 23.5 | 18.5 | 14.0 | 4.2 | 5.4 | 3.7 | 4.8 | 7.5 | 8.3 |
| 19 | Phi-4-multimodal | 20.6 | 12.4 | 12.1 | 4.4 | 7.0 | 5.0 | 8.3 | 4.9 | 7.6 |
| 20 | InternVL2.5-8B | 17.6 | 12.4 | 9.3 | 2.9 | 5.6 | 3.2 | 4.1 | 5.1 | 6.2 |
| 21 | LLaVA-OneVision-7B | 20.6 | 10.8 | 12.1 | 2.7 | 5.4 | 2.3 | 6.2 | 5.4 | 6.1 |
Our SeePhys is now open for submissions at the ICML 2025 Challenge on Automated Math Reasoning and Extensions! To evaluate your model, please submit benchmark results to our website following the official guidelines.
We strongly encourage all participants to concurrently submit their technical reports to the ICML 2025 AI for Math Workshop.
- Submission Guidelines: https://www.codabench.org/competitions/7925/
- Challenge Details: https://sites.google.com/view/ai4mathworkshopicml2025/challenge
- Workshop Details: https://sites.google.com/view/ai4mathworkshopicml2025/home
We provide evaluation code for SeePhys based on VLMEvalKit. To ensure fairness in the challenge, the ground truth was initially withheld and has since been released publicly (see the News above). Create the environment:
cd seephys-project
pip install -e .
Set your API key:
export OPENAI_API_KEY=
export OPENAI_API_BASE=
Then run the evaluation script (we used 8 GPUs):
#!/bin/bash
set -x

# Detect the number of available GPUs and point VLMEvalKit to the data directory.
export GPU=$(nvidia-smi --list-gpus | wc -l)
export LMUData=/LMUData

torchrun --nproc-per-node=${GPU} run.py \
    --model Qwen2.5-VL-7B-Instruct \
    --data SeePhys \
    --api-nproc 32 \
    --work-dir /work_dir \
    --judge deepseek \
    --judge-args '{"valid_type": "LLM"}' \
    --reuse
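The official scoring relies on the LLM judge configured above (`--judge deepseek`), not string matching. If you only want a rough offline sanity check of your own predictions before judging or submitting, the sketch below compares normalized final answers; the file name and the `prediction`/`answer` fields are hypothetical and should be adapted to however you store your outputs.

```python
# Rough exact-match sanity check -- NOT the official LLM-judge evaluation.
# Assumes a JSON list of records with hypothetical "prediction" and "answer" fields.
import json
import re

def normalize(ans: str) -> str:
    """Lower-case, drop LaTeX-style braces/dollars/backslashes, collapse whitespace."""
    ans = ans.strip().lower()
    ans = re.sub(r"[\s$\\{}]+", " ", ans)
    return ans.strip()

def exact_match_accuracy(pred_file: str) -> float:
    with open(pred_file) as f:
        records = json.load(f)
    hits = sum(normalize(r["prediction"]) == normalize(r["answer"]) for r in records)
    return hits / len(records)

if __name__ == "__main__":
    print(f"exact-match accuracy: {exact_match_accuracy('predictions.json'):.3f}")
```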
Examples of problems across different subjects, knowledge levels, and degrees of visual-information enrichment.
If you find SeePhys useful for your research and applications, please kindly cite using this BibTeX:
@article{xiang2025seephys,
title={SeePhys: Does Seeing Help Thinking?--Benchmarking Vision-Based Physics Reasoning},
author={Kun Xiang and Heng Li and Terry Jingchen Zhang and Yinya Huang and Zirong Liu and Peixin Qu and Jixi He and Jiaqi Chen and Yu-Jie Yuan and Jianhua Han and Hang Xu and Hanhui Li and Mrinmaya Sachan and Xiaodan Liang},
journal={arXiv preprint arXiv:2505.19099},
year={2025}
}

Here are the key contributors to this project:
Kun Xiang\*<sup>1</sup>, Heng Li\*<sup>1</sup>, Terry Jingchen Zhang\*<sup>2</sup>, Yinya Huang\*<sup>2</sup>, Zirong Liu<sup>1</sup>, Peixin Qu<sup>1</sup>, Jixi He<sup>1</sup>, Jiaqi Chen<sup>4</sup>, Yu-Jie Yuan<sup>3</sup>, Jianhua Han<sup>3</sup>, Hang Xu<sup>3</sup>, Hanhui Li<sup>1</sup>, Mrinmaya Sachan<sup>2</sup>, Xiaodan Liang<sup>1</sup>

<sup>1</sup> Sun Yat-sen University, <sup>2</sup> ETH Zurich, <sup>3</sup> Huawei Noah's Ark Lab, <sup>4</sup> The University of Hong Kong