
[NeurIPS 2025] SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning



Official repository for the paper "SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning".

🌟 For more details, please refer to the project page with dataset exploration and leaderboard: SeePhys Webpage.

[🌐 Webpage] [📖 Paper] [🤗 Huggingface Dataset] [🏆 Leaderboard] [🎉 Challenge]


💥 News

  • [2025.09.19] 🚀 SeePhys is accepted by NeurIPS 2025! See you in San Diego :)
  • [2025.09.16] 🏆 Technical report of the 1st place solution to our ICML 2025 SeePhys challenge is released! Learn more at the Technical Report.
  • [2025.08.11] 🔥 GPT-5 (high) now achieves a new SOTA with 63.2% accuracy, surpassing Gemini 2.5 Pro by 8.3%!
  • [2025.08.11] 🔥 We have released the performance of human experts, who surpass Gemini 2.5 Pro by 32.4%!
  • [2025.07.11] 💥 Skywork-R1V3 outperforms Qwen2.5-VL-72B with 32.0% on SeePhys! Learn more at the Skywork-R1V3 blog.
  • [2025.07.11] 🏆 The competition results of 2nd AI for Math Workshop at ICML 2025 have been announced, with the champion achieving the highest accuracy of 60.56%!
  • [2025.07.07] 🚀 We release the full set with ground truth at [🤗 Huggingface Dataset]!
  • [2025.05.27] 🔥 The arXiv paper is online!
  • [2025.05.24] 🚀 We release the test set without ground truth at [🤗 Huggingface Dataset], and the evaluation code!
  • [2025.05.24] 🔥 We release the evaluation code using VLMEvalKit!
  • [2025.05.21] 🎉 Our SeePhys is officially open for challenges at the 2nd AI for Math Workshop at ICML 2025!

👀 About SeePhys

SeePhys is a full-spectrum multimodal benchmark for evaluating physics reasoning across different knowledge levels.


It comprises 2,000 rigorously validated questions covering a comprehensive range of knowledge levels, from middle school to PhD qualifying exams. These questions span 7 major fields of classical and modern physics. To assess the extent to which models rely on visual information for reasoning, we curate two subsets with different degrees of visual-information enrichment and additionally compile a supplementary copy of 2,000 purely visual instances in which all textual problem statements are rendered as images. Through meticulous selection of 21 diagram types by domain experts, each problem challenges frontier MLLMs to integrate domain knowledge with visual understanding of physics diagrams (e.g., Feynman diagrams for particle interactions and circuit diagrams for electromagnetism).
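
To browse the raw questions and diagrams locally, the data can be pulled from the Hugging Face Hub. A minimal sketch (the dataset id below is an assumption; use the id behind the 🤗 Huggingface Dataset link above):

```bash
# Sketch: pull the benchmark data for local inspection.
# NOTE: "SeePhys/SeePhys" is an assumed dataset id; check the Hugging Face link above for the actual one.
pip install -U "huggingface_hub[cli]"
huggingface-cli download SeePhys/SeePhys --repo-type dataset --local-dir ./SeePhys_data
```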


The figure below showcases examples of Vision-Essential and Vision-Optional samples.


Our experiments reveal that MLLMs encounter significant challenges in solving Vision-Essential problems. For Vision-Optional problems, even when images provide only supplementary information, they can still enhance the model's problem-solving performance.


With **SeePhys**, we conduct extensive experiments to evaluate 28 leading LLMs and MLLMs, such as o4-mini and Gemini-2.5-Pro. The results reveal that even with extensive chain-of-thought reasoning, none of the evaluated models surpasses 55% accuracy. Our analysis also shows that even non-essential diagrams can enhance physics reasoning performance when presented to MLLMs.

🏆 Leaderboard on SeePhys (2000 samples)

Accuracy scores of LLMs:

| # | LLMs | Mid | High | BO | AO | UG | SUG | MA | PhD | Total |
|---|------|-----|------|----|----|----|-----|----|-----|-------|
| 1 | Human Expert🥇 | 100.0 | 94.4 | 92.3 | 71.7 | 92.9 | 94.7 | 100.0 | 83.0 | 86.5 |
| 2 | DeepSeek-R1🥈 | 54.9 | 46.9 | 47.7 | 31.9 | 49.9 | 34.2 | 49.0 | 41.2 | 42.2 |
| 3 | DeepSeek-V3🥉 | 53.9 | 42.6 | 36.4 | 22.8 | 45.4 | 29.7 | 35.9 | 37.5 | 36.0 |
| 4 | Qwen3-235B-A22B | 47.1 | 33.7 | 31.8 | 20.4 | 41.2 | 25.1 | 31.7 | 30.7 | 31.1 |
| 5 | QwQ-32B | 47.1 | 42.2 | 44.9 | 15.5 | 40.0 | 20.1 | 32.4 | 24.0 | 29.7 |
| 6 | R1-Distilled-Llama-70B | 48.0 | 41.4 | 34.6 | 14.2 | 31.5 | 16.0 | 28.9 | 25.9 | 26.9 |
| 7 | Llama-4-Scout-17B | 48.0 | 36.5 | 31.8 | 11.3 | 28.5 | 14.2 | 28.3 | 26.1 | 24.8 |
| 8 | Qwen2.5-72B | 41.2 | 40.2 | 25.2 | 8.2 | 26.8 | 12.8 | 18.6 | 17.8 | 21.1 |
| 9 | Gemma3-27B | 21.6 | 36.5 | 30.8 | 5.1 | 23.1 | 9.1 | 15.2 | 11.9 | 16.9 |
| 10 | Llama-3.1-8B | 26.5 | 15.7 | 17.8 | 3.9 | 7.6 | 3.7 | 10.3 | 8.4 | 9.2 |

Accuracy scores of MLLMs:

| # | MLLMs | Mid | High | BO | AO | UG | SUG | MA | PhD | Total |
|---|-------|-----|------|----|----|----|-----|----|-----|-------|
| 1 | Human Expert🥇 | 100.0 | 94.4 | 92.3 | 71.7 | 92.9 | 94.7 | 100.0 | 83.0 | 86.5 |
| 2 | GPT-5 (high)🥈 | 75.5 | 70.7 | 70.1 | 55.5 | 65.9 | 63.9 | 60.7 | 60.4 | 63.2 |
| 3 | Gemini-2.5-Pro🥉 | 69.6 | 66.7 | 64.5 | 46.7 | 64.2 | 50.2 | 53.8 | 44.2 | 54.9 |
| 4 | o4-mini | 66.7 | 61.8 | 56.1 | 41.8 | 53.8 | 45.7 | 51.0 | 53.4 | 51.9 |
| 5 | o1 | 60.8 | 56.6 | 50.5 | 32.5 | 54.4 | 40.6 | 52.4 | 40.4 | 45.6 |
| 6 | Doubao-1.5-pro | 70.6 | 58.2 | 49.5 | 29.2 | 56.6 | 34.7 | 40.7 | 37.5 | 43.9 |
| 7 | o3-mini | 47.1 | 46.2 | 39.3 | 28.3 | 47.0 | 36.1 | 48.3 | 42.3 | 40.3 |
| 8 | GPT-4.1 | 51.0 | 52.6 | 41.1 | 17.0 | 39.7 | 31.1 | 42.1 | 35.6 | 35.3 |
| 9 | Claude-3.7-Sonnet | 52.9 | 51.8 | 43.0 | 16.7 | 41.4 | 26.5 | 33.8 | 32.4 | 34.6 |
| 10 | Qwen2.5-VL-72B-Inst | 61.8 | 42.2 | 29.0 | 10.4 | 29.9 | 14.6 | 18.6 | 19.4 | 24.2 |
| 11 | QVQ-72b-preview | 38.2 | 36.5 | 30.8 | 11.3 | 25.9 | 14.2 | 26.2 | 20.2 | 22.5 |
| 12 | GPT-4o | 37.3 | 39.0 | 34.6 | 7.5 | 23.4 | 15.5 | 24.1 | 21.8 | 21.9 |
| 13 | Llama-3.2-90B-Vision | 21.6 | 25.7 | 22.4 | 3.9 | 9.3 | 10.0 | 12.4 | 8.9 | 11.7 |
| 14 | Qwen2.5-VL-7B-Inst | 39.2 | 25.3 | 21.5 | 4.2 | 8.7 | 5.9 | 10.3 | 7.3 | 11.6 |
| 15 | Qwen2.5-VL-3B-Inst | 30.4 | 21.3 | 13.1 | 2.9 | 10.4 | 7.3 | 6.2 | 6.2 | 9.8 |
| 16 | Qwen2-VL-7B-Inst | 24.5 | 17.3 | 14.0 | 4.4 | 8.5 | 4.6 | 10.3 | 7.0 | 9.2 |
| 17 | LLaVA-NeXT-7B | 14.5 | 12.7 | 11.2 | 5.5 | 13.2 | 8.2 | 11.0 | 9.4 | 8.7 |
| 18 | Llama3.2-11B-Vision | 23.5 | 18.5 | 14.0 | 4.2 | 5.4 | 3.7 | 4.8 | 7.5 | 8.3 |
| 19 | Phi-4-multimodal | 20.6 | 12.4 | 12.1 | 4.4 | 7.0 | 5.0 | 8.3 | 4.9 | 7.6 |
| 20 | InternVL2.5-8B | 17.6 | 12.4 | 9.3 | 2.9 | 5.6 | 3.2 | 4.1 | 5.1 | 6.2 |
| 21 | LLaVA-OneVision-7B | 20.6 | 10.8 | 12.1 | 2.7 | 5.4 | 2.3 | 6.2 | 5.4 | 6.1 |

💪Contributing to our Leaderboard


Our SeePhys is now open for submissions at the ICML 2025 Challenge on Automated Math Reasoning and Extensions! To evaluate your model, please submit benchmark results to our website following the official guidelines.

We strongly encourage all participants to concurrently submit their technical reports to the ICML 2025 AI for Math Workshop.

🚀 Evaluation with VLMEvalKit

We provide the evaluation code for SeePhys based on VLMEvalKit. To ensure fairness in the challenge, the ground truth will be released to the public on July 18th. First, create the environment:

cd seephys-project
pip install -e .
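
The step above assumes the repository is already available locally as `seephys-project`. A minimal sketch of obtaining it first (the target directory name simply mirrors the snippet above):

```bash
# Clone the official repository into a directory matching the snippet above
git clone https://github.com/AI4Phys/SeePhys.git seephys-project
cd seephys-project
# Install the VLMEvalKit-based evaluation code in editable mode
pip install -e .
```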

Set your API key and base URL:

# Fill in your OpenAI-compatible API credentials (key and endpoint)
export OPENAI_API_KEY=
export OPENAI_API_BASE=

Then run the script (the example below uses all visible GPUs, e.g., 8):

#!/bin/bash
set -x
# Count the visible GPUs and point VLMEvalKit at its data directory
export GPU=$(nvidia-smi --list-gpus | wc -l)
export LMUData=/LMUData
torchrun --nproc-per-node=${GPU} run.py --model Qwen2.5-VL-7B-Instruct \
    --data SeePhys \
    --api-nproc 32 \
    --work-dir /work_dir \
    --judge deepseek \
    --judge-args '{"valid_type": "LLM"}' \
    --reuse
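
For API-served models, no multi-GPU `torchrun` launch is needed. A minimal sketch (the model alias below is an assumption; use whichever alias your VLMEvalKit installation registers for the model under test):

```bash
# Evaluate an API model on SeePhys; "GPT4o" is an assumed VLMEvalKit model alias
python run.py --model GPT4o --data SeePhys \
    --api-nproc 32 --work-dir /work_dir \
    --judge deepseek --judge-args '{"valid_type": "LLM"}' --reuse
```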

📐 Dataset Examples

Examples of problem variants across different subjects, knowledge levels, and degrees of visual-information enrichment.


📈 Citation

If you find SeePhys useful for your research and applications, please kindly cite using this BibTeX:

@article{xiang2025seephys,
  title={SeePhys: Does Seeing Help Thinking?--Benchmarking Vision-Based Physics Reasoning},
  author={Kun Xiang*, Heng Li*, Terry Jingchen Zhang*, Yinya Huang*, Zirong Liu, Peixin Qu, Jixi He, Jiaqi Chen, Yu-Jie Yuan, Jianhua Han, Hang Xu, Hanhui Li, Mrinmaya Sachan, Xiaodan Liang},
  journal={arXiv preprint arXiv:2505.19099},
  year={2025}
}

🤝 Contributors

Here are the key contributors to this project:

Kun Xiang*1, Heng Li*1, Terry Jingchen Zhang*2, Yinya Huang*2, Zirong Liu1, Peixin Qu1, Jixi He1, Jiaqi Chen4, Yu-Jie Yuan3, Jianhua Han3, Hang Xu3, Hanhui Li1, Mrinmaya Sachan2, Xiaodan Liang1

1 Sun Yat-sen University 2 ETH Zurich 3 Huawei Noah's Ark Lab 4 The University of Hong Kong
