Autonomous research loop for improving Qwen/Qwen3-VL-4B-Instruct on the official HuggingFaceM4/DocumentVQA benchmark.
The repo is designed for agentic training research: the benchmark and evaluator stay fixed, while an agent iterates on train.py, runs training, measures the result on the full validation split, and keeps only real gains. The project is inspired by karpathy/autoresearch, but scoped to a concrete public VLM benchmark with a reproducible contract.
If this project is useful for your research, evals, or agent workflows, please star the repo.
This repository now contains the two previously separate codebases as different branches:
| Branch | Target hardware | Status |
|---|---|---|
| `main` | NVIDIA / CUDA multi-GPU | Primary branch. Uses torchrun, supports DeepSpeed configs, and is the recommended branch for fast experiment cycles. |
| `mlx` | Apple Silicon / MPS | Historical branch imported from the former autoresearch-qwen-mlx repository and preserved here as the Mac-focused variant. |
Use the README on each branch for branch-specific commands. On `main` the entrypoint is `./run_experiment.sh`; on `mlx` it is `uv run python run_experiment.py`.
- Fixed benchmark: full official DocVQA `train`, `validation`, and `test` splits
- Fixed evaluator: validation score is always computed by the repository evaluator
- One mutable surface: agents are expected to edit `train.py`
- Reproducible loop: prepare, train, evaluate, keep or discard, repeat
- Public benchmark mindset: improvements should come from better training decisions, not from moving the goalposts
| Component | Contract |
|---|---|
| Base model | Qwen/Qwen3-VL-4B-Instruct |
| Dataset | HuggingFaceM4/DocumentVQA official splits |
| Training split | Full train split |
| Validation split | Full validation split |
| Test split | Full blind test split |
| Metric | Mean ANLS on the full validation split |
| Mutable file | `train.py` |
| Fixed files | `evaluate.py`, `src/`, benchmark contract, submission tooling |
More benchmark details are documented in benchmarks/README.md.
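For orientation, the official splits can be loaded directly from the Hub with the `datasets` library. This is an illustrative sketch only; `prepare.py` is the supported entrypoint for downloading the dataset and model snapshot:

```python
from datasets import load_dataset

# Illustrative only: prepare.py handles the real download and caching.
ds = load_dataset("HuggingFaceM4/DocumentVQA")
for split in ("train", "validation", "test"):
    print(split, len(ds[split]))  # the contract uses each split in full
```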
```
prepare.py          Download dataset + model snapshot
        |
train.py            Mutable training code (the agent edits this)
        |
evaluate.py         Fixed validation evaluator / blind test exporter
        |
run_experiment.sh   One full train -> eval iteration on main
```
The only objective is to maximize val_score, defined as mean ANLS on the full official validation split.
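For intuition, ANLS (Average Normalized Levenshtein Similarity) is the standard DocVQA metric: each prediction scores its best similarity against the reference answers, and matches whose normalized edit distance reaches the 0.5 threshold score zero. The repository's fixed evaluator is authoritative; the sketch below only illustrates the convention:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(prediction: str, answers: list[str], tau: float = 0.5) -> float:
    """Best thresholded similarity of one prediction against its reference answers."""
    p = prediction.strip().lower()
    best = 0.0
    for answer in answers:
        g = answer.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)  # normalized distance
        if nl < tau:  # matches at or beyond the threshold score zero
            best = max(best, 1.0 - nl)
    return best

# val_score = mean of anls(...) over every validation question
```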
Quick start (on the `main` branch):

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh            # install uv
uv sync                                                    # install dependencies
uv run python prepare.py                                   # download dataset + model snapshot
uv run autoresearch-qwen doctor                            # environment check
uv run python evaluate.py --base-only --split validation   # record the base-model baseline
./run_experiment.sh | tee run.log                          # one full train -> eval iteration
```

Useful follow-up commands:
```bash
uv run python analysis.py      # plot experiment progress
uv run python submit_test.py   # export blind test predictions
```

Repository layout:

```
train.py                  Mutable training code
evaluate.py               Fixed evaluator
run_experiment.sh         One-command train -> eval pipeline on main
analysis.py               Result visualization
prepare.py                Dataset + model downloader
submit_test.py            Blind test export + submission packaging
check_submission.py       Submission validator
program.md                Full agent protocol
benchmarks/README.md      Benchmark definition
configs/                  DeepSpeed configs for multi-GPU runs
src/autoresearch_qwen/    Fixed library code
```
The full experiment protocol lives in program.md. A practical starting prompt is:
```
Read the entire repository, especially README.md and program.md. You may read
all files for context, but only edit train.py. Run `uv run autoresearch-qwen
doctor --json`, record a `--base-only` validation baseline, then start the
autoresearch loop. Parse `artifacts/last_result.json` after each run and keep
only changes that improve val_score.
```
- `artifacts/last_result.json` stores the latest train/eval result payload (parsed in the sketch below)
- `analysis.py` plots experiment progress from accumulated results
- `submit_test.py` exports predictions for the blind DocVQA `test` split
- `check_submission.py` validates a submission bundle locally before upload
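A minimal sketch of the keep-or-discard step the starting prompt describes, assuming the result payload exposes a `val_score` field and that kept versions of `train.py` are committed so git can restore them; check `artifacts/last_result.json` from a real run and `program.md` for the actual schema and protocol:

```python
import json
import subprocess
from pathlib import Path

def train_and_score() -> float:
    """Run the fixed train -> eval pipeline, then read the validation score."""
    subprocess.run(["./run_experiment.sh"], check=True)
    payload = json.loads(Path("artifacts/last_result.json").read_text())
    return payload["val_score"]  # assumed key name, not a documented schema

best = train_and_score()  # baseline for the current train.py
for attempt in range(10):
    # ... the agent edits train.py here (the only mutable file) ...
    score = train_and_score()
    if score > best:
        best = score  # real gain: keep the edit
    else:
        # no gain: restore the last committed train.py
        subprocess.run(["git", "checkout", "--", "train.py"], check=True)
```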
Issues and pull requests are welcome, especially for:
- stronger training recipes that respect the benchmark contract
- better experiment tooling and reproducibility
- clearer docs and onboarding
- hardware-specific improvements that belong on a dedicated branch
If you want to change the benchmark contract itself, open an issue first so the rationale is explicit.
- karpathy/autoresearch for the original autonomous research-loop framing
- Qwen for the base vision-language model
- Hugging Face M4 for the public DocVQA dataset release