ChessBench

ChessBench

Benchmarking LLMs' ability to generate strong chess engines in a standardized harness.

Project Structure

Harness & Generation (adapters/, tools/, harness/): The core structure to elicit engine generations from a wide variety of supported LLM providers. It places models in a standardized harness equipped with confined file tools and a compiler tool to yield valid C++ engines.
Competition Runner (competition/): The logic to run an ongoing round-robin tournament. Compiled engines are practically ELO-tested against each other to evaluate their playing strength.
Local GUI (web-chess-gui/): A small, local web application for quickly viewing games, stats, and results from the internal SQLite competition database.
Benchmark Website (benchmark/): Contains the full source code for the public benchmark website, chessbench.live.

Generate an engine

uv run llm-chess generate --provider openai --model gpt-5.5 --run-id smoke-001

Runs are isolated under generations/[provider-model]/[run-id]. The generated engine must be C++ with a root Makefile; the harness exposes only confined file tools and a compile_engine tool that runs make in the run directory.

Adapters stream by default, which keeps high token-cap runs compatible with providers that require streaming for long outputs. Use --no-stream only when debugging a non-streaming provider path.

Large source files should be written in chunks with write_file followed by append_file. This avoids provider output limits truncating a tool call midway through a large file body. Use --max-output-tokens to raise the per-response output budget when the provider/model supports it.

Provider defaults:

openai: OPENAI_API_KEY
anthropic: ANTHROPIC_API_KEY
gemini: GEMINI_API_KEY
deepseek: DEEPSEEK_API_KEY, https://api.deepseek.com
kimi / moonshot: MOONSHOT_API_KEY, https://api.moonshot.ai/v1

Run the round robin

uv run llm-chess compete --forever --movetime-ms 100

The competition runner discovers compiled generated engines from generations/, schedules the least-played ordered pairing next, validates moves with python-chess, and persists all engine metadata, games, moves, PGN, raw UCI output, time-control config, and aggregate scores to results/competition.sqlite3.

Useful smoke run:

uv run llm-chess compete --max-games 2 --movetime-ms 50

Clock mode is also supported:

uv run llm-chess compete --forever --clock-ms 60000 --increment-ms 500 --move-overhead-ms 20

For overnight runs, pass multiple repeatable --time-control values. They cycle by persisted game count, so restarting the runner continues the rotation:

uv run llm-chess compete --forever \
  --time-control movetime:50 \
  --time-control movetime:200 \
  --time-control clock:60000+500

Openings are randomized from competition/openings.txt by default. Each non-empty line is a space-separated UCI move sequence, and the selected opening line, source, and resulting FEN are persisted with the game. Use --openings-file path/to/book.txt to swap books or --no-openings to disable this.

Before an engine receives opening positions, the runner probes whether it can return a legal move from a non-start FEN. Engines that fail still play in the round robin, but from startpos only; the skip reason is stored in SQLite and PGN as OpeningSkipped.

Use --max-plies, --handshake-timeout-seconds, and --move-timeout-seconds to keep malformed generated engines from blocking the loop.

Build the ELO leaderboard

uv run llm-chess leaderboard

The leaderboard TUI reads results/competition.sqlite3, lets you set known anchor ratings for engines such as Stockfish, and saves results/elo_anchors.json, results/elo_leaderboard.json, and results/elo_leaderboard.md. You can also run it non-interactively:

uv run llm-chess leaderboard --anchor stockfish=3200 --no-tui

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
adapters		adapters
benchmark		benchmark
competition		competition
harness		harness
tests		tests
tools		tools
web-chess-gui		web-chess-gui
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
leaderboard_tui.py		leaderboard_tui.py
main.py		main.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ChessBench

Project Structure

Generate an engine

Run the round robin

Build the ELO leaderboard

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ChessBench

Project Structure

Generate an engine

Run the round robin

Build the ELO leaderboard

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages