AlphaZero-style self-play training pipeline for chess, built with PyTorch and optimized for Apple Silicon (MPS).
- Python 3.12+
- uv package manager
- Stockfish (for ELO evaluation only)
- macOS with Apple Silicon recommended (MPS acceleration)
```bash
cd training
uv sync
```

The neural network is a ResNet with dual heads (policy + value), trained via self-play:
| Component | Details |
|---|---|
| Input | 18 channels x 8x8 (pieces, castling, en passant, side to move) |
| Residual blocks | 6 blocks, 128 filters each |
| Policy head | 4096 outputs (64 from-squares x 64 to-squares) |
| Value head | 1 output, tanh activation [-1, 1] |
| Parameters | ~2M |
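The architecture in the table above can be sketched as a minimal PyTorch module. This is an illustrative reconstruction, not the code in model.py — the exact head widths and layer ordering are assumptions; only the input planes (18), block count (6), filter count (128), policy size (4096), and tanh value output come from the table:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Standard two-conv residual block with batch norm."""
    def __init__(self, filters: int):
        super().__init__()
        self.conv1 = nn.Conv2d(filters, filters, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(filters)
        self.conv2 = nn.Conv2d(filters, filters, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(filters)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # skip connection

class ChessNetSketch(nn.Module):
    def __init__(self, num_blocks: int = 6, filters: int = 128):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(18, filters, 3, padding=1, bias=False),  # 18 input planes
            nn.BatchNorm2d(filters), nn.ReLU())
        self.blocks = nn.Sequential(*[ResBlock(filters) for _ in range(num_blocks)])
        # Policy head: one logit per (from-square, to-square) pair = 64 * 64 = 4096
        self.policy = nn.Sequential(nn.Conv2d(filters, 2, 1), nn.Flatten(),
                                    nn.Linear(2 * 64, 4096))
        # Value head: scalar squashed to [-1, 1] by tanh
        self.value = nn.Sequential(nn.Conv2d(filters, 1, 1), nn.Flatten(),
                                   nn.Linear(64, 64), nn.ReLU(),
                                   nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):
        h = self.blocks(self.stem(x))
        return self.policy(h), self.value(h)
```

With these head sizes the parameter count lands in the ~2M range quoted above, dominated by the residual trunk and the policy head's final linear layer.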
```bash
uv run python -u train.py --verbose-self-play
```

Use -u for unbuffered output so logs appear in real time. Use --verbose-self-play to see per-game progress during self-play.
With --verbose-self-play, you'll see per-game results as they complete:
```
Game 1/20: 1-0 in 87 moves (CHECKMATE) - 42.3s
Game 2/20: 1/2-1/2 in 124 moves (STALEMATE) - 58.1s
...

--- Self-Play Summary ---
Games played: 20
Total positions: 1965
Average game length: 98.2 moves
Results: White +8, Black +7, Draws =5
Total time: 1153.1s (57.7s per game)
```
Without it, you only see a summary line after each full iteration completes.
Use --quiet / -q to suppress all console output.
```bash
uv run python -u train.py \
    --verbose-self-play \
    --iterations 80000 \
    --games-per-iter 100 \
    --batch-size 256 \
    --mcts-sims 400 \
    --num-blocks 6 \
    --num-filters 128
```

Training all 80K iterations in one run takes a very long time. A practical approach is to train in stages, resuming from each checkpoint:
Stage 1 — Beginner (target ~1000 ELO)
```bash
uv run python -u train.py \
    --verbose-self-play \
    --iterations 500 \
    --games-per-iter 20 \
    --mcts-sims 100 \
    --save-every 50
```

At ~1000 ELO the model should avoid blundering pieces, make basic captures, and play legal-looking chess. Evaluate early checkpoints against Stockfish depth 1.
Stage 2 — Intermediate (target ~1200 ELO)
```bash
uv run python -u train.py \
    --verbose-self-play \
    --resume checkpoints/checkpoint_500.pt \
    --iterations 2500 \
    --games-per-iter 20 \
    --mcts-sims 100 \
    --save-every 250
```

At ~1200 ELO the model should have basic tactical awareness (pins, forks), develop pieces, and avoid trivial draws. Evaluate against Stockfish depth 1-2.
Stage 3 — Club Player (target ~1500 ELO)
```bash
uv run python -u train.py \
    --verbose-self-play \
    --resume checkpoints/checkpoint_2500.pt \
    --iterations 5000 \
    --games-per-iter 30 \
    --mcts-sims 150 \
    --save-every 500
```

Stage 4 — Advanced (target ~2000 ELO)
```bash
uv run python -u train.py \
    --verbose-self-play \
    --resume checkpoints/checkpoint_5000.pt \
    --iterations 30000 \
    --games-per-iter 50 \
    --mcts-sims 200 \
    --save-every 1000
```

Stage 5 — Master (target ~2200 ELO)
```bash
uv run python -u train.py \
    --verbose-self-play \
    --resume checkpoints/checkpoint_30000.pt \
    --iterations 80000 \
    --games-per-iter 100 \
    --mcts-sims 400
```

You can increase --games-per-iter and --mcts-sims at later stages since the model benefits more from stronger self-play as it improves.
The training loop writes a CSV log to checkpoints/training_log.csv with per-iteration metrics. Use this to diagnose training health:
| Metric | Healthy Sign | Problem Sign |
|---|---|---|
| `checkmates` | Increasing over time | Stuck at 0 after many iterations |
| `repetition_draws` | Decreasing or low fraction | 100% of games end in repetition |
| `avg_game_length` | 40-150 moves, not monotonically decreasing | Collapsing to <40 moves (repetition collapse) |
| `value_loss` | >0.01, meaningfully contributing to total loss | Near 0 (all games are draws, value head starved) |
| `white_wins` / `black_wins` | Both >0, roughly balanced | Both stuck at 0 (no decisive games) |
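A quick script can flag these problem signs automatically. This is a sketch, not part of the pipeline, and it assumes the CSV column names match the metric names above (checkmates, repetition_draws, avg_game_length, value_loss) — check the actual header of your training_log.csv first:

```python
import csv
from pathlib import Path

def health_check(path: str = "checkpoints/training_log.csv",
                 window: int = 10) -> list[str]:
    """Return warning strings based on the last `window` logged iterations."""
    with Path(path).open(newline="") as f:
        recent = list(csv.DictReader(f))[-window:]
    warnings = []
    if sum(float(r["checkmates"]) for r in recent) == 0:
        warnings.append("no checkmates in recent iterations")
    avg_len = sum(float(r["avg_game_length"]) for r in recent) / len(recent)
    if avg_len < 40:
        warnings.append(f"possible repetition collapse (avg game length {avg_len:.1f})")
    if float(recent[-1]["value_loss"]) < 0.01:
        warnings.append("value head may be starved (value_loss near 0)")
    return warnings
```

An empty return value means the recent iterations look healthy by the table's criteria.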
If you see repetition collapse (all games ending in FIVEFOLD_REPETITION with short game lengths), the Dirichlet noise and repetition penalty should help. If it persists, try increasing --c-puct (e.g., 2.0-3.0) to encourage more exploration.
Use this as a reference when monitoring checkpoints/training_log.csv. Numbers are approximate — your run may differ, but the trends should match. Iteration ranges below align with the staged training plan above.
Expected: policy_loss ~7-8, value_loss ~0.01-0.05, avg_game_length 100-256
checkmates: 0-1, repetition_draws: 0-5, max_moves_draws: 5-15
white_wins: 0-1, black_wins: 0-1, draws: 18-20
The model plays essentially random legal moves. Games are long and almost all end in draws (max moves, insufficient material, or stalemate). This is normal — the model has no chess knowledge yet.
What's OK: All draws, no checkmates, high policy loss. Red flag: If games are already very short (<50 moves) with all repetition draws, Dirichlet noise may not be working.
Expected: policy_loss ~5-6 (dropping steadily), value_loss ~0.01-0.1
avg_game_length: 80-200, gradually decreasing
checkmates: 0-2 per iteration, draws: 15-20
repetition_draws: should be <50% of games
The model starts learning which pieces are valuable and basic captures. Games get shorter as the model learns to take hanging pieces. Policy loss should drop noticeably.
What's OK: Mostly draws still, but some decisive games appearing. Game lengths decreasing. Red flag: Policy loss not decreasing, or avg_game_length dropping below 40 with all repetition draws.
Expected: policy_loss ~3.5-5, value_loss ~0.05-0.2
avg_game_length: 60-120
checkmates: 1-5 per iteration (increasing trend)
white_wins + black_wins: 2-8 per iteration
repetition_draws: <30% of games
The model develops basic tactical awareness — it can capture pieces intentionally and starts mating in simple endgames. Value loss should be rising as the model sees more decisive games and the value head gets meaningful training signal.
Evaluate at iteration 500: Run against Stockfish depth 1. Target: win rate >30% → ~1000 ELO.
What's OK: Mix of decisive games and draws. Checkmates appearing semi-regularly.
Red flag: Value loss still near 0, zero checkmates after 250 iterations. Try increasing --c-puct to 2.0-3.0.
Expected: policy_loss ~2.5-3.5, value_loss ~0.1-0.3
avg_game_length: 50-100
checkmates: 3-8 per iteration
white_wins + black_wins: 5-12 per iteration
repetition_draws: <20% of games
The model learns basic opening principles (develop pieces, control center) and can execute simple tactical combinations. Games should have a healthy mix of decisive outcomes and draws.
Evaluate at iteration 2500: Run against Stockfish depth 1-2. Target: ~50% score vs depth 1 → ~1200 ELO.
What's OK: Roughly balanced white/black wins, checkmates in most iterations. Red flag: One side winning much more than the other (>80% of decisive games).
Expected: policy_loss ~2.0-3.0, value_loss ~0.15-0.4
avg_game_length: 50-90
checkmates: 5-10 per iteration
white_wins + black_wins: 8-15 per iteration
The model develops positional understanding — pawn structure, piece coordination, king safety. Games should be shorter and more decisive.
Evaluate at iteration 5000: Run against Stockfish depth 2-3. Target: ~50% score vs depth 2 → ~1500 ELO.
Expected: policy_loss ~1.5-2.5, value_loss ~0.2-0.5
avg_game_length: 40-80
checkmates: consistent 5+ per iteration
Loss will plateau — the model is refining rather than learning new concepts. Increase --mcts-sims to 200 and --games-per-iter to 50 for higher quality self-play.
Evaluate at iteration 30000: Run against Stockfish depth 5. Target: ~50% score → ~2000 ELO.
Expected: policy_loss ~1.0-2.0, value_loss ~0.3-0.5
avg_game_length: 40-70
Increase --mcts-sims to 400 and --games-per-iter to 100. Improvements will be incremental.
Evaluate at iteration 80000: Run against Stockfish depth 8. Target: ~50% score → ~2200 ELO.
| Symptom | Likely Cause | Fix |
|---|---|---|
| All games end in FIVEFOLD_REPETITION | Dirichlet noise too low, or model collapsed to repetitive policy | Increase --c-puct to 2.0-3.0, or restart from earlier checkpoint |
| Policy loss stuck / not decreasing | Learning rate too low, or buffer too large (stale data dilutes signal) | Decrease --buffer-size to 100K, or increase --lr |
| Value loss stays near 0 | All games are draws, value head is starved | Check that Dirichlet noise is working (games should be diverse). Increase repetition penalty |
| Policy loss oscillating wildly | Gradient explosions | Decrease --lr, gradient clipping should help (already enabled, max_norm=1.0) |
| One side always wins | Asymmetric training signal | Likely fine — it can self-correct as both sides improve. Monitor over 10+ iterations |
| Training very slow per iteration | MCTS simulations are bottleneck | Reduce --mcts-sims for early stages (100 is fine for <1500 ELO) |
| Loss suddenly jumps up after resume | Learning rate schedule mismatch or buffer was emptied | Normal — new self-play data from improved model takes time to stabilize |
| avg_game_length increasing after initial decrease | Model entering positional play phase (longer games = more strategic) | Good sign — this is healthy if checkmates are still occurring |
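The gradient clipping mentioned in the "oscillating wildly" row is the standard `torch.nn.utils.clip_grad_norm_` call between backward and step. A minimal illustration with a toy model (not the actual training loop):

```python
import torch
import torch.nn as nn

# Toy stand-in for the network; the point is where the clipping call goes.
model = nn.Linear(8, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

x, y = torch.randn(32, 8), torch.randn(32, 1)
opt.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale all gradients so their global L2 norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

Clipping caps the size of any single update without changing the gradient direction, which is why it tames loss spikes without slowing steady learning.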
Resume training from any saved checkpoint:
```bash
uv run python -u train.py --verbose-self-play --resume checkpoints/checkpoint_5000.pt
```

The checkpoint restores model weights, optimizer state, learning rate schedule, and iteration count. Training continues from where it left off.
A checkpoint_latest.pt and buffer_latest.npz are saved every iteration, overwriting the previous version. If training crashes or is interrupted:
```bash
# Resume from the last completed iteration (no lost progress)
uv run python -u train.py --verbose-self-play --resume checkpoints/checkpoint_latest.pt
```

This restores:
- Model weights, optimizer state, and LR schedule from `checkpoint_latest.pt`
- Replay buffer contents from `buffer_latest.npz` (so you don't regenerate all self-play data)
- Iteration count (continues from exactly where it stopped)
- The CSV training log is append-only, so all previous metrics are preserved
The numbered checkpoints (e.g., checkpoint_500.pt) are still saved at milestone intervals for ELO evaluation and export. The checkpoint_latest.pt is only for crash recovery.
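Checkpointing of this kind is typically a `torch.save`/`torch.load` round-trip over a dict of state dicts. The sketch below shows the general shape; the key names and exact contents are assumptions, not the ones train.py actually uses:

```python
import torch
import torch.nn as nn

def save_checkpoint(path: str, model: nn.Module,
                    optimizer: torch.optim.Optimizer, iteration: int) -> None:
    """Bundle everything needed to resume into one file."""
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),  # includes Adam moment estimates
        "iteration": iteration,
    }, path)

def load_checkpoint(path: str, model: nn.Module,
                    optimizer: torch.optim.Optimizer) -> int:
    """Restore weights and optimizer state; return the saved iteration."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["iteration"]
```

Saving the optimizer state matters: Adam's per-parameter moment estimates are part of the training trajectory, and resuming without them causes exactly the post-resume loss jump described in the troubleshooting table.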
| Parameter | Flag | Default |
|---|---|---|
| Iterations | `--iterations` | 80,000 |
| Games per iteration | `--games-per-iter` | 100 |
| Batch size | `--batch-size` | 256 |
| Batches per iteration | `--batches-per-iter` | 10 |
| Replay buffer size | `--buffer-size` | 500K |
| MCTS simulations/move | `--mcts-sims` | 400 |
| Learning rate | `--lr` / `--lr-final` | 0.001 -> 0.0001 (decay) |
| Optimizer | Adam (weight decay 1e-4) | |
| Verbose self-play | `--verbose-self-play` | off |
| Quiet mode | `--quiet` / `-q` | off |
| Save interval | `--save-every` | 0 (disabled) |
| Resume | `--resume` | none |
Checkpoints are saved at iterations: 10, 25, 50, 100, 250, 500, 1K, 2.5K, 5K, 10K, 30K, 80K (plus any --save-every interval). A CSV training log is also written to checkpoints/training_log.csv with per-iteration metrics including game outcomes, termination types, and average game length.
After training, evaluate checkpoints against Stockfish to estimate their rating.
```bash
uv run python evaluate.py checkpoints/checkpoint_5000.pt \
    --stockfish-path /path/to/stockfish
```

Evaluate several checkpoints in one run:

```bash
uv run python evaluate.py checkpoints/checkpoint_*.pt \
    --stockfish-path /path/to/stockfish \
    --num-games 20 \
    --stockfish-depth 5
```

| Flag | Default | Description |
|---|---|---|
| `--stockfish-path` | `stockfish` | Path to Stockfish binary |
| `--num-games` | 20 | Games per checkpoint |
| `--stockfish-depth` | 5 | Stockfish search depth |
| `--stockfish-time-limit` | 1.0 | Stockfish seconds per move |
| `--mcts-simulations` | 200 | MCTS simulations for model moves |
| Depth | ~ELO |
|---|---|
| 1 | 1300 |
| 2 | 1500 |
| 3 | 1700 |
| 5 | 2000 |
| 8 | 2300 |
| 10 | 2500 |
| 15 | 2800 |
| 20 | 3000 |
The script uses the inverse ELO formula:
```
score = (wins + 0.5 * draws) / total_games
elo_diff = -400 * log10(1/score - 1)
estimated_elo = stockfish_elo + elo_diff
```
A 95% confidence interval is computed using the Wilson score interval.
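The formulas above, plus the Wilson interval, in plain Python — a sketch of the arithmetic evaluate.py performs, not its actual code:

```python
import math

def estimate_elo(wins: int, draws: int, losses: int, stockfish_elo: float) -> float:
    """Invert the Elo expected-score formula to rate the model vs Stockfish."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    score = min(max(score, 1e-6), 1 - 1e-6)  # avoid log10(0) / division by zero
    elo_diff = -400 * math.log10(1 / score - 1)
    return stockfish_elo + elo_diff

def wilson_interval(successes: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a score fraction (z=1.96 for 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half
```

For example, a 50% score yields an ELO equal to Stockfish's, and the W:10 D:5 L:5 line in the sample output (62.5% score) works out to roughly +89 ELO over the opponent.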
The goal is to identify checkpoints that correspond to the target difficulty tiers:
| Target | ELO | Suggested approach |
|---|---|---|
| Beginner | 1000 | Evaluate very early checkpoints (50-500) vs depth 1 |
| Intermediate | 1200 | Evaluate early checkpoints (500-2500) vs depth 1-2 |
| Club Player | 1500 | Evaluate mid checkpoints (2500-5K) vs depth 2-3 |
| Advanced | 2000 | Evaluate later checkpoints (5K-30K) vs depth 5 |
| Master | 2200 | Evaluate late checkpoints (30K-80K) vs depth 8 |
Example output when evaluating multiple checkpoints:
```
====================================================
Checkpoint -> ELO Mapping
====================================================
checkpoint_5000.pt  | W:8  D:4 L:8 | Score: 50.0% | ELO: ~2000 +/- 150
checkpoint_10000.pt | W:10 D:5 L:5 | Score: 62.5% | ELO: ~2088 +/- 130
checkpoint_30000.pt | W:14 D:3 L:3 | Score: 77.5% | ELO: ~2210 +/- 110
====================================================
```
Once you've identified the right checkpoints via ELO evaluation, export them to ONNX for the Go runtime:
```bash
# Replace checkpoint filenames with the ones that matched each ELO target
uv run python export_onnx.py checkpoints/<best_1000_elo>.pt models/rl_1000.onnx   # Stage 1: iter ~100-500
uv run python export_onnx.py checkpoints/<best_1200_elo>.pt models/rl_1200.onnx   # Stage 2: iter ~500-2500
uv run python export_onnx.py checkpoints/<best_1500_elo>.pt models/rl_1500.onnx   # Stage 3: iter ~2500-5000
uv run python export_onnx.py checkpoints/<best_2000_elo>.pt models/rl_2000.onnx   # Stage 4: iter ~5000-30000
uv run python export_onnx.py checkpoints/<best_2200_elo>.pt models/rl_2200.onnx   # Stage 5: iter ~30000-80000
```

The exported `.onnx` files are then embedded into the Go binary via `go:embed`.
Run the test suite:

```bash
uv run pytest
```

| File | Purpose |
|---|---|
| `train.py` | Main training loop |
| `model.py` | ChessNet neural network architecture |
| `mcts.py` | Monte Carlo Tree Search with UCB selection |
| `board_encoder.py` | Convert board state to 18-channel tensor |
| `self_play.py` | Generate self-play games |
| `replay_buffer.py` | Store and sample training examples |
| `evaluate.py` | ELO estimation against Stockfish |
| `export_onnx.py` | Export PyTorch checkpoint to ONNX |
1. Train: `uv run python -u train.py --verbose-self-play --save-every 500`
2. Evaluate: `uv run python evaluate.py checkpoints/checkpoint_*.pt --stockfish-path stockfish`
3. Export: `uv run python export_onnx.py <best_checkpoint>.pt model.onnx`
4. Integrate: Copy the `.onnx` files to `internal/bot/models/` in the Go project
To train in stages with resume:
1a. Train to 500: `uv run python -u train.py --verbose-self-play --iterations 500 --mcts-sims 100 --save-every 50`
1b. Evaluate: `uv run python evaluate.py checkpoints/checkpoint_500.pt --stockfish-path stockfish --stockfish-depth 1`
1c. Train to 2500: `uv run python -u train.py --verbose-self-play --resume checkpoints/checkpoint_500.pt --iterations 2500 --mcts-sims 100 --save-every 250`
1d. Evaluate: `uv run python evaluate.py checkpoints/checkpoint_2500.pt --stockfish-path stockfish --stockfish-depth 2`
1e. Train to 5K: `uv run python -u train.py --verbose-self-play --resume checkpoints/checkpoint_2500.pt --iterations 5000 --mcts-sims 150 --save-every 500`
1f. Evaluate: `uv run python evaluate.py checkpoints/checkpoint_5000.pt --stockfish-path stockfish --stockfish-depth 3`
1g. Train to 30K: `uv run python -u train.py --verbose-self-play --resume checkpoints/checkpoint_5000.pt --iterations 30000 --mcts-sims 200`
1h. Train to 80K: `uv run python -u train.py --verbose-self-play --resume checkpoints/checkpoint_30000.pt --iterations 80000 --mcts-sims 400`
2. Export:
   - `uv run python export_onnx.py <best_1000_elo>.pt models/rl_1000.onnx`
   - `uv run python export_onnx.py <best_1200_elo>.pt models/rl_1200.onnx`
   - `uv run python export_onnx.py <best_1500_elo>.pt models/rl_1500.onnx`
   - `uv run python export_onnx.py <best_2000_elo>.pt models/rl_2000.onnx`
   - `uv run python export_onnx.py <best_2200_elo>.pt models/rl_2200.onnx`
3. Integrate: Copy the `.onnx` files to `internal/bot/models/` in the Go project
4. Monitor: Review `checkpoints/training_log.csv` for training health