This is the official code for VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pokémon.
This benchmark includes:
- a behavior cloning (BC) pipeline to gather human demonstrations, process them into state-action pairs, and train a model to imitate human play
- multi-agent reinforcement learning (RL) with 4 Policy Space Response Oracle (PSRO) algorithms to fine-tune an agent initialized either randomly or with the output of the BC pipeline
- a basic Large Language Model (LLM) player that any LLM can easily be plugged into
- 3 heuristic players from poke-env
Prerequisites:
- Python (I use v3.13)
- NodeJS and npm (whatever pokemon-showdown requires)
Run the following to ensure that pokemon showdown is configured:
```bash
git submodule update --init --recursive
cd pokemon-showdown
npm i
node pokemon-showdown start --no-security
```
Let that run until you see the following text:
```
RESTORE CHATROOM: lobby
RESTORE CHATROOM: staff
Worker 1 now listening on 0.0.0.0:8000
Test your server at http://localhost:8000
```
This shows that you can locally host the showdown server.
Install project dependencies by running:
```bash
pip install .[dev]
```
NOTE: Unless you're playing your policy on the live Pokémon Showdown servers with play.py, you must locally host your own server by running `node pokemon-showdown start <PORT> --no-security` from pokemon-showdown/ (done automatically if using the provided bash scripts).
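For reference, here is a minimal sketch of pointing a poke-env player at a locally hosted server on a non-default port. The port 8001 and the exact import paths are assumptions and may need adjusting for your poke-env version:

```python
from poke_env import ServerConfiguration
from poke_env.player import RandomPlayer

# Assumes a local server was started with: node pokemon-showdown start 8001 --no-security
local_server = ServerConfiguration(
    "ws://localhost:8001/showdown/websocket",  # websocket URL of the local server
    "https://play.pokemonshowdown.com/action.php?",  # auth URL (unused with --no-security)
)
player = RandomPlayer(server_configuration=local_server)
```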
All .py files in vgc_bench/ are scripts and (with the exception of scrape_data.py and visualize.py) have --help text. By contrast, all .py files in vgc_bench/src/ are not scripts, and are not intended to be run standalone.
The training code offers the following PSRO algorithms:
- pure self-play
- fictitious play
- double oracle method
- policy exploitation
...as well as some special training options:
- initializing the policy with the output of the BC pipeline (requires manually copying the BC policy file into the training run's save folder)
- frame stacking with specified number of frames
- excluding mirror matches (p1 and p2 using the same team)
- having the agent make a random team-preview selection at the beginning of each game
See train.sh for running multiple training runs simultaneously with automatic pokemon-showdown server management.
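To make the distinction between the PSRO algorithms above concrete, here is a toy sketch of how opponent selection typically differs between them. This is only an illustration under our own simplifying assumptions, not the repository's implementation: `Policy` is a string stand-in for a saved checkpoint, and treating policy exploitation as a best response to a single fixed target is our reading of that option.

```python
import random

Policy = str  # stand-in for a saved checkpoint

def self_play_opponent(checkpoints: list[Policy]) -> Policy:
    # Pure self-play: always train against the most recent policy.
    return checkpoints[-1]

def fictitious_play_opponent(checkpoints: list[Policy]) -> Policy:
    # Fictitious play: train against a uniform mixture over all past policies.
    return random.choice(checkpoints)

def double_oracle_opponent(checkpoints: list[Policy], meta_probs: list[float]) -> Policy:
    # Double oracle: train against a mixture weighted by the meta-game solution
    # (e.g. an equilibrium over the empirical payoff matrix).
    return random.choices(checkpoints, weights=meta_probs, k=1)[0]

def exploiter_opponent(target: Policy) -> Policy:
    # Policy exploitation (as assumed here): best-respond to one fixed target policy.
    return target

pool = ["ckpt_0", "ckpt_1", "ckpt_2"]
print(self_play_opponent(pool))
print(fictitious_play_opponent(pool))
print(double_oracle_opponent(pool, meta_probs=[0.1, 0.3, 0.6]))
print(exploiter_opponent(pool[0]))
```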
- scrape_logs.py scrapes logs from the Pokémon Showdown replay database, automatically filtering out bad logs and only scraping logs with open team sheets (OTS)
  - optional parallelization (strongly recommended)
  - if you don't need logs after 01/09/2026, just download our pre-scraped dataset of logs: vgc-battle-logs
- logs2trajs.py parses the logs into trajectories composed of state-action transitions
  - optional parallelization (strongly recommended)
  - `--min_rating` and `--only_winner` can be used to filter out low-Elo and losing trajectories, respectively
- pretrain.py uses the gathered trajectories to train a policy with behavior cloning
  - frame stacking with a specified number of frames
  - configurable fraction of the dataset to load into memory at any given time (if not set low enough, the program may run out of memory)
  - see pretrain.sh for running behavior cloning with automatic pokemon-showdown server management
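To make the BC step concrete, here is a minimal, hypothetical sketch of a behavior-cloning update on state-action pairs in plain PyTorch. The tensor shapes, network, and data layout are illustrative assumptions, not the repository's actual pipeline:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: `states` is a batch of flattened observations and `actions`
# holds the index of the move/switch the human chose in each state.
obs_dim, n_actions, batch_size = 128, 10, 32
states = torch.randn(batch_size, obs_dim)
actions = torch.randint(0, n_actions, (batch_size,))

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Behavior cloning reduces to supervised classification: maximize the likelihood
# of the human's action under the policy (cross-entropy on the action logits).
logits = policy(states)
loss = nn.functional.cross_entropy(logits, actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```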
See llm.py for the provided LLMPlayer wrapper class. We use meta-llama/Meta-Llama-3.1-8B-Instruct, but the user may replace logic in the setup_llm and get_response methods to use a different LLM.
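As a rough sketch of what swapping in a different Hugging Face model might look like: the method names setup_llm and get_response come from the description above, but their exact signatures in llm.py are assumptions here, and the model name is an arbitrary example.

```python
# Hypothetical sketch: the real signatures of setup_llm/get_response in llm.py may differ;
# this only illustrates substituting another Hugging Face chat model.
from transformers import pipeline

class MyLLMBackend:
    def setup_llm(self):
        # Any instruction-tuned chat model can be dropped in here.
        self.pipe = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

    def get_response(self, prompt: str) -> str:
        out = self.pipe(prompt, max_new_tokens=256, return_full_text=False)
        return out[0]["generated_text"]
```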
See poke-env for detailed examples of using the heuristic players. For example:
```python
import asyncio

from poke_env import cross_evaluate
from poke_env.player import MaxBasePowerPlayer, RandomPlayer, SimpleHeuristicsPlayer

random_player = RandomPlayer()
mbp_player = MaxBasePowerPlayer()
sh_player = SimpleHeuristicsPlayer()
results = asyncio.run(cross_evaluate([random_player, mbp_player, sh_player], n_challenges=100))
print(results)
```
- eval.py runs the cross-play evaluation, performance test, generalization test, and ranking algorithm as described in our paper (see above)
  - see eval.sh for running multiple evaluations simultaneously with automatic pokemon-showdown server management
- play.py loads a saved policy onto the live Pokémon Showdown servers, where the policy can receive challenges from other users or enter the online Elo ladder
- visualize.py processes cross-evaluation results into heatmaps and also provides conversion functions for LaTeX and Markdown table formats
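As a rough illustration of what such a conversion involves (this is not visualize.py itself), the nested win-rate dictionary returned by poke-env's cross_evaluate can be rendered as a Markdown table with the third-party tabulate package; the win rates below are made-up numbers for illustration only.

```python
from tabulate import tabulate

# `results` mimics the nested dict returned by poke_env's cross_evaluate:
# results[p1][p2] is p1's win rate against p2 (None on the diagonal).
# The numbers here are invented purely for illustration.
results = {
    "Random": {"Random": None, "MaxBasePower": 0.12, "SimpleHeuristics": 0.05},
    "MaxBasePower": {"Random": 0.88, "MaxBasePower": None, "SimpleHeuristics": 0.31},
    "SimpleHeuristics": {"Random": 0.95, "MaxBasePower": 0.69, "SimpleHeuristics": None},
}

players = list(results)
rows = [[p1] + [results[p1][p2] for p2 in players] for p1 in players]
# tablefmt="github" emits a Markdown pipe table; tablefmt="latex" would emit LaTeX instead.
print(tabulate(rows, headers=[""] + players, tablefmt="github"))
```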
For each run, 200 battles were used to compare each pair of agents, except for the LLM player, which was compared using 20 battles. The heatmap below averages the results of 5 independent training runs for each trainable agent, for a total of 1000 battles per agent comparison (100 per comparison involving the LLM player).
Legend: R = random player, MBP = max base power player, SH = simple heuristics player, LLM = LLM player, SP = self-play agent, FP = fictitious play agent, DO = double oracle agent, BC = behavior cloning agent, BCSP = self-play agent initialized with behavior cloning, BCFP = fictitious play agent initialized with behavior cloning, BCDO = double oracle agent initialized with behavior cloning
This test compares the strongest method, on average across runs 1-5, from each of the 1-, 4-, 16-, and 64-team settings, evaluated on the one team that all of them were exposed to during training.
| # teams | 1 (BCSP) | 4 (BCSP) | 16 (BCDO) | 64 (BCSP) |
|---|---|---|---|---|
| 1 (BCSP) | -- | 0.699 | 0.74 | 0.698 |
| 4 (BCSP) | 0.301 | -- | 0.594 | 0.672 |
| 16 (BCDO) | 0.26 | 0.406 | -- | 0.644 |
| 64 (BCSP) | 0.302 | 0.328 | 0.356 | -- |
This test compares the strongest method, on average across runs 1-5, from each of the 1-, 4-, 16-, and 64-team settings, evaluated on 72 held-out teams that none of them were exposed to during training.
| # teams | 1 (BCSP) | 4 (BCSP) | 16 (BCDO) | 64 (BCSP) |
|---|---|---|---|---|
| 1 (BCSP) | -- | 0.405 | 0.375 | 0.331 |
| 4 (BCSP) | 0.595 | -- | 0.453 | 0.422 |
| 16 (BCDO) | 0.625 | 0.547 | -- | 0.436 |
| 64 (BCSP) | 0.669 | 0.578 | 0.564 | -- |
See our paper for further results and details.
```bibtex
@inproceedings{anglissvgc,
  title={VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pok{\'e}mon},
  author={Angliss, Cameron L and Cui, Jiaxun and Hu, Jiaheng and Rahman, Arrasy and Stone, Peter},
  booktitle={The 25th International Conference on Autonomous Agents and Multi-Agent Systems}
}
```