VGC-Bench

This is the official code for VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pokémon.

This benchmark includes:

  • multi-agent reinforcement learning (RL) with 4 Policy Space Response Oracle (PSRO) algorithms to fine-tune an agent initialized either randomly or with the output of the behavior cloning (BC) pipeline
  • a BC pipeline to gather human demonstrations, process them into state-action pairs, and train a model to imitate human play
  • a basic Large Language Model (LLM) player that any LLM can easily be plugged into
  • 3 heuristic players from poke-env

🛠️ Setup

Prerequisites:

  1. Python (I use v3.13)
  2. Node.js and npm (whatever version pokemon-showdown requires)

Run the following to make sure pokemon-showdown is configured:

git submodule update --init --recursive
cd pokemon-showdown
npm i
node pokemon-showdown start --no-security

Let that run until you see the following text:

RESTORE CHATROOM: lobby
RESTORE CHATROOM: staff
Worker 1 now listening on 0.0.0.0:8000
Test your server at http://localhost:8000

This confirms that you can host the Showdown server locally.

Install project dependencies by running:

pip install .[dev]

👨‍💻 How to use

NOTE: Unless you're playing your policy on the live Pokémon Showdown servers with play.py, you must host your own server locally by running node pokemon-showdown start <PORT> --no-security from pokemon-showdown/ (this is done automatically if you use the bash scripts).

All .py files in vgc_bench/ are scripts and (with the exception of scrape_data.py and visualize.py) have --help text. By contrast, all .py files in vgc_bench/src/ are not scripts, and are not intended to be run standalone.

🏆 Population-based Reinforcement Learning

The training code offers the following PSRO algorithms:

  • pure self-play
  • fictitious play
  • double oracle method
  • policy exploitation

...as well as some special training options:

  • initializing the policy with the output of the BC pipeline (requires manually copying the BC policy file into the training run's save folder)
  • frame stacking with a specified number of frames
  • excluding mirror matches (p1 and p2 using the same team)
  • starting the agent with a random teampreview at the beginning of each game

See train.sh for running multiple training runs simultaneously with automatic pokemon-showdown server management.
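
The four algorithms above differ mainly in how each training iteration's opponent is chosen. The sketch below is purely illustrative and is not the repo's training code; current_policy, policy_pool (past checkpoints), and meta_distribution are made-up names.

import random

# Illustrative opponent selection for the PSRO variants listed above.
# NOT the repo's implementation; all names here are hypothetical.
def pick_opponent(algorithm, current_policy, policy_pool, meta_distribution=None):
    if algorithm == "self-play":
        # Train against the current policy itself.
        return current_policy
    if algorithm == "fictitious-play":
        # Train against a uniformly sampled past checkpoint,
        # approximating a best response to the average of past policies.
        return random.choice(policy_pool)
    if algorithm == "double-oracle":
        # Train against checkpoints sampled from a meta-game distribution,
        # e.g. a Nash equilibrium over the empirical payoff matrix.
        return random.choices(policy_pool, weights=meta_distribution, k=1)[0]
    if algorithm == "policy-exploitation":
        # Train a best response against one fixed target policy
        # (here, simply the most recent checkpoint).
        return policy_pool[-1]
    raise ValueError(f"unknown algorithm: {algorithm}")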

📚 Behavior Cloning

  1. scrape_logs.py scrapes logs from the Pokémon Showdown replay database, automatically filtering out bad logs and only scraping logs with open team sheets (OTS)
    • optional parallelization (strongly recommended)
    • if you don't need logs after 01/09/2026, just download our pre-scraped dataset of logs: vgc-battle-logs
  2. logs2trajs.py parses the logs into trajectories composed of state-action transitions
    • optional parallelization (strongly recommended)
    • --min_rating and --only_winner can be used to filter out low-Elo and losing trajectories respectively
  3. pretrain.py uses the gathered trajectories to train a policy with behavior cloning (a generic sketch of this step appears after this list)
    • frame stacking with specified number of frames
    • configurable fraction of the dataset to load into memory at any given time (if set too high, the program may run out of memory)
    • see pretrain.sh for running behavior cloning with automatic pokemon-showdown server management
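
Conceptually, step 3 is plain supervised learning: maximize the likelihood of the human player's action at each state. The loop below is a generic behavior-cloning sketch for intuition only, not pretrain.py; the state/action tensors, network, and hyperparameters are placeholders.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the parsed state-action trajectories.
states = torch.randn(1024, 128)          # hypothetical state encodings
actions = torch.randint(0, 10, (1024,))  # hypothetical action indices
loader = DataLoader(TensorDataset(states, actions), batch_size=64, shuffle=True)

# Small policy network mapping a state encoding to action logits.
policy = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for s, a in loader:
        loss = loss_fn(policy(s), a)  # cross-entropy against the human action
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()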

🤖 LLMs

See llm.py for the provided LLMPlayer wrapper class. We use meta-llama/Meta-Llama-3.1-8B-Instruct, but you can replace the logic in the setup_llm and get_response methods to use a different LLM.
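
As a rough illustration, swapping in a hosted model might look like the snippet below. The import path and the assumption that get_response takes a prompt string and returns the reply text are guesses about the interface, so check llm.py for the real signatures before adapting it.

from openai import OpenAI

from vgc_bench.llm import LLMPlayer  # hypothetical import path


class HostedLLMPlayer(LLMPlayer):
    def setup_llm(self):
        # Replace the local Llama model with a hosted API client.
        self.client = OpenAI()

    def get_response(self, prompt: str) -> str:
        # Assumed contract: prompt string in, raw model reply out.
        completion = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content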

🎲 Heuristics

See poke-env for detailed examples of using the heuristic players. For example:

import asyncio

from poke_env import cross_evaluate
from poke_env.player import MaxBasePowerPlayer, RandomPlayer, SimpleHeuristicsPlayer

# The players connect to a locally hosted Showdown server (see Setup above).
random_player = RandomPlayer()
mbp_player = MaxBasePowerPlayer()
sh_player = SimpleHeuristicsPlayer()

# Round-robin evaluation: each pair of players battles n_challenges times,
# and the result is a nested dict of win rates.
results = asyncio.run(cross_evaluate([random_player, mbp_player, sh_player], n_challenges=100))
print(results)

📊 Evaluation

  • eval.py runs the cross-play evaluation, performance test, generalization test, and ranking algorithm as described in our paper (see above)
    • see eval.sh for running multiple evaluations simultaneously with automatic pokemon-showdown server management
  • play.py loads a saved policy onto the live Pokémon Showdown servers, where the policy can receive challenges from other users or enter the online Elo ladder
  • visualize.py processes cross-evaluation results into heatmaps and also provides conversion functions for LaTeX and Markdown output (a minimal standalone plotting sketch follows this list)
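
If you just want a quick look at cross-play results, a nested win-rate dictionary like the one cross_evaluate returns can be plotted directly. The snippet below is a minimal matplotlib sketch, not the repo's visualize.py, and the results values are made up.

import matplotlib.pyplot as plt
import numpy as np

# Toy cross_evaluate-style output: results[p1][p2] is p1's win rate vs. p2,
# with None on the diagonal. These numbers are invented for illustration.
results = {
    "R":   {"R": None, "MBP": 0.20, "SH": 0.10},
    "MBP": {"R": 0.80, "MBP": None, "SH": 0.35},
    "SH":  {"R": 0.90, "MBP": 0.65, "SH": None},
}
names = list(results)
matrix = np.array([[results[r][c] if results[r][c] is not None else np.nan
                    for c in names] for r in names])

fig, ax = plt.subplots()
im = ax.imshow(matrix, vmin=0, vmax=1, cmap="viridis")
ax.set_xticks(range(len(names)), names)
ax.set_yticks(range(len(names)), names)
fig.colorbar(im, ax=ax, label="row player's win rate vs. column player")
plt.show()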

Cross-evaluation of all AI agents

For each run, 200 battles were used to compare each pair of agents, except for the LLM player, which was compared with 20 battles. The heatmap below averages the results of 5 independent training runs for each trainable agent, for a total of 1000 battles per agent comparison (100 battles per comparison for the LLM player).

[Figure: figures/heatmaps_avg.png (averaged cross-evaluation heatmap)]

Legend:
  • R = random player, MBP = max base power player, SH = simple heuristics player, LLM = LLM player
  • SP = self-play agent, FP = fictitious play agent, DO = double oracle agent, BC = behavior cloning agent
  • BCSP = self-play agent initialized with behavior cloning, BCFP = fictitious play agent initialized with behavior cloning, BCDO = double oracle agent initialized with behavior cloning

Performance Test

This test pits the strongest method (averaged over runs 1-5) from each of the 1-, 4-, 16-, and 64-team settings against one another, using the one team that all of them had training exposure to. Each entry below is the row setting's win rate against the column setting.

# teams      1 (BCSP)   4 (BCSP)   16 (BCDO)   64 (BCSP)
1 (BCSP)        --        0.699      0.740       0.698
4 (BCSP)       0.301       --        0.594       0.672
16 (BCDO)      0.260      0.406       --         0.644
64 (BCSP)      0.302      0.328      0.356        --

Generalization Test

This test pits the strongest method (averaged over runs 1-5) from each of the 1-, 4-, 16-, and 64-team settings against one another, using 72 teams that none of them had training exposure to.

# teams      1 (BCSP)   4 (BCSP)   16 (BCDO)   64 (BCSP)
1 (BCSP)        --        0.405      0.375       0.331
4 (BCSP)       0.595       --        0.453       0.422
16 (BCDO)      0.625      0.547       --         0.436
64 (BCSP)      0.669      0.578      0.564        --

See our paper for further results and details.

📜 Cite us

@inproceedings{anglissvgc,
  title={VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pok{\'e}mon},
  author={Angliss, Cameron L and Cui, Jiaxun and Hu, Jiaheng and Rahman, Arrasy and Stone, Peter},
  booktitle={The 25th International Conference on Autonomous Agents and Multi-Agent Systems}
}
