Skip to content

cameronangliss/vgc-bench

Repository files navigation

VGC-Bench

CI Python 3.10‒3.13 License: MIT arXiv

This is the official code for VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pokémon.

This benchmark includes:

  • multi-agent reinforcement learning (RL) with 4 Policy Space Response Oracle (PSRO) algorithms to fine-tune an agent initialized either randomly or with the output of the BC pipeline
  • a behavior cloning (BC) pipeline to gather human demonstrations, process them into state-action pairs, and train a model to imitate human play
  • a basic Large Language Model (LLM) player that any LLM can easily be plugged into
  • 3 heuristic players from poke-env

🛠️ Setup

Prerequisites:

  1. Python (I use v3.13)
  2. NodeJS and npm (whatever pokemon-showdown requires)

Run the following to ensure that pokemon showdown is configured:

git submodule update --init --recursive
cd pokemon-showdown
npm i
node pokemon-showdown start --no-security

Let that run until you see the following text:

RESTORE CHATROOM: lobby
RESTORE CHATROOM: staff
Worker 1 now listening on 0.0.0.0:8000
Test your server at http://localhost:8000

This shows that you can locally host the showdown server.

Install project dependencies by running:

pip install .[dev]

NOTE: if this doesn't work due to the open-spiel dependency, feel free to remove it in pyproject.toml. It is only necessary for the vgc_bench/eval module.

👨‍💻 How to use

NOTE: Unless you're playing your policy on the live Pokémon Showdown servers with play.py, you must locally host your own server by running node pokemon-showdown start <PORT> --no-security from pokemon-showdown/ (done automatically if using bash scripts).

All .py files in vgc_bench/ are runnable modules and (with the exception of scrape_data.py and visualize.py) have --help text. Run them from the repo root, e.g. python -m vgc_bench.train. By contrast, all .py files in vgc_bench/src/ are not modules, and are not intended to be run standalone.

🏆 Population-based Reinforcement Learning

The training code offers the following PSRO algorithms:

  • pure self-play
  • fictitious play
  • double oracle method
  • policy exploitation

...as well as some special training options:

  • initializing the policy with the output of the BC pipeline; if --behavior_clone is enabled and no local BC checkpoint is present, vgc_bench.train automatically downloads results/saves-bc/seed1/100.zip from the vgc-bench-models model repo
  • frame stacking with specified number of frames
  • excluding mirror matches (p1 and p2 using the same team)
  • starting agent with random teampreview at the beginning of each game
  • matchup solving with specific team strings (pass both --team1 and --team2 to train on a single matchup)

See train.sh for running multiple training runs simultaneously with automatic pokemon-showdown server management, or train_matchup.sh for an example of training on a specific team matchup. If you don't want to run train.py yourself, pre-trained models are available in vgc-bench-models.

📚 Behavior Cloning

  1. scrape_logs.py scrapes logs from the Pokémon Showdown replay database, automatically filtering out bad logs and only scraping logs with open team sheets (OTS)
    • optional parallelization (strongly recommended)
    • if you don't need logs after 03/10/2026, just download our pre-scraped dataset of logs from vgc-battle-logs and place the files in battle_logs/
  2. logs2trajs.py parses the logs into trajectories composed of state-action transitions
    • optional parallelization (strongly recommended)
    • --min_rating and --only_winner can be used to filter out low-Elo and losing trajectories respectively
  3. pretrain.py uses the gathered trajectories to train a policy with behavior cloning
    • frame stacking with specified number of frames
    • configurable fraction of dataset to load into memory at any given time (if not set low enough, program may run out of memory)
    • see pretrain.sh for running behavior cloning with automatic pokemon-showdown server management
    • if you don't want to run pretrain.py yourself, use the pre-trained BC checkpoint in vgc-bench-models

🤖 LLMs

See llm.py for the provided LLMPlayer wrapper class. We use meta-llama/Meta-Llama-3.1-8B-Instruct, but the user may replace logic in the setup_llm and get_response methods to use a different LLM.

🎲 Heuristics

See poke-env for detailed examples of using the heuristic players. For example:

import asyncio

from poke_env import cross_evaluate
from poke_env.player import MaxBasePowerPlayer, RandomPlayer, SimpleHeuristicsPlayer

random_player = RandomPlayer()
mbp_player = MaxBasePowerPlayer()
sh_player = SimpleHeuristicsPlayer()
results = asyncio.run(cross_evaluate([random_player, mbp_player, sh_player], n_challenges=100))
print(results)

📊 Evaluation

  • eval.py runs the cross-play evaluation, performance test, generalization test, and ranking algorithm as described in our paper (see above)
    • see eval.sh for running multiple evaluations simultaneously with automatic pokemon-showdown server management
  • play.py loads a saved policy onto the live Pokémon Showdown servers, where the policy can receive challenges from other users or enter the online Elo ladder
  • visualize.py processes cross-evaluation results into heatmaps and features conversion functions for LaTeX and Markdown formats

Cross-evaluation of all AI agents

For each run, 200 battles were used to compare agents, except for LLM player which was compared with 20 battles. The heatmap below averages the results of 5 independent training runs for each trainable agent, accounting for 1000 total battles in each agent comparison, and 100 battles per comparison for the LLM player.

figures/heatmaps_avg.png

Legend: R = random player, MBP = max base power player, SH = simple heuristics player, LLM = LLM player, SP = self-play agent, FP = fictitious play agent, DO = double oracle agent, BC = behavior cloning agent, BCSP = self-play agent initialized with behavior cloning, BCFP = fictitious play agent initialized with behavior cloning, BCDO = double oracle agent initialized with behavior cloning

Performance Test

This test compares the performance of the strongest method on average across runs 1-5 of the 1, 4, 16, and 64 team setting with the one team that they all had training exposure to.

# teams 1 (BCSP) 4 (BCSP) 16 (BCDO) 64 (BCSP)
1 (BCSP) -- 0.699 0.74 0.698
4 (BCSP) 0.301 -- 0.594 0.672
16 (BCDO) 0.26 0.406 -- 0.644
64 (BCSP) 0.302 0.328 0.356 --

Generalization Test

This test compares the performance of the strongest method on average across runs 1-5 of the 1, 4, 16, and 64 team setting with 72 teams that none of them had training exposure to.

# teams 1 (BCSP) 4 (BCSP) 16 (BCDO) 64 (BCSP)
1 (BCSP) -- 0.405 0.375 0.331
4 (BCSP) 0.595 -- 0.453 0.422
16 (BCDO) 0.625 0.547 -- 0.436
64 (BCSP) 0.669 0.578 0.564 --

See our paper for further results and details.

📜 Cite us

@inproceedings{anglissvgc,
  title={VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pok{\'e}mon},
  author={Angliss, Cameron L and Cui, Jiaxun and Hu, Jiaheng and Rahman, Arrasy and Stone, Peter},
  booktitle={The 25th International Conference on Autonomous Agents and Multi-Agent Systems}
}

About

An AI benchmark for Pokémon VGC with agent implementations using multi-agent reinforcement learning, behavior cloning, LLMs, and heuristics

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published