VGC-Bench

This is the official code for VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pokémon.

This benchmark includes:

  • multi-agent reinforcement learning (RL) with 4 Policy Space Response Oracle (PSRO) algorithms to fine-tune an agent initialized either randomly or with the output of the behavior cloning (BC) pipeline
  • a BC pipeline to gather human demonstrations, process them into state-action pairs, and train a model to imitate human play
  • a basic Large Language Model (LLM) player that any LLM can easily be plugged into
  • 3 heuristic players from poke-env

🛠️ Setup

Prerequisites:

  1. Python (I use v3.13)
  2. Node.js and npm (whatever version pokemon-showdown requires)

Run the following to make sure pokemon-showdown is configured:

git submodule update --init --recursive
cd pokemon-showdown
npm i
node pokemon-showdown start --no-security

Let that run until you see the following text:

RESTORE CHATROOM: lobby
RESTORE CHATROOM: staff
Worker 1 now listening on 0.0.0.0:8000
Test your server at http://localhost:8000

This confirms that you can host the Showdown server locally.

Install project dependencies by running:

pip install .[dev]

👨‍💻 How to use

NOTE: Unless you're playing your policy on the live Pokémon Showdown servers with play.py, you must host your own server locally by running node pokemon-showdown start <PORT> --no-security from pokemon-showdown/ (this is done automatically if you use the bash scripts).

All .py files in vgc_bench/ are scripts and (with the exception of scrape_data.py and visualize.py) have --help text. By contrast, all .py files in vgc_bench/src/ are not scripts, and are not intended to be run standalone.

🏆 Population-based Reinforcement Learning

The training code offers the following PSRO algorithms:

  • pure self-play
  • fictitious play
  • double oracle method
  • policy exploitation

...as well as some special training options:

  • initializing the policy with the output of the BC pipeline (requires manually copying the BC policy file into the training run's save folder)
  • frame stacking with a specified number of frames
  • excluding mirror matches (p1 and p2 using the same team)
  • starting the agent with a random teampreview at the beginning of each game

See train.sh for running multiple training runs simultaneously with automatic pokemon-showdown server management.
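
The four algorithms above differ mainly in how each training iteration's opponent is chosen. The sketch below is purely illustrative and is not the repo's training code; current_policy, policy_pool (past checkpoints), and meta_distribution are made-up names.

import random

# Illustrative opponent selection for the PSRO variants listed above.
# NOT the repo's implementation; all names here are hypothetical.
def pick_opponent(algorithm, current_policy, policy_pool, meta_distribution=None):
    if algorithm == "self-play":
        # Train against the current policy itself.
        return current_policy
    if algorithm == "fictitious-play":
        # Train against a uniformly sampled past checkpoint,
        # approximating a best response to the average of past policies.
        return random.choice(policy_pool)
    if algorithm == "double-oracle":
        # Train against checkpoints sampled from a meta-game distribution,
        # e.g. a Nash equilibrium over the empirical payoff matrix.
        return random.choices(policy_pool, weights=meta_distribution, k=1)[0]
    if algorithm == "policy-exploitation":
        # Train a best response against one fixed target policy
        # (here, simply the most recent checkpoint).
        return policy_pool[-1]
    raise ValueError(f"unknown algorithm: {algorithm}")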

📚 Behavior Cloning

  1. scrape_logs.py scrapes logs from the Pokémon Showdown replay database, automatically filtering out bad logs and only scraping logs with open team sheets (OTS)
    • optional parallelization (strongly recommended)
    • if you don't need logs after 01/09/2026, just download our pre-scraped dataset of logs: vgc-battle-logs
  2. logs2trajs.py parses the logs into trajectories composed of state-action transitions
    • optional parallelization (strongly recommended)
    • --min_rating and --only_winner can be used to filter out low-Elo and losing trajectories respectively
  3. pretrain.py uses the gathered trajectories to train a policy with behavior cloning (a generic sketch of this step appears after this list)
    • frame stacking with specified number of frames
    • configurable fraction of the dataset to load into memory at any given time (if set too high, the program may run out of memory)
    • see pretrain.sh for running behavior cloning with automatic pokemon-showdown server management
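
Conceptually, step 3 is plain supervised learning: maximize the likelihood of the human player's action at each state. The loop below is a generic behavior-cloning sketch for intuition only, not pretrain.py; the state/action tensors, network, and hyperparameters are placeholders.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the parsed state-action trajectories.
states = torch.randn(1024, 128)          # hypothetical state encodings
actions = torch.randint(0, 10, (1024,))  # hypothetical action indices
loader = DataLoader(TensorDataset(states, actions), batch_size=64, shuffle=True)

# Small policy network mapping a state encoding to action logits.
policy = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for s, a in loader:
        loss = loss_fn(policy(s), a)  # cross-entropy against the human action
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()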

🤖 LLMs

See llm.py for the provided LLMPlayer wrapper class. We use meta-llama/Meta-Llama-3.1-8B-Instruct, but you can replace the logic in the setup_llm and get_response methods to use a different LLM.
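
As a rough illustration, swapping in a hosted model might look like the snippet below. The import path and the assumption that get_response takes a prompt string and returns the reply text are guesses about the interface, so check llm.py for the real signatures before adapting it.

from openai import OpenAI

from vgc_bench.llm import LLMPlayer  # hypothetical import path


class HostedLLMPlayer(LLMPlayer):
    def setup_llm(self):
        # Replace the local Llama model with a hosted API client.
        self.client = OpenAI()

    def get_response(self, prompt: str) -> str:
        # Assumed contract: prompt string in, raw model reply out.
        completion = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content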

🎲 Heuristics

See poke-env for detailed examples of using the heuristic players. For example:

import asyncio

from poke_env import cross_evaluate
from poke_env.player import MaxBasePowerPlayer, RandomPlayer, SimpleHeuristicsPlayer

# The players connect to a locally hosted Showdown server (see Setup above).
random_player = RandomPlayer()
mbp_player = MaxBasePowerPlayer()
sh_player = SimpleHeuristicsPlayer()

# Round-robin evaluation: each pair of players battles n_challenges times,
# and the result is a nested dict of win rates.
results = asyncio.run(cross_evaluate([random_player, mbp_player, sh_player], n_challenges=100))
print(results)

📊 Evaluation

  • eval.py runs the cross-play evaluation, performance test, generalization test, and ranking algorithm as described in our paper (see above)
    • see eval.sh for running multiple evaluations simultaneously with automatic pokemon-showdown server management
  • play.py loads a saved policy onto the live Pokémon Showdown servers, where the policy can receive challenges from other users or enter the online Elo ladder
  • visualize.py processes cross-evaluation results into heatmaps and also provides conversion functions for LaTeX and Markdown output (a minimal standalone plotting sketch follows this list)
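
If you just want a quick look at cross-play results, a nested win-rate dictionary like the one cross_evaluate returns can be plotted directly. The snippet below is a minimal matplotlib sketch, not the repo's visualize.py, and the results values are made up.

import matplotlib.pyplot as plt
import numpy as np

# Toy cross_evaluate-style output: results[p1][p2] is p1's win rate vs. p2,
# with None on the diagonal. These numbers are invented for illustration.
results = {
    "R":   {"R": None, "MBP": 0.20, "SH": 0.10},
    "MBP": {"R": 0.80, "MBP": None, "SH": 0.35},
    "SH":  {"R": 0.90, "MBP": 0.65, "SH": None},
}
names = list(results)
matrix = np.array([[results[r][c] if results[r][c] is not None else np.nan
                    for c in names] for r in names])

fig, ax = plt.subplots()
im = ax.imshow(matrix, vmin=0, vmax=1, cmap="viridis")
ax.set_xticks(range(len(names)), names)
ax.set_yticks(range(len(names)), names)
fig.colorbar(im, ax=ax, label="row player's win rate vs. column player")
plt.show()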

Cross-evaluation of all AI agents

For each run, 200 battles were used to compare each pair of agents, except for the LLM player, which was compared with 20 battles. The heatmap below averages the results of 5 independent training runs for each trainable agent, for a total of 1000 battles per agent comparison (100 battles per comparison for the LLM player).

[Figure: figures/heatmaps_avg.png (averaged cross-evaluation heatmap)]

Legend:
  • R = random player, MBP = max base power player, SH = simple heuristics player, LLM = LLM player
  • SP = self-play agent, FP = fictitious play agent, DO = double oracle agent, BC = behavior cloning agent
  • BCSP = self-play agent initialized with behavior cloning, BCFP = fictitious play agent initialized with behavior cloning, BCDO = double oracle agent initialized with behavior cloning

Performance Test

This test pits the strongest method (averaged over runs 1-5) from each of the 1-, 4-, 16-, and 64-team settings against one another, using the one team that all of them had training exposure to. Each entry below is the row setting's win rate against the column setting.

# teams      1 (BCSP)   4 (BCSP)   16 (BCDO)   64 (BCSP)
1 (BCSP)        --        0.699      0.740       0.698
4 (BCSP)       0.301       --        0.594       0.672
16 (BCDO)      0.260      0.406       --         0.644
64 (BCSP)      0.302      0.328      0.356        --

Generalization Test

This test pits the strongest method (averaged over runs 1-5) from each of the 1-, 4-, 16-, and 64-team settings against one another, using 72 teams that none of them had training exposure to.

# teams      1 (BCSP)   4 (BCSP)   16 (BCDO)   64 (BCSP)
1 (BCSP)        --        0.405      0.375       0.331
4 (BCSP)       0.595       --        0.453       0.422
16 (BCDO)      0.625      0.547       --         0.436
64 (BCSP)      0.669      0.578      0.564        --

See our paper for further results and details.

📜 Cite us

@inproceedings{anglissvgc,
  title={VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pok{\'e}mon},
  author={Angliss, Cameron L and Cui, Jiaxun and Hu, Jiaheng and Rahman, Arrasy and Stone, Peter},
  booktitle={The 25th International Conference on Autonomous Agents and Multi-Agent Systems}
}
