Skip to content

dashidhy/DanLM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DanLM: Tokenization Is All You Need to Master Complex Card Games

A game AI that learns entirely from raw game history via self-play reinforcement learning, with truly zero domain knowledge — no policy priors, no hand-crafted features, what you see is what you get, surpassing hand-crafted SOTA in GuanDan (掼蛋), a complex 4-player trick-taking card game hugely popular across China.

Disclaimer: This project was developed through ~100% vibe coding (powered by Claude Opus 4.6). While extensively tested, the code and documentation may contain critical bugs, hallucinations, or inaccuracies. We are actively working on fixing these issues. Use at your own risk and verify critical results independently. If you encounter any problems, feel free to open an issue.


Architecture

Previous SOTA (DanZero) — requires domain knowledge

graph LR
    feat["567-dim Hand-Crafted Features<br/>(manual game history filtering, pre-calculated statistics, secondary information like remaining cards, etc.)"] --> mlp["MLP"] --> q1["Q(s,a)"]

    style feat fill:#ffebee,stroke:#E53935
    style mlp fill:#ffebee,stroke:#E53935
    style q1 fill:#ffebee,stroke:#E53935
Loading

DanLM (This Work) — zero domain knowledge

graph LR
    history["Game History<br/>(tokenized play record, raw public information only)"] --> encoder["TinyLM Encoder"]
    hand["Hand + Action<br/>(simple count/onehot vectors)"] --> handmlp["Hand MLP"]

    encoder -->|"context"| qhead["Q-Value Head"]
    encoder -->|"hidden states"| ntp["Next-Token Prediction<br/>(auxiliary task)"]
    handmlp -->|"hand context"| qhead
    qhead --> q2["Q(s,a)"]

    style ntp fill:#f3e5f5,stroke:#9C27B0

    style history fill:#e8f5e9,stroke:#43A047
    style hand fill:#e8f5e9,stroke:#43A047
    style encoder fill:#fff3e0,stroke:#FF9800
    style handmlp fill:#e8f5e9,stroke:#4CAF50
    style qhead fill:#e8f4f8,stroke:#2196F3
    style q2 fill:#e8f5e9,stroke:#43A047
Loading

Key Idea

Existing card game AI systems (DouZero, DanZero, PerfectDou, Suphx, etc.) usually rely on carefully designed hand-crafted features, including too much domain knowledge like pre-calculated statistics and secondary information of the game.

DanLM shows that raw game history speaks for itself. The input is simply the raw play-by-play game transcript — who played what cards, in order — tokenized like natural language. The model learns what matters from scratch, through self-play RL and causal sequence modeling.

Aspect Previous SOTA (DanZero) DanLM
State features 567-dim hand-crafted Raw token sequence
Architecture MLP TinyLM + MLP
Domain knowledge Yes No
Training DMC self-play DMC self-play + NTP

Evaluation

Models

Model Main Architecture State Representation
DanZero MLP 567-dim hand-crafted
DanZero V1T MLP 964-dim hand-crafted
DanLM V1 causal Transformer raw info tokenization (~90 vocab)

DanZero V1T is our enhanced reproduction of DanZero with bug fixes, stronger state representation, and fully model-driven action selection (original DanZero uses heuristics for tribute). It exceeds the original DanZero's performance under the same training settings, serving as a stronger hand-crafted baseline.

Metrics

  • Single-Round Win Rate: One round with a random level, random tribute/back, and random deal. The team whose player finishes first wins. This is a stricter metric with less variance amplification.
  • Whole-Game Win Rate: A complete game consisting of multiple rounds with level progression from 2 to A. This is the metric used in the original DanZero paper (Table I). Single-round advantages are amplified over multiple rounds (e.g., 54% single-round → ~66% whole-game).

Baseline Bots

We include 16 competition bots from the 1st National GuanDan AI Algorithm Competition (首届中国人工智能掼蛋算法大赛) as standardized evaluation baselines, consistent with the DanZero paper's evaluation protocol.

Key finding: competition rankings do NOT reflect actual bot strength. Many bots have critical bugs in their source code that caused crashes during the competition. For example, njupt-guandan-ai only won a consolation prize due to a None-check bug causing ~49% of games to crash — but it is actually the strongest bot after the fix. We fixed all bots one-by-one and evaluate against their bug-free versions.

See baselines/ for the full bots collection.

Results

Evaluation results of DanLM against DanZero, DanZero V1T, and the 5 strongest baseline bots: njupt-guandan-ai, chick-squad, guanglan-iot, egg-expert, and egg-pancake.

Random single-round win rate (1000 rounds, seed=42):

DanZero DanZero V1T njupt-guandan-ai chick-squad guanglan-iot egg-expert egg-pancake
DanZero - - 71.4% 74.6% 77.3% 78.8% 79.1%
DanZero V1T 62.1% - 77.8% 80.8% 81.8% 83.2% 85.9%
DanLM 65.1% 59.6% 81.9% 82.1% 80.9% 82.3% 85.9%

Whole-game win rate (1000 games, seed=42):

DanZero DanZero V1T njupt-guandan-ai chick-squad guanglan-iot egg-expert egg-pancake
DanZero - - 95.5% 98.9% 98.6% 99.1% 99.0%
DanZero V1T 87.5% - 99.1% 99.9% 100.0% 100.0% 99.8%
DanLM 97.5% 74.9% 100.0% 100.0% 99.8% 99.9% 99.9%

Quick Start to Reproduce the Results

Requirements

  • Python 3.12
  • macOS ARM64 (Apple Silicon)

Install

pip install torch numpy onnxruntime
pip install fastapi uvicorn  # for UI

Evaluate

# DanLM vs random
PYTHONPATH=. python scripts/evaluate.py \
    --model ckpts/DanLM_v1/dansformer_v1_best_eval.pt \
    --games 100

# DanLM vs DanZero V1T (hand-crafted SOTA)
PYTHONPATH=. python scripts/evaluate.py \
    --model ckpts/DanLM_v1/dansformer_v1_best_eval.pt \
    --model-b ckpts/DanZero_v3_rep_v1t/v3_rep_v1t_best_eval_001_int8.onnx \
    --games 500

# DanLM vs baseline bot
PYTHONPATH=. python scripts/evaluate.py \
    --model ckpts/DanLM_v1/dansformer_v1_best_eval.pt \
    --model-b bot:fin-njupt-guandan-ai \
    --games 500

# Whole-game evaluation
PYTHONPATH=. python scripts/evaluate_game.py \
    --model ckpts/DanLM_v1/dansformer_v1_best_eval.pt \
    --model-b ckpts/DanZero_v3_rep_v1t/v3_rep_v1t_best_eval_001_int8.onnx \
    --games 100

Web UI - Enjoy playing with the AI yourself!

Game UI

Play GuanDan against the AI in your browser with built-in AI Hint support showing Q-value estimates for each legal play.

PYTHONPATH=. python ui/server.py
# Open http://localhost:8000

Choose from 3 AI agents:

  • DanZero V0 — MLP baseline
  • DanZero V1T — Hand-crafted feature SOTA
  • DanLM V1 — Feature-free TinyLM agent (ours)

License

Apache License 2.0 with additional non-commercial restriction. See LICENSE for details.

Free for academic research and personal use. Commercial use requires written permission from the author.

References

  • DanZero: Lu et al., "DanZero: Mastering GuanDan Game with Reinforcement Learning", AAAI 2023. [paper]
  • DouZero: Zha et al., "DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning", ICML 2021. [paper] [code]
  • PerfectDou: Yang et al., "PerfectDou: Dominating DouDizhu with Perfect Information Distillation", NeurIPS 2022. [paper]
  • Suphx: Li et al., "Suphx: Mastering Mahjong with Deep Reinforcement Learning", 2020. [paper]

Citation

If you use this work, please cite this repository.