Werewolf Eval #565
Conversation
hannw commented Nov 20, 2025
- Created foundations for evaluating multiplayer games.
- Incorporated metrics including Elo, TrueSkill, heuristic metrics, and a game-theoretic eval (GTE).
- Added plotting utilities.
Follow-up changes (see the sketch below for items 1 and 2):
1. Report the standard error instead of the standard deviation.
2. Apply Bessel's correction to the standard-deviation calculation.
3. Use the standard error for the GTE variance.
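As a reference for items 1 and 2, a minimal sketch of the intended computation (the helper name is hypothetical, not from this PR):

    import numpy as np

    def mean_and_standard_error(scores):
        """Mean and standard error of a list of per-game scores.

        ddof=1 applies Bessel's correction (divide by n - 1); dividing the
        sample standard deviation by sqrt(n) turns it into the standard
        error of the mean, which is what the report should quote.
        """
        scores = np.asarray(scores, dtype=float)
        n = len(scores)
        sample_std = np.std(scores, ddof=1)  # Bessel-corrected std dev
        return scores.mean(), sample_std / np.sqrt(n)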
    games = get_games(input_dir)
    df = get_score_role_df_from_games(games)
    if output_path:
        df.to_csv(output_path)
df is a list, not a pandas DataFrame, so df.to_csv will not work.
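A minimal sketch of one possible fix, assuming the helper returns a list of row dicts (its exact return shape is not visible in this hunk):

    import pandas as pd

    games = get_games(input_dir)
    rows = get_score_role_df_from_games(games)  # currently a list, not a DataFrame
    df = pd.DataFrame(rows)                     # convert before writing
    if output_path:
        df.to_csv(output_path, index=False)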
    day_vote_events.setdefault(day, [])
    day_vote_events[day].append(json_data["data"])
    if entry.data_type == "DayExileElectedDataEntry":
        json_data = json.loads(entry["json_str"])
This is inconsistent with line 164; change entry["json_str"] to entry.json_str.
    import pandas as pd

    import json
This import is redundant.
    self._bootstrap_elo(num_samples=elo_samples)
    self._bootstrap_openskill(num_samples=openskill_samples)
    self._run_gte_evaluation(num_samples=gte_samples)
Log processing is done twice through iterate_voting_mini_game, here and again at line 463; could this be consolidated to reduce redundancy?
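One possible consolidation, sketched under the assumption that iterate_voting_mini_game is a generator over the parsed game logs (the cache attribute and self.games below are hypothetical, not from this PR):

    # Hypothetical consolidation: materialize the voting mini-game events
    # once and let the Elo / OpenSkill bootstraps and the GTE evaluation
    # read the cached list instead of re-running iterate_voting_mini_game
    # in two separate places.
    if self._voting_events is None:
        self._voting_events = list(iterate_voting_mini_game(self.games))

    self._bootstrap_elo(num_samples=elo_samples)
    self._bootstrap_openskill(num_samples=openskill_samples)
    self._run_gte_evaluation(num_samples=gte_samples)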
    w_elos = [elos[a] for a in werewolf_agents]
    if v_elos and w_elos:
        avg_v_elo = np.mean(v_elos)
        avg_w_elo = np.mean(w_elos)
I am concerned about comparing only the average team Elos. If you pair a top-tier werewolf with a low-tier model, their average can still suggest a solid team, but in reality the low-tier model can drag the team down and the Elo of the top-tier model suffers unfairly.
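To illustrate the concern: the Elo expectation curve is nonlinear, so a mixed (strong, weak) roster generally does not behave like a uniform roster with the same average rating. A purely illustrative sketch of one alternative aggregation, averaging pairwise win expectations instead of ratings (not something this PR implements):

    import numpy as np
    from itertools import product

    def elo_expectation(r_a, r_b):
        """Standard Elo expected score of a player rated r_a against r_b."""
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def team_win_expectation(team_a_elos, team_b_elos):
        """Average the pairwise expectations rather than the ratings.

        Because the expectation curve is nonlinear, spreading the same
        average rating across a strong and a weak member changes the team
        expectation (it drops against weaker opponents), so a low-tier
        teammate is no longer hidden behind a top-tier one.
        """
        pairs = product(team_a_elos, team_b_elos)
        return float(np.mean([elo_expectation(a, b) for a, b in pairs]))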
    irp, irp_std = stats.get_irp()
    vss, vss_std = stats.get_vss()
    print(" Voting Accuracy (Villager Team):")
    print(f" IRP (Identification Precision): {irp:.2f} ± {irp_std * 1.96:.2f} (CI95) ({len(stats.irp_scores)} votes)")
IRP, KSR, and VSS are highly correlated; we should choose a primary metric for ranking.
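To back the choice of a primary metric, one could check how strongly the per-agent scores actually agree. A minimal sketch, assuming the three metrics are available as per-agent dicts (the *_by_agent names are hypothetical):

    import pandas as pd

    # Hypothetical per-agent score dicts, e.g. {"agent_a": 0.72, ...}.
    metric_df = pd.DataFrame(
        {"irp": irp_by_agent, "ksr": ksr_by_agent, "vss": vss_by_agent}
    )

    # Spearman correlation is rank-based, so it directly tests whether the
    # three metrics would induce (nearly) the same leaderboard ordering.
    print(metric_df.corr(method="spearman"))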