Werewolf Eval #565
Conversation
hannw commented Nov 20, 2025
- Created foundations for evaluating multiplayer games.
- Incorporated metrics including Elo, TrueSkill, heuristic metrics, and a game-theoretic eval (GTE).
- Added plotting utilities.
Follow-up changes (see the sketch below for items 1 and 2):
1. Report the standard error instead of the standard deviation.
2. Apply Bessel's correction to the standard-deviation calculation.
3. Use the standard error for the GTE variance.
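As a reference for items 1 and 2, a minimal sketch of the intended computation (the helper name is hypothetical, not from this PR):

    import numpy as np

    def mean_and_standard_error(scores):
        """Mean and standard error of a list of per-game scores.

        ddof=1 applies Bessel's correction (divide by n - 1); dividing the
        sample standard deviation by sqrt(n) turns it into the standard
        error of the mean, which is what the report should quote.
        """
        scores = np.asarray(scores, dtype=float)
        n = len(scores)
        sample_std = np.std(scores, ddof=1)  # Bessel-corrected std dev
        return scores.mean(), sample_std / np.sqrt(n)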
    games = get_games(input_dir)
    df = get_score_role_df_from_games(games)
    if output_path:
        df.to_csv(output_path)
df is a list, not a pandas DataFrame, so df.to_csv will not work.
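A minimal sketch of one possible fix, assuming the helper returns a list of row dicts (its exact return shape is not visible in this hunk):

    import pandas as pd

    games = get_games(input_dir)
    rows = get_score_role_df_from_games(games)  # currently a list, not a DataFrame
    df = pd.DataFrame(rows)                     # convert before writing
    if output_path:
        df.to_csv(output_path, index=False)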
    day_vote_events.setdefault(day, [])
    day_vote_events[day].append(json_data["data"])
    if entry.data_type == "DayExileElectedDataEntry":
        json_data = json.loads(entry["json_str"])
This is inconsistent with line 164; change entry["json_str"] to entry.json_str.
    import pandas as pd

    import json
This import is redundant.
    self._bootstrap_elo(num_samples=elo_samples)
    self._bootstrap_openskill(num_samples=openskill_samples)
    self._run_gte_evaluation(num_samples=gte_samples)
Log processing is done twice through iterate_voting_mini_game, here and again at line 463; could this be consolidated to reduce redundancy?
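One possible consolidation, sketched under the assumption that iterate_voting_mini_game is a generator over the parsed game logs (the cache attribute and self.games below are hypothetical, not from this PR):

    # Hypothetical consolidation: materialize the voting mini-game events
    # once and let the Elo / OpenSkill bootstraps and the GTE evaluation
    # read the cached list instead of re-running iterate_voting_mini_game
    # in two separate places.
    if self._voting_events is None:
        self._voting_events = list(iterate_voting_mini_game(self.games))

    self._bootstrap_elo(num_samples=elo_samples)
    self._bootstrap_openskill(num_samples=openskill_samples)
    self._run_gte_evaluation(num_samples=gte_samples)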
    w_elos = [elos[a] for a in werewolf_agents]
    if v_elos and w_elos:
        avg_v_elo = np.mean(v_elos)
        avg_w_elo = np.mean(w_elos)
I am concerned about comparing only the average team Elos. If you pair a top-tier werewolf with a low-tier model, their average can still suggest a solid team, but in reality the low-tier model can drag the team down and the Elo of the top-tier model suffers unfairly.
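To illustrate the concern: the Elo expectation curve is nonlinear, so a mixed (strong, weak) roster generally does not behave like a uniform roster with the same average rating. A purely illustrative sketch of one alternative aggregation, averaging pairwise win expectations instead of ratings (not something this PR implements):

    import numpy as np
    from itertools import product

    def elo_expectation(r_a, r_b):
        """Standard Elo expected score of a player rated r_a against r_b."""
        return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

    def team_win_expectation(team_a_elos, team_b_elos):
        """Average the pairwise expectations rather than the ratings.

        Because the expectation curve is nonlinear, spreading the same
        average rating across a strong and a weak member changes the team
        expectation (it drops against weaker opponents), so a low-tier
        teammate is no longer hidden behind a top-tier one.
        """
        pairs = product(team_a_elos, team_b_elos)
        return float(np.mean([elo_expectation(a, b) for a, b in pairs]))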
    irp, irp_std = stats.get_irp()
    vss, vss_std = stats.get_vss()
    print(" Voting Accuracy (Villager Team):")
    print(f" IRP (Identification Precision): {irp:.2f} ± {irp_std * 1.96:.2f} (CI95) ({len(stats.irp_scores)} votes)")
IRP, KSR, and VSS are highly correlated; we should choose a primary metric for ranking.
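To back the choice of a primary metric, one could check how strongly the per-agent scores actually agree. A minimal sketch, assuming the three metrics are available as per-agent dicts (the *_by_agent names are hypothetical):

    import pandas as pd

    # Hypothetical per-agent score dicts, e.g. {"agent_a": 0.72, ...}.
    metric_df = pd.DataFrame(
        {"irp": irp_by_agent, "ksr": ksr_by_agent, "vss": vss_by_agent}
    )

    # Spearman correlation is rank-based, so it directly tests whether the
    # three metrics would induce (nearly) the same leaderboard ordering.
    print(metric_df.corr(method="spearman"))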