
Conversation

@hannw hannw commented Nov 20, 2025

  1. Created foundations for evaluating multiplayer games.
  2. Incorporated metrics including Elo, TrueSkill, heuristic metrics, and game-theoretic evaluation (see the sketch after this list).
  3. Added plotting utilities.
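
For readers unfamiliar with the rating side, here is a minimal sketch of the kind of update TrueSkill performs, using the standalone trueskill package (illustration only; this is not the PR's code):

import trueskill

# Both agents start at the default rating (mu=25, sigma=25/3).
alice, bob = trueskill.Rating(), trueskill.Rating()

# After alice beats bob, both ratings shift toward the observed outcome.
alice, bob = trueskill.rate_1vs1(alice, bob)
print(alice.mu, bob.mu)  # alice.mu rises above 25, bob.mu falls below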

games = get_games(input_dir)
df = get_score_role_df_from_games(games)
if output_path:
    df.to_csv(output_path)


df is a list, not a pandas DataFrame, so df.to_csv will not work.
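
A minimal sketch of a fix, assuming get_score_role_df_from_games returns a list of row records (only the pd.DataFrame conversion is new):

import pandas as pd

games = get_games(input_dir)
rows = get_score_role_df_from_games(games)  # currently returns a list
df = pd.DataFrame(rows)  # convert before calling DataFrame methods
if output_path:
    df.to_csv(output_path, index=False)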

day_vote_events.setdefault(day, [])
day_vote_events[day].append(json_data["data"])
if entry.data_type == "DayExileElectedDataEntry":
    json_data = json.loads(entry["json_str"])


Inconsistent with line 164; change entry["json_str"] to entry.json_str.
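
The suggested change in context (a sketch of the fix, not the final diff):

if entry.data_type == "DayExileElectedDataEntry":
    json_data = json.loads(entry.json_str)  # attribute access, consistent with line 164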


import pandas as pd

import json

@yy6linda yy6linda commented Nov 21, 2025


This import is redundant.


self._bootstrap_elo(num_samples=elo_samples)
self._bootstrap_openskill(num_samples=openskill_samples)
self._run_gte_evaluation(num_samples=gte_samples)


Log processing is done twice through iterate_voting_mini_game, here and at line 463; could this be consolidated to reduce redundancy?
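
One way to consolidate, sketched below; the changed signatures are assumptions, not the PR's API, and presume the three methods can accept a pre-parsed event list instead of re-running the log pass themselves:

# Parse the game logs once, then share the result across evaluators.
voting_events = list(iterate_voting_mini_game(games))

self._bootstrap_elo(voting_events, num_samples=elo_samples)
self._bootstrap_openskill(voting_events, num_samples=openskill_samples)
self._run_gte_evaluation(voting_events, num_samples=gte_samples)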

w_elos = [elos[a] for a in werewolf_agents]
if v_elos and w_elos:
    avg_v_elo = np.mean(v_elos)
    avg_w_elo = np.mean(w_elos)


I am concerned about comparing only average team Elos. If you pair a top-tier Werewolf model with a low-tier one, the average can still suggest a solid team, but in reality the low-tier model can drag the team down, and the top-tier model's Elo suffers unfairly.
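
One mitigation, sketched here, is to report the spread and the weakest member alongside the mean so a lopsided team is visible (this mirrors the snippet above and is not the PR's implementation):

import numpy as np

avg_w_elo = np.mean(w_elos)
min_w_elo = np.min(w_elos)  # the weakest member, most likely to drag the team
std_w_elo = np.std(w_elos)  # a large spread flags a lopsided pairing

TrueSkill-style team models are another option here, since they infer per-player ratings from team outcomes rather than averaging.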

irp, irp_std = stats.get_irp()
vss, vss_std = stats.get_vss()
print(" Voting Accuracy (Villager Team):")
print(f" IRP (Identification Precision): {irp:.2f} ± {irp_std * 1.96:.2f} (CI95) ({len(stats.irp_scores)} votes)")


IRP, KSR, and VSS are highly correlated; we should choose a primary metric for ranking.
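
A quick way to verify the correlation before picking a primary metric (a sketch; ksr_scores and vss_scores are hypothetical attributes, named by analogy with stats.irp_scores from the snippet above):

import pandas as pd

scores = pd.DataFrame({
    "irp": stats.irp_scores,
    "ksr": stats.ksr_scores,  # hypothetical attribute
    "vss": stats.vss_scores,  # hypothetical attribute
})
print(scores.corr(method="spearman"))  # rank correlation between the metrics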
