
Fix set comprehension in per_class_scorer causing wrong overall metrics#567

Open
Chessing234 wants to merge 1 commit into allenai:main from Chessing234:fix/per-class-scorer-set-comprehension

Conversation

@Chessing234

Summary

get_metric() in PerClassScorer computes overall precision/recall/F1 by summing per-label TP, FP, and FN counts. The comprehensions use {v for ...} (set) instead of [v for ...] (list), which deduplicates values before summing.

Bug: If two entity types share the same count (e.g., DISEASE: 3, CHEMICAL: 3), the set {3} collapses them into a single 3, and the sum is 3 instead of the correct 6. This silently produces incorrect overall metrics whenever any two labels happen to have equal counts.
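A minimal standalone sketch of the failure mode (the dict below is illustrative, not the scorer's actual internal state):

```python
# Per-label true-positive counts where two labels happen to be equal.
tp = {"DISEASE": 3, "CHEMICAL": 3}

# Old code: a set comprehension deduplicates equal counts before summing.
buggy_total = sum({v for v in tp.values()})  # set collapses to {3}, sum is 3

# Fixed code: a list comprehension keeps one entry per label.
fixed_total = sum([v for v in tp.values()])  # [3, 3], sum is 6
```
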

Fix: Replace the set comprehensions with list comprehensions on lines 72, 75, and 78 (changing { to [ and } to ]).

Why the existing test doesn't catch this: The test in test_per_class_scorer.py uses a scenario where every per-label count is 0 or 1 — all values are already distinct, so deduplication has no effect.

Test plan

  • Verify existing tests pass
  • Manually confirm: with TP = {A: 3, B: 3}, the old code sums to 3 (wrong), the new code sums to 6 (correct)

🤖 Generated with Claude Code

The overall precision/recall/F1 computation uses set comprehensions
({v for ...}) instead of list comprehensions ([v for ...]). Sets
deduplicate values, so if two entity types share the same TP/FP/FN
count, only one copy is summed — silently producing incorrect metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
