Add BenchmarkEvaluator with basic precision/recall computation #1870

Muhammedswalihu wants to merge 4 commits into roboflow:develop
Conversation
Muhammed Swalihu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Hi @SkalskiP @onuralpszr — I've submitted this PR for the BenchmarkEvaluator (Issue #1778). Let me know if you'd like me to fix the
Hi @Muhammedswalihu, this seems like a really valuable feature!
Hi @soumik12345, thanks for the review! I'll go ahead and:

- Replace the placeholder logic in BenchmarkEvaluator with full precision/recall/mAP computation,
- Add a working demo example (maybe in a Colab notebook for clarity), and
- Improve the test coverage with more edge cases and per-class evaluation.

Let me know if there's anything specific you'd like to see included. Appreciate the opportunity — excited to take this further!
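For reference, the mAP piece typically reduces to an interpolated average-precision computation over a per-class precision/recall curve. A minimal sketch of the Pascal VOC all-point interpolation variant (the helper name `average_precision` is mine, not part of this PR):

```python
import numpy as np


def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """All-point interpolated AP; `recall` sorted ascending, `precision` aligned."""
    # Pad the curve so it spans recall 0..1.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles where recall actually changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP would then be the mean of this value across classes (and, COCO-style, across IoU thresholds).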
Hi @soumik12345, I've added a Colab-style demo notebook, BenchmarkEvaluator_Demo.ipynb! This should help users understand and adopt the module more easily. Let me know if you'd like me to polish or extend this notebook further!
soumik12345 left a comment
Hi @Muhammedswalihu, thanks for providing the PoC!
Please feel free to proceed with the actual implementation.
Also, there's no need to commit the notebook to supervision; you can just attach a Colab notebook in a comment when the PR is ready for review with the complete logic.
```python
# TODO: Add class alignment, matching using IoU
tp = len(self.predictions.xyxy)  # Placeholder
fp = 0
fn = len(self.ground_truth.xyxy) - tp
```
The logic here is incomplete; please add the correct logic to compute precision and recall.
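A possible shape for that logic, sketched as greedy per-class IoU matching; the standalone helpers and the 0.5 threshold default are assumptions, not the PR's code:

```python
import numpy as np


def box_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two xyxy boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred_xyxy, pred_class_id, gt_xyxy, gt_class_id, iou_threshold=0.5):
    """Greedy matching: each ground-truth box is consumed by at most one prediction."""
    matched, tp = set(), 0
    for p_box, p_cls in zip(pred_xyxy, pred_class_id):
        best_iou, best_idx = 0.0, None
        for i, (g_box, g_cls) in enumerate(zip(gt_xyxy, gt_class_id)):
            if i in matched or p_cls != g_cls:
                continue
            iou = box_iou(p_box, g_box)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
    fp = len(pred_xyxy) - tp
    fn = len(gt_xyxy) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For confidence-sensitive metrics such as mAP, predictions would additionally be sorted by confidence in descending order before matching.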
```python
from supervision.metrics.benchmark import BenchmarkEvaluator


def test_basic_precision_recall():
```
This too seems like a placeholder test; please proceed with the implementation and add comprehensive unit tests.
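For instance, a test built from hand-placed boxes whose expected TP/FP/FN counts are known up front; the `evaluate()` method name and the returned keys are assumptions about the eventual API:

```python
import numpy as np
import supervision as sv

from supervision.metrics.benchmark import BenchmarkEvaluator


def test_precision_recall_with_known_matches():
    ground_truth = sv.Detections(
        xyxy=np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=float),
        class_id=np.array([0, 0]),
    )
    # One exact match (TP), one far-away box (FP); the second
    # ground-truth box goes unmatched (FN).
    predictions = sv.Detections(
        xyxy=np.array([[0, 0, 10, 10], [50, 50, 60, 60]], dtype=float),
        class_id=np.array([0, 0]),
    )
    evaluator = BenchmarkEvaluator(ground_truth=ground_truth, predictions=predictions)
    result = evaluator.evaluate()  # assumed method name
    assert result["precision"] == 0.5  # 1 TP / 2 predictions
    assert result["recall"] == 0.5  # 1 TP / 2 ground truths
```

Edge cases worth covering: empty predictions, empty ground truth, class mismatches, and duplicate predictions over the same ground-truth box.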
Great initiative on the BenchmarkEvaluator! This addresses a crucial need for standardized evaluation metrics. I'd like to offer some technical guidance to help you complete the implementation effectively.

Key Implementation Recommendations:
Performance Considerations:
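On the performance side, for example, pairwise IoU can be computed fully vectorized rather than in nested Python loops; a minimal numpy sketch (supervision ships its own IoU utilities, which would be the preferred route in the actual implementation):

```python
import numpy as np


def box_iou_matrix(boxes_a: np.ndarray, boxes_b: np.ndarray) -> np.ndarray:
    """Pairwise IoU between (N, 4) and (M, 4) xyxy boxes, returned as (N, M)."""
    # Broadcast (N, 1, 2) against (1, M, 2) to get all intersection corners.
    top_left = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])
    bottom_right = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])
    wh = np.clip(bottom_right - top_left, a_min=0, a_max=None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return np.where(union > 0, inter / union, 0.0)
```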
This evaluator will be invaluable for the community's benchmarking needs. Happy to provide more specific implementation details if needed!

Best regards,
Summary
This PR introduces a utility class `BenchmarkEvaluator` in `supervision/metrics/benchmark.py` to support benchmarking object detection results across different datasets or models.

Features

- Accepts `Detections` objects for ground truth and prediction
- Unit tests in `tests/metrics/test_benchmark.py`

Motivation
Addresses Issue #1778: Improving object detection benchmarking process for unrelated datasets.
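For a quick sense of the intended API, a hypothetical usage sketch (the constructor keywords mirror the attribute names visible in the diff; the `evaluate()` method name is an assumption):

```python
import numpy as np
import supervision as sv

from supervision.metrics.benchmark import BenchmarkEvaluator

ground_truth = sv.Detections(
    xyxy=np.array([[10, 10, 50, 50]], dtype=float), class_id=np.array([0])
)
predictions = sv.Detections(
    xyxy=np.array([[12, 11, 49, 52]], dtype=float), class_id=np.array([0])
)

evaluator = BenchmarkEvaluator(ground_truth=ground_truth, predictions=predictions)
print(evaluator.evaluate())  # assumed output, e.g. {"precision": 1.0, "recall": 1.0}
```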
Let me know if you'd like me to extend this in future PRs with:
Thanks for the opportunity to contribute!