Add evaluation harness #37

Draft

fcogidi wants to merge 1 commit into main from fc/add_eval_harness

Conversation

@fcogidi fcogidi commented Feb 4, 2026

Summary

Add an evaluation harness. This includes a wrapper around Langfuse's dataset.run_experiment for running evaluators/graders on Langfuse datasets, as well as a method for running trace-based evaluations.

ClickUp Ticket(s): N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • Added run_experiment, a thin wrapper around Langfuse's dataset.run_experiment that fetches the dataset by name before running the experiment.
  • Added run_trace_evaluations, a second-pass evaluation that takes the output of run_experiment, fetches the corresponding traces, and runs trace evaluators.
  • Added run_experiment_with_trace_evals, which combines run_experiment and run_trace_evaluations into a single workflow.
  • Added a run_coroutine_sync utility for running an async coroutine synchronously from a Python script or notebook (see the usage sketch after this list).
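
How these pieces fit together is easiest to see in a short example. The sketch below is hypothetical: the import paths are inferred from the file list in the review further down, the combined helper is assumed to be async (hence run_coroutine_sync), and every parameter name and signature is illustrative rather than taken from the diff.

# Hypothetical usage sketch; the names and signatures below are assumptions, not the diff.
from aieng.agent_evals.async_utils import run_coroutine_sync
from aieng.agent_evals.evaluation import run_experiment_with_trace_evals


async def answer_question(*, item, **kwargs):
    """Toy task: produce an output for one dataset item (stand-in for the real agent)."""
    return f"echo: {item.input}"


def exact_match(*, output, expected_output, **kwargs):
    """Toy item-level evaluator/grader."""
    return {"name": "exact_match", "value": float(output == expected_output)}


# Combine the first-pass experiment run with the second-pass trace evaluations,
# driven synchronously from a script or notebook via run_coroutine_sync.
result = run_coroutine_sync(
    run_experiment_with_trace_evals(
        dataset_name="qa-smoke-test",  # dataset is fetched by name before the run
        run_name="eval-harness-demo",
        task=answer_question,
        evaluators=[exact_match],      # per-item evaluators
        trace_evaluators=[],           # second-pass, trace-based evaluators
    )
)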

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check src_dir/)
  • Manual testing performed (describe below)

Manual testing details:
N/A

Screenshots/Recordings

N/A

Related Issues

N/A

Deployment Notes

N/A

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

@fcogidi fcogidi requested a review from Copilot February 4, 2026 22:01
@fcogidi fcogidi self-assigned this Feb 4, 2026
@fcogidi fcogidi added the enhancement New feature or request label Feb 4, 2026

Copilot AI left a comment

Pull request overview

This PR adds an evaluation harness for running experiments and trace-based evaluations on Langfuse datasets. It provides a wrapper around Langfuse's dataset.run_experiment with additional trace evaluation capabilities.

Changes:

  • Added run_experiment wrapper to fetch datasets by name and run experiments
  • Added run_trace_evaluations for second-pass trace-based evaluation
  • Added run_coroutine_sync utility to execute async code synchronously
  • Introduced type definitions for trace evaluations, metrics, and results

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • aieng-eval-agents/aieng/agent_evals/evaluation/types.py: Defines types for the evaluation harness, including protocols, dataclasses, and enums
  • aieng-eval-agents/aieng/agent_evals/evaluation/trace.py: Implements trace-based evaluation with retry logic and metric extraction
  • aieng-eval-agents/aieng/agent_evals/evaluation/experiment.py: Provides experiment wrapper functions for running evaluations
  • aieng-eval-agents/aieng/agent_evals/evaluation/__init__.py: Package initialization exporting the public API
  • aieng-eval-agents/aieng/agent_evals/async_utils.py: Added the run_coroutine_sync utility and refactored the progress display
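
The retry logic mentioned for trace.py is not spelled out in this summary. The general pattern, polling Langfuse until a trace looks fully ingested before running trace evaluators on it, might look roughly like the sketch below; the function name, parameters, and defaults are illustrative, not the module's actual API.

# Illustrative retry pattern; not the actual implementation in trace.py.
import time
from typing import Callable, Optional

# Type used in the PR's _trace_ready signature; exact import path may vary by SDK version.
from langfuse.api import TraceWithFullDetails


def fetch_trace_with_retry(
    trace_id: str,
    fetch_trace: Callable[[str], Optional[TraceWithFullDetails]],
    is_ready: Callable[[TraceWithFullDetails], bool],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
) -> Optional[TraceWithFullDetails]:
    """Poll for a trace until the readiness heuristic passes, backing off between attempts."""
    for attempt in range(max_attempts):
        trace = fetch_trace(trace_id)
        if trace is not None and is_ready(trace):
            return trace
        time.sleep(base_delay_s * (2 ** attempt))  # wait longer after each miss
    return None

Some form of polling is needed because Langfuse ingests traces asynchronously, so a trace produced during the experiment run may not yet be complete when the second evaluation pass starts; the _trace_ready heuristic quoted below checks for populated input and output fields for exactly this reason.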

def _trace_ready(trace: TraceWithFullDetails) -> bool:
    """Check whether a trace has the required fields to evaluate."""
    # This is a heuristic. The input and output fields being populated is a strong
    # signal that the trace is complete, however, it not every field of the trace

Copilot AI Feb 4, 2026

Grammatical error: 'it not every field' should be 'not every field'.

Suggested change:
- # signal that the trace is complete, however, it not every field of the trace
+ # signal that the trace is complete; however, not every field of the trace

