Add evaluation harness #37

Draft

fcogidi wants to merge 1 commit into main from fc/add_eval_harness

Conversation

@fcogidi fcogidi commented Feb 4, 2026

Summary

Add an evaluation harness. This includes a wrapper around Langfuse's dataset.run_experiment for running evaluators/graders on Langfuse datasets, as well as a method for running trace-based evaluations.

ClickUp Ticket(s): N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • Added run_experiment, a thin wrapper around Langfuse's dataset.run_experiment that fetches the dataset by name before running the experiment.
  • Added run_trace_evaluations, a second-pass evaluation that takes the output of run_experiment, fetches the corresponding traces, and runs trace evaluators.
  • Added run_experiment_with_trace_evals, which combines run_experiment and run_trace_evaluations into a single workflow.
  • Added a run_coroutine_sync utility for running an async coroutine synchronously from a Python script or notebook (see the usage sketch after this list).
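
How these pieces fit together is easiest to see in a short example. The sketch below is hypothetical: the import paths are inferred from the file list in the review further down, the combined helper is assumed to be async (hence run_coroutine_sync), and every parameter name and signature is illustrative rather than taken from the diff.

# Hypothetical usage sketch; the names and signatures below are assumptions, not the diff.
from aieng.agent_evals.async_utils import run_coroutine_sync
from aieng.agent_evals.evaluation import run_experiment_with_trace_evals


async def answer_question(*, item, **kwargs):
    """Toy task: produce an output for one dataset item (stand-in for the real agent)."""
    return f"echo: {item.input}"


def exact_match(*, output, expected_output, **kwargs):
    """Toy item-level evaluator/grader."""
    return {"name": "exact_match", "value": float(output == expected_output)}


# Combine the first-pass experiment run with the second-pass trace evaluations,
# driven synchronously from a script or notebook via run_coroutine_sync.
result = run_coroutine_sync(
    run_experiment_with_trace_evals(
        dataset_name="qa-smoke-test",  # dataset is fetched by name before the run
        run_name="eval-harness-demo",
        task=answer_question,
        evaluators=[exact_match],      # per-item evaluators
        trace_evaluators=[],           # second-pass, trace-based evaluators
    )
)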

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check src_dir/)
  • Manual testing performed (describe below)

Manual testing details:
N/A

Screenshots/Recordings

N/A

Related Issues

N/A

Deployment Notes

N/A

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

@fcogidi fcogidi requested a review from Copilot February 4, 2026 22:01
@fcogidi fcogidi self-assigned this Feb 4, 2026
@fcogidi fcogidi added the enhancement New feature or request label Feb 4, 2026

Copilot AI left a comment

Pull request overview

This PR adds an evaluation harness for running experiments and trace-based evaluations on Langfuse datasets. It provides a wrapper around Langfuse's dataset.run_experiment with additional trace evaluation capabilities.

Changes:

  • Added run_experiment wrapper to fetch datasets by name and run experiments
  • Added run_trace_evaluations for second-pass trace-based evaluation
  • Added run_coroutine_sync utility to execute async code synchronously
  • Introduced type definitions for trace evaluations, metrics, and results

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • aieng-eval-agents/aieng/agent_evals/evaluation/types.py: Defines types for the evaluation harness, including protocols, dataclasses, and enums
  • aieng-eval-agents/aieng/agent_evals/evaluation/trace.py: Implements trace-based evaluation with retry logic and metric extraction
  • aieng-eval-agents/aieng/agent_evals/evaluation/experiment.py: Provides experiment wrapper functions for running evaluations
  • aieng-eval-agents/aieng/agent_evals/evaluation/__init__.py: Package initialization exporting the public API
  • aieng-eval-agents/aieng/agent_evals/async_utils.py: Added the run_coroutine_sync utility and refactored the progress display
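
The retry logic mentioned for trace.py is not spelled out in this summary. The general pattern, polling Langfuse until a trace looks fully ingested before running trace evaluators on it, might look roughly like the sketch below; the function name, parameters, and defaults are illustrative, not the module's actual API.

# Illustrative retry pattern; not the actual implementation in trace.py.
import time
from typing import Callable, Optional

# Type used in the PR's _trace_ready signature; exact import path may vary by SDK version.
from langfuse.api import TraceWithFullDetails


def fetch_trace_with_retry(
    trace_id: str,
    fetch_trace: Callable[[str], Optional[TraceWithFullDetails]],
    is_ready: Callable[[TraceWithFullDetails], bool],
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
) -> Optional[TraceWithFullDetails]:
    """Poll for a trace until the readiness heuristic passes, backing off between attempts."""
    for attempt in range(max_attempts):
        trace = fetch_trace(trace_id)
        if trace is not None and is_ready(trace):
            return trace
        time.sleep(base_delay_s * (2 ** attempt))  # wait longer after each miss
    return None

Some form of polling is needed because Langfuse ingests traces asynchronously, so a trace produced during the experiment run may not yet be complete when the second evaluation pass starts; the _trace_ready heuristic quoted below checks for populated input and output fields for exactly this reason.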

def _trace_ready(trace: TraceWithFullDetails) -> bool:
    """Check whether a trace has the required fields to evaluate."""
    # This is a heuristic. The input and output fields being populated is a strong
    # signal that the trace is complete, however, it not every field of the trace

Copilot AI Feb 4, 2026

Grammatical error: 'it not every field' should be 'not every field'.

Suggested change:
- # signal that the trace is complete, however, it not every field of the trace
+ # signal that the trace is complete; however, not every field of the trace

