Stop guessing which AI model to use. Build a data-driven model leaderboard.
With hundreds of models and configs, you need objective data to choose the right one for your use case. Eval Protocol (EP) helps you evaluate real traces, compare models, and visualize the results locally.
- Pytest authoring: `@evaluation_test` decorator to configure evaluations
- Robust rollouts: Handles flaky LLM APIs and parallel execution
- Integrations: Works with Langfuse, LangSmith, Braintrust, Responses API (rows can also be hand-built; see the sketch after this list)
- Agent support: LangGraph and Pydantic AI
- MCP RL envs: Build reinforcement learning environments with MCP
- Built-in benchmarks: AIME, tau-bench
- LLM judge: Stack-rank models using pairwise Arena-Hard-Auto
- Local UI: Pivot/table views for real-time analysis
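Rows do not have to come from a tracing platform. The sketch below hand-builds rows from static prompts; such a generator could be passed to `DynamicDataLoader` in place of the Langfuse generator shown in the quickstart further down. The `Message` import and the `messages=` constructor argument are assumptions about the current API, so check the docs for the exact names.

```python
# A hand-built data source: no tracing platform required.
# Assumption: eval_protocol exposes a Message type and EvaluationRow
# accepts a list of messages -- verify the exact names in the docs.
from eval_protocol import EvaluationRow, Message


def static_data_generator() -> list[EvaluationRow]:
    prompts = [
        "Summarize the plot of Hamlet in two sentences.",
        "Explain what a mutex is to a junior engineer.",
    ]
    # One row per prompt; the rollout processor fills in the model's reply.
    return [
        EvaluationRow(messages=[Message(role="user", content=p)])
        for p in prompts
    ]
```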
Install with your tracing platform extras and set API keys:
```bash
pip install 'eval-protocol[langfuse]'

# Model API keys (set what you need)
export OPENAI_API_KEY=...
export FIREWORKS_API_KEY=...
export GEMINI_API_KEY=...

# Platform keys
export LANGFUSE_PUBLIC_KEY=...
export LANGFUSE_SECRET_KEY=...
export LANGFUSE_HOST=https://your-deployment.com  # optional
```

Minimal evaluation using the built-in AHA judge:
```python
from datetime import datetime

import pytest

from eval_protocol import (
    evaluation_test,
    aha_judge,
    EvaluationRow,
    SingleTurnRolloutProcessor,
    DynamicDataLoader,
    create_langfuse_adapter,
)


# Pull recent traces from Langfuse and sample a few rows to evaluate.
def langfuse_data_generator() -> list[EvaluationRow]:
    adapter = create_langfuse_adapter()
    return adapter.get_evaluation_rows(
        to_timestamp=datetime.utcnow(),
        limit=20,
        sample_size=5,
    )


# Run the same evaluation against each model to build the comparison.
@pytest.mark.parametrize(
    "completion_params",
    [
        {"model": "openai/gpt-4.1"},
        {"model": "fireworks_ai/accounts/fireworks/models/gpt-oss-120b"},
    ],
)
@evaluation_test(
    data_loaders=DynamicDataLoader(generators=[langfuse_data_generator]),
    rollout_processor=SingleTurnRolloutProcessor(),
)
async def test_llm_judge(row: EvaluationRow) -> EvaluationRow:
    # Score each row with the built-in pairwise Arena-Hard-Auto (AHA) judge.
    return await aha_judge(row)
```

Run it:
```bash
pytest -q -s
```

The pytest output includes local links to a leaderboard and to row-level traces (pivot/table views) at http://localhost:8000.
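`aha_judge` does the scoring in the quickstart, but the test body can compute any metric itself. Below is a minimal sketch of a hand-written scorer reusing `langfuse_data_generator` from above; the `EvaluateResult` import and the `row.messages` / `row.evaluation_result` fields are assumptions about the current API, so adjust to match the docs.

```python
# A hand-written scorer in place of the built-in AHA judge.
# Assumptions: EvaluateResult is importable from eval_protocol, and rows
# expose .messages (last entry = assistant reply) and .evaluation_result.
import pytest

from eval_protocol import (
    evaluation_test,
    EvaluationRow,
    EvaluateResult,
    SingleTurnRolloutProcessor,
    DynamicDataLoader,
)


@pytest.mark.parametrize("completion_params", [{"model": "openai/gpt-4.1"}])
@evaluation_test(
    data_loaders=DynamicDataLoader(generators=[langfuse_data_generator]),
    rollout_processor=SingleTurnRolloutProcessor(),
)
async def test_concise_reply(row: EvaluationRow) -> EvaluationRow:
    # Score 1.0 when the assistant reply stays within a 300-word budget.
    reply = row.messages[-1].content or ""
    within_budget = len(reply.split()) <= 300
    row.evaluation_result = EvaluateResult(
        score=1.0 if within_budget else 0.0,
        reason="reply within 300 words" if within_budget else "reply too long",
    )
    return row
```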
This library requires Python >= 3.10.
Install with pip:

```bash
pip install eval-protocol
```

Or with uv:

```bash
# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Add to your project
uv add eval-protocol
```

- Documentation – Guides and API reference
- Discord – Community
- GitHub – Source and examples