|
1 | | -# EvalView — Pytest-style Testing for AI Agents |
| 1 | +# EvalView — Catch Agent Regressions Before You Ship |
2 | 2 |
|
3 | | -> An open-source testing framework for AI agents, with adapters for LangGraph, CrewAI, OpenAI Assistants, and Anthropic Claude. |
| 3 | +> Your agent worked yesterday. Today it's broken. What changed? |
4 | 4 |
|
5 | | -**EvalView** is pytest for AI agents—write readable test cases, run them in CI/CD, and block deploys when behavior, cost, or latency regresses. |
| 5 | +**EvalView catches agent regressions** — tool changes, output changes, cost spikes, and latency spikes — before they hit production. |
| 6 | + |
| 7 | +```bash |
| 8 | +evalview run --diff # Compare against golden baseline, block on regression |
| 9 | +``` |
6 | 10 |
|
7 | 11 | [](https://github.com/hidai25/eval-view/actions/workflows/ci.yml) |
8 | 12 | [](https://www.python.org/downloads/) |
|
26 | 30 |
|
27 | 31 | --- |
28 | 32 |
|
| 33 | +## The Problem |
| 34 | + |
| 35 | +You changed a prompt. Or swapped models. Or updated a tool. |
| 36 | + |
| 37 | +Now your agent: |
| 38 | +- ❌ Calls different tools than before |
| 39 | +- ❌ Returns different outputs for the same input |
| 40 | +- ❌ Costs 3x more than yesterday |
| 41 | +- ❌ Takes 5 seconds instead of 500ms |
| 42 | + |
| 43 | +You don't find out until users complain. |
| 44 | + |
| 45 | +## The Solution |
| 46 | + |
| 47 | +**EvalView detects these regressions in CI — before you deploy.** |
| 48 | + |
| 49 | +```bash |
| 50 | +# Save a working run as your baseline |
| 51 | +evalview golden save .evalview/results/xxx.json |
| 52 | + |
| 53 | +# Every future run compares against it |
| 54 | +evalview run --diff # Fails on REGRESSION |
| 55 | +``` |
| 56 | + |
| 57 | +--- |
| 58 | + |
29 | 59 | **Who is EvalView for?** |
30 | | -- Solo devs & small teams shipping agents to production |
31 | | -- Teams already using LangGraph / CrewAI / custom tools |
32 | | -- People who want *failing tests* in CI, not just dashboards |
| 60 | + |
| 61 | +Builders shipping tool-using agents whose behavior keeps breaking when they change prompts, models, or tools. |
| 62 | + |
| 63 | +- You're iterating fast on prompts and models |
| 64 | +- You've broken your agent more than once after "just a small change" |
| 65 | +- You want CI to catch regressions, not your users |
33 | 66 |
|
34 | 67 | Already using LangSmith, Langfuse, or other tracing? |
35 | | -Use them to *see* what happened. Use EvalView to **block bad behavior in CI before it hits prod.** |
| 68 | +Use them to *see* what happened. Use EvalView to **block bad behavior before it ships.** |
36 | 69 |
|
37 | 70 | > **Your Claude Code skills might be broken.** Claude silently ignores skills that exceed its [15k char budget](https://blog.fsck.com/2025/12/17/claude-code-skills-not-triggering/). [Check yours →](#skills-testing-claude-code--openai-codex) |
38 | 71 |
|
39 | 72 | --- |
40 | 73 |
|
| 74 | +## What EvalView Catches |
| 75 | + |
| 76 | +| Regression Type | What It Means | Status | |
| 77 | +|-----------------|---------------|--------| |
| 78 | +| **REGRESSION** | Score dropped — agent got worse | 🔴 Fix before deploy | |
| 79 | +| **TOOLS_CHANGED** | Agent uses different tools now | 🟡 Review before deploy | |
| 80 | +| **OUTPUT_CHANGED** | Same tools, different response | 🟡 Review before deploy | |
| 81 | +| **PASSED** | Matches baseline | 🟢 Ship it | |
| 82 | + |
| 83 | +EvalView runs in CI. When it detects a regression, your deploy fails. You fix it before users see it. |
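
A CI gate can be a single job. The sketch below is a hypothetical GitHub Actions workflow — the `pip install evalview` package name and step layout are assumptions; only the `evalview run --diff` command comes from this README:

```yaml
# Minimal sketch of a CI regression gate (package name and layout are assumptions)
name: agent-regression
on: [pull_request]
jobs:
  evalview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install evalview   # assumption: published under this name
      - run: evalview run --diff    # non-zero exit on REGRESSION fails the PR
```

Because the build fails on a non-zero exit code, a REGRESSION blocks the merge without any extra scripting.
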
| 84 | + |
| 85 | +--- |
| 86 | + |
41 | 87 | ## What is EvalView? |
42 | 88 |
|
43 | | -EvalView is a **testing framework for AI agents**. |
| 89 | +EvalView is a **regression testing framework for AI agents**. |
44 | 90 |
|
45 | 91 | It lets you: |
46 | 92 |
|
47 | | -- **Write tests in YAML** that describe inputs, expected tools, and acceptance thresholds |
48 | | -- **Turn real conversations into regression suites** (record → generate tests → re-run on every change) |
49 | | -- **Gate deployments in CI** on behavior, tool calls, cost, and latency |
50 | | -- Plug into **LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HTTP agents**, and more |
| 93 | +- **Save golden baselines** — snapshot a working agent run |
| 94 | +- **Detect regressions automatically** — tool changes, output changes, cost spikes, latency spikes |
| 95 | +- **Block bad deploys in CI** — fail the build when behavior regresses |
| 96 | +- Plug into **LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, MCP servers**, and more |
51 | 97 |
|
52 | | -Think: _"pytest / Playwright mindset, but for multi-step agents and tool-calling workflows."_ |
| 98 | +Think: _"Regression testing for agents. Like screenshot testing, but for behavior."_ |
53 | 99 |
|
54 | 100 | > **Note:** LLM-as-judge evaluations are probabilistic. Results may vary between runs. Use [Statistical Mode](#statistical-mode-variance-testing) for reliable pass/fail decisions. |
55 | 101 |
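
Tests themselves are plain YAML. A hypothetical minimal case — the `name` and `input` field names are illustrative guesses, not EvalView's exact schema; only the `checks.hallucination` key appears elsewhere in this README:

```yaml
# Hypothetical test case — field names other than `checks` are assumptions
name: test-stock-analysis
input: "What did AAPL close at yesterday?"
checks:
  hallucination: true
```
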
|
@@ -108,13 +154,13 @@ checks: |
108 | 154 | hallucination: true |
109 | 155 | ``` |
110 | 156 |
|
111 | | -**Regression detection** — fail if behavior drifts from baseline: |
| 157 | +**Regression detection** — fail if behavior changes from baseline: |
112 | 158 | ```bash |
113 | 159 | # Save a good run as baseline |
114 | 160 | evalview golden save .evalview/results/xxx.json |
115 | 161 |
|
116 | 162 | # Future runs compare against it |
117 | | -evalview run --diff # Fails on REGRESSION status |
| 163 | +evalview run --diff # Fails on REGRESSION; flags TOOLS_CHANGED for review |
118 | 164 | ``` |
119 | 165 |
|
120 | 166 | --- |
@@ -356,24 +402,25 @@ evalview run --diff |
356 | 402 |
|
357 | 403 | When you run with `--diff`, EvalView compares every test against its golden baseline and flags: |
358 | 404 |
|
359 | | -| Status | What It Means | |
360 | | -|--------|---------------| |
361 | | -| **STABLE** | Matches baseline - no action needed | |
362 | | -| **CHANGED** | Tools changed but output similar - review | |
363 | | -| **DRIFT** | Output changed but score stable - investigate | |
364 | | -| **REGRESSION** | Score dropped significantly - fix before deploy | |
| 405 | +| Status | What It Means | Action | |
| 406 | +|--------|---------------|--------| |
| 407 | +| **PASSED** | Matches baseline | 🟢 Ship it | |
| 408 | +| **TOOLS_CHANGED** | Agent uses different tools | 🟡 Review before deploy | |
| 409 | +| **OUTPUT_CHANGED** | Same tools, different response | 🟡 Review before deploy | |
| 410 | +| **REGRESSION** | Score dropped significantly | 🔴 Fix before deploy | |
365 | 411 |
|
366 | 412 | ### Example Output |
367 | 413 |
|
368 | 414 | ``` |
369 | | -━━━ Regression Detection ━━━ |
| 415 | +━━━ Golden Diff Report ━━━ |
370 | 416 |
|
371 | | -✓ test-stock-analysis STABLE |
372 | | -⚠ test-customer-support DRIFT output similarity: 78% |
373 | | -✗ test-code-review REGRESSION score dropped 15 points |
| 417 | +✓ PASSED test-stock-analysis |
| 418 | +⚠ TOOLS_CHANGED test-customer-support added: web_search |
| 419 | +~ OUTPUT_CHANGED test-summarizer similarity: 78% |
| 420 | +✗ REGRESSION test-code-review score dropped 15 points |
374 | 421 |
|
375 | | -Regressions detected: 1 |
376 | | -Drifts detected: 1 |
| 422 | +1 REGRESSION - fix before deploy |
| 423 | +1 TOOLS_CHANGED - review before deploy |
377 | 424 | ``` |
378 | 425 |
|
379 | 426 | ### Golden Commands |
@@ -1171,9 +1218,10 @@ If EvalView caught a regression, saved you debugging time, or kept your agent co |
1171 | 1218 | - [x] Tool categories for flexible matching |
1172 | 1219 | - [x] Multi-run flakiness detection |
1173 | 1220 | - [x] Skills testing (Claude Code, OpenAI Codex) |
| 1221 | +- [x] MCP server testing (`adapter: mcp`) |
| 1222 | +- [x] HTML diff reports (`--diff-report`) |
1174 | 1223 |
|
1175 | 1224 | **Coming Soon:** |
1176 | | -- [ ] MCP server testing |
1177 | 1225 | - [ ] Multi-turn conversation testing |
1178 | 1226 | - [ ] Grounded hallucination checking |
1179 | 1227 | - [ ] LLM-as-judge for skill guideline compliance |
|