Commit 6d85cb8
refactor: sharpen positioning + rename diff statuses
- Update README: regression-focused positioning instead of generic "pytest for agents"
- Rename DiffStatus values: STABLE→PASSED, CHANGED→TOOLS_CHANGED, DRIFT→OUTPUT_CHANGED
- Update CLI help text to match new positioning
- Update HTML diff template CSS classes
1 parent a72673b commit 6d85cb8

File tree: 4 files changed, +160 −97 lines

README.md

Lines changed: 76 additions & 28 deletions
````diff
@@ -1,8 +1,12 @@
-# EvalView — Pytest-style Testing for AI Agents
+# EvalView — Catch Agent Regressions Before You Ship

-> An open-source testing framework for AI agents, with adapters for LangGraph, CrewAI, OpenAI Assistants, and Anthropic Claude.
+> Your agent worked yesterday. Today it's broken. What changed?

-**EvalView** is pytest for AI agents—write readable test cases, run them in CI/CD, and block deploys when behavior, cost, or latency regresses.
+**EvalView catches agent regressions** — tool changes, output changes, cost spikes, and latency spikes — before they hit production.
+
+```bash
+evalview run --diff  # Compare against golden baseline, block on regression
+```

 [![CI](https://github.com/hidai25/eval-view/actions/workflows/ci.yml/badge.svg)](https://github.com/hidai25/eval-view/actions/workflows/ci.yml)
 [![Python Version](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
````
````diff
@@ -26,30 +30,72 @@

 ---

+## The Problem
+
+You changed a prompt. Or swapped models. Or updated a tool.
+
+Now your agent:
+- ❌ Calls different tools than before
+- ❌ Returns different outputs for the same input
+- ❌ Costs 3x more than yesterday
+- ❌ Takes 5 seconds instead of 500ms
+
+You don't find out until users complain.
+
+## The Solution
+
+**EvalView detects these regressions in CI — before you deploy.**
+
+```bash
+# Save a working run as your baseline
+evalview golden save .evalview/results/xxx.json
+
+# Every future run compares against it
+evalview run --diff  # Fails on REGRESSION
+```
+
+---
+
 **Who is EvalView for?**
-- Solo devs & small teams shipping agents to production
-- Teams already using LangGraph / CrewAI / custom tools
-- People who want *failing tests* in CI, not just dashboards
+
+Builders shipping tool-using agents who keep breaking behavior when they change prompts, models, or tools.
+
+- You're iterating fast on prompts and models
+- You've broken your agent more than once after "just a small change"
+- You want CI to catch regressions, not your users

 Already using LangSmith, Langfuse, or other tracing?
-Use them to *see* what happened. Use EvalView to **block bad behavior in CI before it hits prod.**
+Use them to *see* what happened. Use EvalView to **block bad behavior before it ships.**

 > **Your Claude Code skills might be broken.** Claude silently ignores skills that exceed its [15k char budget](https://blog.fsck.com/2025/12/17/claude-code-skills-not-triggering/). [Check yours →](#skills-testing-claude-code--openai-codex)

 ---

+## What EvalView Catches
+
+| Regression Type | What It Means | Status |
+|-----------------|---------------|--------|
+| **REGRESSION** | Score dropped — agent got worse | 🔴 Fix before deploy |
+| **TOOLS_CHANGED** | Agent uses different tools now | 🟡 Review before deploy |
+| **OUTPUT_CHANGED** | Same tools, different response | 🟡 Review before deploy |
+| **PASSED** | Matches baseline | 🟢 Ship it |
+
+EvalView runs in CI. When it detects a regression, your deploy fails. You fix it before users see it.
+
+---
+
 ## What is EvalView?

-EvalView is a **testing framework for AI agents**.
+EvalView is a **regression testing framework for AI agents**.

 It lets you:

-- **Write tests in YAML** that describe inputs, expected tools, and acceptance thresholds
-- **Turn real conversations into regression suites** (record → generate tests → re-run on every change)
-- **Gate deployments in CI** on behavior, tool calls, cost, and latency
-- Plug into **LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HTTP agents**, and more
+- **Save golden baselines** — snapshot a working agent run
+- **Detect regressions automatically** — tool changes, output changes, cost spikes, latency spikes
+- **Block bad deploys in CI** — fail the build when behavior regresses
+- Plug into **LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, MCP servers**, and more

-Think: _"pytest / Playwright mindset, but for multi-step agents and tool-calling workflows."_
+Think: _"Regression testing for agents. Like screenshot testing, but for behavior."_

 > **Note:** LLM-as-judge evaluations are probabilistic. Results may vary between runs. Use [Statistical Mode](#statistical-mode-variance-testing) for reliable pass/fail decisions.
````
````diff
@@ -108,13 +154,13 @@ checks:
   hallucination: true
 ```

-**Regression detection** — fail if behavior drifts from baseline:
+**Regression detection** — fail if behavior changes from baseline:
 ```bash
 # Save a good run as baseline
 evalview golden save .evalview/results/xxx.json

 # Future runs compare against it
-evalview run --diff  # Fails on REGRESSION status
+evalview run --diff  # Fails on REGRESSION or TOOLS_CHANGED
 ```

 ---
````
````diff
@@ -356,24 +402,25 @@ evalview run --diff

 When you run with `--diff`, EvalView compares every test against its golden baseline and flags:

-| Status | What It Means |
-|--------|---------------|
-| **STABLE** | Matches baseline - no action needed |
-| **CHANGED** | Tools changed but output similar - review |
-| **DRIFT** | Output changed but score stable - investigate |
-| **REGRESSION** | Score dropped significantly - fix before deploy |
+| Status | What It Means | Action |
+|--------|---------------|--------|
+| **PASSED** | Matches baseline | 🟢 Ship it |
+| **TOOLS_CHANGED** | Agent uses different tools | 🟡 Review before deploy |
+| **OUTPUT_CHANGED** | Same tools, different response | 🟡 Review before deploy |
+| **REGRESSION** | Score dropped significantly | 🔴 Fix before deploy |

 ### Example Output

 ```
-━━━ Regression Detection ━━━
+━━━ Golden Diff Report ━━━

-✓ test-stock-analysis    STABLE
-⚠ test-customer-support  DRIFT       output similarity: 78%
-✗ test-code-review       REGRESSION  score dropped 15 points
+✓ PASSED          test-stock-analysis
+⚠ TOOLS_CHANGED   test-customer-support  added: web_search
+~ OUTPUT_CHANGED  test-summarizer        similarity: 78%
+✗ REGRESSION      test-code-review       score dropped 15 points

-Regressions detected: 1
-Drifts detected: 1
+1 REGRESSION - fix before deploy
+1 TOOLS_CHANGED - review before deploy
 ```

 ### Golden Commands
````
```diff
@@ -1171,9 +1218,10 @@ If EvalView caught a regression, saved you debugging time, or kept your agent co
 - [x] Tool categories for flexible matching
 - [x] Multi-run flakiness detection
 - [x] Skills testing (Claude Code, OpenAI Codex)
+- [x] MCP server testing (`adapter: mcp`)
+- [x] HTML diff reports (`--diff-report`)

 **Coming Soon:**
-- [ ] MCP server testing
 - [ ] Multi-turn conversation testing
 - [ ] Grounded hallucination checking
 - [ ] LLM-as-judge for skill guideline compliance
```

evalview/cli.py

Lines changed: 26 additions & 17 deletions
```diff
@@ -42,7 +42,16 @@
 @click.group()
 @click.version_option(version="0.1.7")
 def main():
-    """EvalView - Testing framework for multi-step AI agents."""
+    """EvalView - Catch agent regressions before you ship.
+
+    Detects tool changes, output changes, cost spikes, and latency spikes
+    by comparing against golden baselines.
+
+    Quick start:
+      evalview quickstart               # Try it in 2 minutes
+      evalview run --diff               # Compare against golden baseline
+      evalview golden save result.json  # Save a working run as baseline
+    """
     pass

```
```diff
@@ -1268,7 +1277,7 @@ async def _init_wizard_async(dir: str):
 @click.option(
     "--diff",
     is_flag=True,
-    help="Compare against golden traces and show regressions. Use 'evalview golden save' to create baselines.",
+    help="Compare against golden baselines. Shows REGRESSION/TOOLS_CHANGED/OUTPUT_CHANGED/PASSED status.",
 )
 @click.option(
     "--diff-report",
```
```diff
@@ -2422,16 +2431,16 @@ async def update_display():
     console.print("\n[bold cyan]━━━ Golden Diff Report ━━━[/bold cyan]\n")

     for test_name, trace_diff in diffs_found:
-        # Status-based display with proper terminology
+        # Status-based display with developer-friendly terminology
         status = trace_diff.overall_severity
         if status == DiffStatus.REGRESSION:
             icon = "[red]✗ REGRESSION[/red]"
-        elif status == DiffStatus.DRIFT:
-            icon = "[yellow]⚠ DRIFT[/yellow]"
-        elif status == DiffStatus.CHANGED:
-            icon = "[dim]~ CHANGED[/dim]"
+        elif status == DiffStatus.TOOLS_CHANGED:
+            icon = "[yellow]⚠ TOOLS_CHANGED[/yellow]"
+        elif status == DiffStatus.OUTPUT_CHANGED:
+            icon = "[dim]~ OUTPUT_CHANGED[/dim]"
         else:
-            icon = "[green]✓ STABLE[/green]"
+            icon = "[green]✓ PASSED[/green]"

         console.print(f"{icon} [bold]{test_name}[/bold]")
         console.print(f"  Summary: {trace_diff.summary()}")
```
```diff
@@ -2454,23 +2463,23 @@ async def update_display():

         console.print()

-    # Summary with proper terminology
+    # Summary with developer-friendly terminology
     regressions = sum(1 for _, d in diffs_found if d.overall_severity == DiffStatus.REGRESSION)
-    drifts = sum(1 for _, d in diffs_found if d.overall_severity == DiffStatus.DRIFT)
-    changes = sum(1 for _, d in diffs_found if d.overall_severity == DiffStatus.CHANGED)
+    tools_changed = sum(1 for _, d in diffs_found if d.overall_severity == DiffStatus.TOOLS_CHANGED)
+    output_changed = sum(1 for _, d in diffs_found if d.overall_severity == DiffStatus.OUTPUT_CHANGED)

     if regressions > 0:
-        console.print(f"[red]✗ {regressions} REGRESSION(s) detected! Score dropped - review before deploying.[/red]\n")
-    elif drifts > 0:
-        console.print(f"[yellow]⚠ {drifts} DRIFT(s) detected - output changed but score stable[/yellow]\n")
-    elif changes > 0:
-        console.print(f"[dim]~ {changes} minor change(s) - tools changed but output similar[/dim]\n")
+        console.print(f"[red]✗ {regressions} REGRESSION(s) - score dropped, fix before deploy[/red]\n")
+    elif tools_changed > 0:
+        console.print(f"[yellow]⚠ {tools_changed} TOOLS_CHANGED - agent behavior shifted, review before deploy[/yellow]\n")
+    elif output_changed > 0:
+        console.print(f"[dim]~ {output_changed} OUTPUT_CHANGED - response changed, review before deploy[/dim]\n")
     else:
         # Check if any golden traces exist
         goldens = store.list_golden()
         matched = sum(1 for g in goldens if any(r.test_case == g.test_name for r in results))
         if matched > 0:
-            console.print(f"[green]✓ STABLE - No differences from golden baseline ({matched} tests compared)[/green]\n")
+            console.print(f"[green]✓ PASSED - No differences from golden baseline ({matched} tests compared)[/green]\n")
         elif goldens:
             console.print("[yellow]No golden traces match these tests[/yellow]")
             console.print("[dim]Save one with: evalview golden save " + str(results_file) + "[/dim]\n")
```

evalview/core/diff.py

Lines changed: 34 additions & 30 deletions
```diff
@@ -19,12 +19,19 @@


 class DiffStatus(Enum):
-    """Status/category of differences found."""
+    """Status/category of differences found.

-    STABLE = "stable"  # No significant differences - matches baseline
-    CHANGED = "changed"  # Minor changes (tools changed but output similar)
-    DRIFT = "drift"  # Output changed but score stable (behavioral drift)
-    REGRESSION = "regression"  # Score dropped significantly - likely a bug
+    Four states with clear developer-friendly terminology:
+    - PASSED: Matches baseline, safe to ship
+    - TOOLS_CHANGED: Different tools used, behavior shifted
+    - OUTPUT_CHANGED: Same tools, different response
+    - REGRESSION: Score dropped, something got worse
+    """
+
+    PASSED = "passed"  # No significant differences - matches baseline
+    TOOLS_CHANGED = "tools_changed"  # Tools changed (agent behavior shifted)
+    OUTPUT_CHANGED = "output_changed"  # Output changed but score stable
+    REGRESSION = "regression"  # Score dropped significantly - likely a bug


 # Alias for backwards compatibility
```
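Because the enum *values* change along with the names (`"stable"` → `"passed"`, `"changed"` → `"tools_changed"`, `"drift"` → `"output_changed"`), any previously serialized status strings would no longer parse with `DiffStatus(raw)`. A hedged sketch of a loader shim — the old→new mapping is taken from this commit, but the helper name and the need for such a shim are assumptions, not part of the diff:

```python
from enum import Enum


class DiffStatus(Enum):
    # New values introduced by this commit
    PASSED = "passed"                  # was "stable"
    TOOLS_CHANGED = "tools_changed"    # was "changed"
    OUTPUT_CHANGED = "output_changed"  # was "drift"
    REGRESSION = "regression"          # unchanged

# Old serialized value -> new value, inferred from the rename in this commit
_LEGACY_VALUES = {"stable": "passed", "changed": "tools_changed", "drift": "output_changed"}

def parse_status(raw: str) -> DiffStatus:
    """Hypothetical helper: accept both pre- and post-rename status strings."""
    return DiffStatus(_LEGACY_VALUES.get(raw, raw))

print(parse_status("drift").name)  # OUTPUT_CHANGED
```

Whether old result files actually store the raw enum value depends on the serializer, so treat this as an illustration of the compatibility concern the "# Alias for backwards compatibility" context line hints at.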
```diff
@@ -133,34 +140,31 @@ def compare(
     # Calculate latency diff
     latency_diff = actual.metrics.total_latency - golden.trace.metrics.total_latency

-    # Determine overall status using proper terminology:
-    # - REGRESSION: score dropped significantly (>5 points)
-    # - DRIFT: output changed (>20%) but score is stable
-    # - CHANGED: tools changed but output similar
-    # - STABLE: matches baseline
+    # Determine overall status:
+    # - REGRESSION: score dropped significantly (>5 points) - fix before deploy
+    # - TOOLS_CHANGED: different tools used - review before deploy
+    # - OUTPUT_CHANGED: same tools, different response - review before deploy
+    # - PASSED: matches baseline - safe to ship

     has_tool_changes = bool(tool_diffs)
-    has_output_drift = output_diff.similarity < 0.80
-    has_minor_output_change = output_diff.similarity < 0.95
+    has_output_change = output_diff.similarity < 0.95
+    has_significant_output_change = output_diff.similarity < 0.80
     score_dropped = score_diff < -5

-    has_differences = has_tool_changes or has_minor_output_change
+    has_differences = has_tool_changes or has_output_change

     if score_dropped:
-        # Score dropped significantly - this is a REGRESSION
+        # Score dropped significantly - REGRESSION
         overall_severity = DiffStatus.REGRESSION
-    elif has_output_drift:
-        # Output changed significantly but score stable - DRIFT
-        overall_severity = DiffStatus.DRIFT
-    elif has_tool_changes and not has_minor_output_change:
-        # Only tools changed, output similar - minor CHANGED
-        overall_severity = DiffStatus.CHANGED
-    elif has_minor_output_change:
-        # Small output change - DRIFT (less severe)
-        overall_severity = DiffStatus.DRIFT
+    elif has_tool_changes:
+        # Tools changed - TOOLS_CHANGED (behavior shifted)
+        overall_severity = DiffStatus.TOOLS_CHANGED
+    elif has_output_change:
+        # Output changed but same tools - OUTPUT_CHANGED
+        overall_severity = DiffStatus.OUTPUT_CHANGED
     else:
-        # No significant differences - STABLE
-        overall_severity = DiffStatus.STABLE
+        # No significant differences - PASSED
+        overall_severity = DiffStatus.PASSED

     return TraceDiff(
         test_name=golden.metadata.test_name,
```
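Note this hunk changes precedence, not just names: the old ladder classified a large output change (similarity < 0.80) as DRIFT even when tools also changed, while the new ladder checks tool changes first. A minimal sketch of the new decision order, using the thresholds from the hunk (the standalone `classify` function is illustrative; the real code sets `overall_severity` inline):

```python
from enum import Enum


class DiffStatus(Enum):
    PASSED = "passed"
    TOOLS_CHANGED = "tools_changed"
    OUTPUT_CHANGED = "output_changed"
    REGRESSION = "regression"


def classify(score_diff: float, similarity: float, has_tool_changes: bool) -> DiffStatus:
    """Mirror of the status ladder in compare(): REGRESSION wins,
    then TOOLS_CHANGED, then OUTPUT_CHANGED, else PASSED."""
    if score_diff < -5:          # score dropped significantly
        return DiffStatus.REGRESSION
    if has_tool_changes:         # any tool diff means behavior shifted
        return DiffStatus.TOOLS_CHANGED
    if similarity < 0.95:        # output changed beyond the noise floor
        return DiffStatus.OUTPUT_CHANGED
    return DiffStatus.PASSED


print(classify(0, 0.99, False).name)   # PASSED
print(classify(0, 0.99, True).name)    # TOOLS_CHANGED
print(classify(0, 0.78, False).name)   # OUTPUT_CHANGED
print(classify(-10, 0.99, True).name)  # REGRESSION
```

Under the old logic the third call would have been DRIFT and the second STABLE-or-CHANGED depending on similarity; the new ladder makes every tool change at least a review-level status.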
```diff
@@ -196,7 +200,7 @@ def _compare_tools(
                 position=g_start + i,
                 golden_tool=g,
                 actual_tool=a,
-                severity=DiffStatus.CHANGED,
+                severity=DiffStatus.TOOLS_CHANGED,
                 message=f"Tool changed: '{g}' -> '{a}' at step {g_start + i + 1}",
             )
         )
```
```diff
@@ -210,7 +214,7 @@ def _compare_tools(
                 position=g_start + i,
                 golden_tool=g,
                 actual_tool=None,
-                severity=DiffStatus.DRIFT,  # Missing tool is drift
+                severity=DiffStatus.TOOLS_CHANGED,  # Missing tool = behavior shifted
                 message=f"Tool removed: '{g}' was at step {g_start + i + 1}",
             )
         )
```
```diff
@@ -224,7 +228,7 @@ def _compare_tools(
                 position=a_start + i,
                 golden_tool=None,
                 actual_tool=a,
-                severity=DiffStatus.STABLE,  # Added tools are often OK
+                severity=DiffStatus.TOOLS_CHANGED,  # Added tool = behavior shifted
                 message=f"Tool added: '{a}' at step {a_start + i + 1}",
             )
         )
```
```diff
@@ -253,9 +257,9 @@ def _compare_outputs(

     # Determine severity (used internally, overall status determined in compare())
     if similarity >= 0.95:
-        severity = DiffStatus.STABLE
+        severity = DiffStatus.PASSED
     elif similarity >= 0.8:
-        severity = DiffStatus.DRIFT
+        severity = DiffStatus.OUTPUT_CHANGED
     else:
         severity = DiffStatus.REGRESSION
```