Commit 6d85cb8
refactor: sharpen positioning + rename diff statuses
- Update README: regression-focused positioning instead of generic "pytest for agents"
- Rename DiffStatus values: STABLE→PASSED, CHANGED→TOOLS_CHANGED, DRIFT→OUTPUT_CHANGED
- Update CLI help text to match new positioning
- Update HTML diff template CSS classes
1 parent a72673b commit 6d85cb8

File tree: 4 files changed, +160 −97 lines

README.md

Lines changed: 76 additions & 28 deletions
````diff
@@ -1,8 +1,12 @@
-# EvalView — Pytest-style Testing for AI Agents
+# EvalView — Catch Agent Regressions Before You Ship

-> An open-source testing framework for AI agents, with adapters for LangGraph, CrewAI, OpenAI Assistants, and Anthropic Claude.
+> Your agent worked yesterday. Today it's broken. What changed?

-**EvalView** is pytest for AI agents—write readable test cases, run them in CI/CD, and block deploys when behavior, cost, or latency regresses.
+**EvalView catches agent regressions** — tool changes, output changes, cost spikes, and latency spikes — before they hit production.
+
+```bash
+evalview run --diff  # Compare against golden baseline, block on regression
+```

 [![CI](https://github.com/hidai25/eval-view/actions/workflows/ci.yml/badge.svg)](https://github.com/hidai25/eval-view/actions/workflows/ci.yml)
 [![Python Version](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
````
````diff
@@ -26,30 +30,72 @@

 ---

+## The Problem
+
+You changed a prompt. Or swapped models. Or updated a tool.
+
+Now your agent:
+- ❌ Calls different tools than before
+- ❌ Returns different outputs for the same input
+- ❌ Costs 3x more than yesterday
+- ❌ Takes 5 seconds instead of 500ms
+
+You don't find out until users complain.
+
+## The Solution
+
+**EvalView detects these regressions in CI — before you deploy.**
+
+```bash
+# Save a working run as your baseline
+evalview golden save .evalview/results/xxx.json
+
+# Every future run compares against it
+evalview run --diff  # Fails on REGRESSION
+```
+
+---
+
 **Who is EvalView for?**
-- Solo devs & small teams shipping agents to production
-- Teams already using LangGraph / CrewAI / custom tools
-- People who want *failing tests* in CI, not just dashboards
+
+Builders shipping tool-using agents who keep breaking behavior when they change prompts, models, or tools.
+
+- You're iterating fast on prompts and models
+- You've broken your agent more than once after "just a small change"
+- You want CI to catch regressions, not your users

 Already using LangSmith, Langfuse, or other tracing?
-Use them to *see* what happened. Use EvalView to **block bad behavior in CI before it hits prod.**
+Use them to *see* what happened. Use EvalView to **block bad behavior before it ships.**

 > **Your Claude Code skills might be broken.** Claude silently ignores skills that exceed its [15k char budget](https://blog.fsck.com/2025/12/17/claude-code-skills-not-triggering/). [Check yours →](#skills-testing-claude-code--openai-codex)

 ---

+## What EvalView Catches
+
+| Regression Type | What It Means | Status |
+|-----------------|---------------|--------|
+| **REGRESSION** | Score dropped — agent got worse | 🔴 Fix before deploy |
+| **TOOLS_CHANGED** | Agent uses different tools now | 🟡 Review before deploy |
+| **OUTPUT_CHANGED** | Same tools, different response | 🟡 Review before deploy |
+| **PASSED** | Matches baseline | 🟢 Ship it |
+
+EvalView runs in CI. When it detects a regression, your deploy fails. You fix it before users see it.
+
+---
+
 ## What is EvalView?

-EvalView is a **testing framework for AI agents**.
+EvalView is a **regression testing framework for AI agents**.

 It lets you:

-- **Write tests in YAML** that describe inputs, expected tools, and acceptance thresholds
-- **Turn real conversations into regression suites** (record → generate tests → re-run on every change)
-- **Gate deployments in CI** on behavior, tool calls, cost, and latency
-- Plug into **LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HTTP agents**, and more
+- **Save golden baselines** — snapshot a working agent run
+- **Detect regressions automatically** — tool changes, output changes, cost spikes, latency spikes
+- **Block bad deploys in CI** — fail the build when behavior regresses
+- Plug into **LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, MCP servers**, and more

-Think: _"pytest / Playwright mindset, but for multi-step agents and tool-calling workflows."_
+Think: _"Regression testing for agents. Like screenshot testing, but for behavior."_

 > **Note:** LLM-as-judge evaluations are probabilistic. Results may vary between runs. Use [Statistical Mode](#statistical-mode-variance-testing) for reliable pass/fail decisions.
````
````diff
@@ -108,13 +154,13 @@ checks:
   hallucination: true
 ```

-**Regression detection** — fail if behavior drifts from baseline:
+**Regression detection** — fail if behavior changes from baseline:
 ```bash
 # Save a good run as baseline
 evalview golden save .evalview/results/xxx.json

 # Future runs compare against it
-evalview run --diff  # Fails on REGRESSION status
+evalview run --diff  # Fails on REGRESSION or TOOLS_CHANGED
 ```

 ---
````
````diff
@@ -356,24 +402,25 @@ evalview run --diff

 When you run with `--diff`, EvalView compares every test against its golden baseline and flags:

-| Status | What It Means |
-|--------|---------------|
-| **STABLE** | Matches baseline - no action needed |
-| **CHANGED** | Tools changed but output similar - review |
-| **DRIFT** | Output changed but score stable - investigate |
-| **REGRESSION** | Score dropped significantly - fix before deploy |
+| Status | What It Means | Action |
+|--------|---------------|--------|
+| **PASSED** | Matches baseline | 🟢 Ship it |
+| **TOOLS_CHANGED** | Agent uses different tools | 🟡 Review before deploy |
+| **OUTPUT_CHANGED** | Same tools, different response | 🟡 Review before deploy |
+| **REGRESSION** | Score dropped significantly | 🔴 Fix before deploy |

 ### Example Output

 ```
-━━━ Regression Detection ━━━
+━━━ Golden Diff Report ━━━

-✓ test-stock-analysis    STABLE
-⚠ test-customer-support  DRIFT       output similarity: 78%
-✗ test-code-review       REGRESSION  score dropped 15 points
+✓ PASSED          test-stock-analysis
+⚠ TOOLS_CHANGED   test-customer-support  added: web_search
+~ OUTPUT_CHANGED  test-summarizer        similarity: 78%
+✗ REGRESSION      test-code-review       score dropped 15 points

-Regressions detected: 1
-Drifts detected: 1
+1 REGRESSION - fix before deploy
+1 TOOLS_CHANGED - review before deploy
 ```

 ### Golden Commands
````
```diff
@@ -1171,9 +1218,10 @@ If EvalView caught a regression, saved you debugging time, or kept your agent co
 - [x] Tool categories for flexible matching
 - [x] Multi-run flakiness detection
 - [x] Skills testing (Claude Code, OpenAI Codex)
+- [x] MCP server testing (`adapter: mcp`)
+- [x] HTML diff reports (`--diff-report`)

 **Coming Soon:**
-- [ ] MCP server testing
 - [ ] Multi-turn conversation testing
 - [ ] Grounded hallucination checking
 - [ ] LLM-as-judge for skill guideline compliance
```

evalview/cli.py

Lines changed: 26 additions & 17 deletions
```diff
@@ -42,7 +42,16 @@
 @click.group()
 @click.version_option(version="0.1.7")
 def main():
-    """EvalView - Testing framework for multi-step AI agents."""
+    """EvalView - Catch agent regressions before you ship.
+
+    Detects tool changes, output changes, cost spikes, and latency spikes
+    by comparing against golden baselines.
+
+    Quick start:
+      evalview quickstart               # Try it in 2 minutes
+      evalview run --diff               # Compare against golden baseline
+      evalview golden save result.json  # Save a working run as baseline
+    """
     pass

```
```diff
@@ -1268,7 +1277,7 @@ async def _init_wizard_async(dir: str):
 @click.option(
     "--diff",
     is_flag=True,
-    help="Compare against golden traces and show regressions. Use 'evalview golden save' to create baselines.",
+    help="Compare against golden baselines. Shows REGRESSION/TOOLS_CHANGED/OUTPUT_CHANGED/PASSED status.",
 )
 @click.option(
     "--diff-report",
```
```diff
@@ -2422,16 +2431,16 @@ async def update_display():
     console.print("\n[bold cyan]━━━ Golden Diff Report ━━━[/bold cyan]\n")

     for test_name, trace_diff in diffs_found:
-        # Status-based display with proper terminology
+        # Status-based display with developer-friendly terminology
         status = trace_diff.overall_severity
         if status == DiffStatus.REGRESSION:
             icon = "[red]✗ REGRESSION[/red]"
-        elif status == DiffStatus.DRIFT:
-            icon = "[yellow]⚠ DRIFT[/yellow]"
-        elif status == DiffStatus.CHANGED:
-            icon = "[dim]~ CHANGED[/dim]"
+        elif status == DiffStatus.TOOLS_CHANGED:
+            icon = "[yellow]⚠ TOOLS_CHANGED[/yellow]"
+        elif status == DiffStatus.OUTPUT_CHANGED:
+            icon = "[dim]~ OUTPUT_CHANGED[/dim]"
         else:
-            icon = "[green]✓ STABLE[/green]"
+            icon = "[green]✓ PASSED[/green]"

         console.print(f"{icon} [bold]{test_name}[/bold]")
         console.print(f"  Summary: {trace_diff.summary()}")
```
```diff
@@ -2454,23 +2463,23 @@ async def update_display():

         console.print()

-    # Summary with proper terminology
+    # Summary with developer-friendly terminology
     regressions = sum(1 for _, d in diffs_found if d.overall_severity == DiffStatus.REGRESSION)
-    drifts = sum(1 for _, d in diffs_found if d.overall_severity == DiffStatus.DRIFT)
-    changes = sum(1 for _, d in diffs_found if d.overall_severity == DiffStatus.CHANGED)
+    tools_changed = sum(1 for _, d in diffs_found if d.overall_severity == DiffStatus.TOOLS_CHANGED)
+    output_changed = sum(1 for _, d in diffs_found if d.overall_severity == DiffStatus.OUTPUT_CHANGED)

     if regressions > 0:
-        console.print(f"[red]✗ {regressions} REGRESSION(s) detected! Score dropped - review before deploying.[/red]\n")
-    elif drifts > 0:
-        console.print(f"[yellow]⚠ {drifts} DRIFT(s) detected - output changed but score stable[/yellow]\n")
-    elif changes > 0:
-        console.print(f"[dim]~ {changes} minor change(s) - tools changed but output similar[/dim]\n")
+        console.print(f"[red]✗ {regressions} REGRESSION(s) - score dropped, fix before deploy[/red]\n")
+    elif tools_changed > 0:
+        console.print(f"[yellow]⚠ {tools_changed} TOOLS_CHANGED - agent behavior shifted, review before deploy[/yellow]\n")
+    elif output_changed > 0:
+        console.print(f"[dim]~ {output_changed} OUTPUT_CHANGED - response changed, review before deploy[/dim]\n")
     else:
         # Check if any golden traces exist
         goldens = store.list_golden()
         matched = sum(1 for g in goldens if any(r.test_case == g.test_name for r in results))
         if matched > 0:
-            console.print(f"[green]✓ STABLE - No differences from golden baseline ({matched} tests compared)[/green]\n")
+            console.print(f"[green]✓ PASSED - No differences from golden baseline ({matched} tests compared)[/green]\n")
         elif goldens:
             console.print("[yellow]No golden traces match these tests[/yellow]")
             console.print("[dim]Save one with: evalview golden save " + str(results_file) + "[/dim]\n")
```

evalview/core/diff.py

Lines changed: 34 additions & 30 deletions
```diff
@@ -19,12 +19,19 @@


 class DiffStatus(Enum):
-    """Status/category of differences found."""
+    """Status/category of differences found.

-    STABLE = "stable"  # No significant differences - matches baseline
-    CHANGED = "changed"  # Minor changes (tools changed but output similar)
-    DRIFT = "drift"  # Output changed but score stable (behavioral drift)
-    REGRESSION = "regression"  # Score dropped significantly - likely a bug
+    Four states with clear developer-friendly terminology:
+    - PASSED: Matches baseline, safe to ship
+    - TOOLS_CHANGED: Different tools used, behavior shifted
+    - OUTPUT_CHANGED: Same tools, different response
+    - REGRESSION: Score dropped, something got worse
+    """
+
+    PASSED = "passed"  # No significant differences - matches baseline
+    TOOLS_CHANGED = "tools_changed"  # Tools changed (agent behavior shifted)
+    OUTPUT_CHANGED = "output_changed"  # Output changed but score stable
+    REGRESSION = "regression"  # Score dropped significantly - likely a bug


 # Alias for backwards compatibility
```
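Because the enum *values* change along with the names (`"stable"` → `"passed"`, `"changed"` → `"tools_changed"`, `"drift"` → `"output_changed"`), any previously serialized status strings would no longer parse with `DiffStatus(raw)`. A hedged sketch of a loader shim — the old→new mapping is taken from this commit, but the helper name and the need for such a shim are assumptions, not part of the diff:

```python
from enum import Enum


class DiffStatus(Enum):
    # New values introduced by this commit
    PASSED = "passed"                  # was "stable"
    TOOLS_CHANGED = "tools_changed"    # was "changed"
    OUTPUT_CHANGED = "output_changed"  # was "drift"
    REGRESSION = "regression"          # unchanged

# Old serialized value -> new value, inferred from the rename in this commit
_LEGACY_VALUES = {"stable": "passed", "changed": "tools_changed", "drift": "output_changed"}

def parse_status(raw: str) -> DiffStatus:
    """Hypothetical helper: accept both pre- and post-rename status strings."""
    return DiffStatus(_LEGACY_VALUES.get(raw, raw))

print(parse_status("drift").name)  # OUTPUT_CHANGED
```

Whether old result files actually store the raw enum value depends on the serializer, so treat this as an illustration of the compatibility concern the "# Alias for backwards compatibility" context line hints at.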
```diff
@@ -133,34 +140,31 @@ def compare(
     # Calculate latency diff
     latency_diff = actual.metrics.total_latency - golden.trace.metrics.total_latency

-    # Determine overall status using proper terminology:
-    # - REGRESSION: score dropped significantly (>5 points)
-    # - DRIFT: output changed (>20%) but score is stable
-    # - CHANGED: tools changed but output similar
-    # - STABLE: matches baseline
+    # Determine overall status:
+    # - REGRESSION: score dropped significantly (>5 points) - fix before deploy
+    # - TOOLS_CHANGED: different tools used - review before deploy
+    # - OUTPUT_CHANGED: same tools, different response - review before deploy
+    # - PASSED: matches baseline - safe to ship

     has_tool_changes = bool(tool_diffs)
-    has_output_drift = output_diff.similarity < 0.80
-    has_minor_output_change = output_diff.similarity < 0.95
+    has_output_change = output_diff.similarity < 0.95
+    has_significant_output_change = output_diff.similarity < 0.80
     score_dropped = score_diff < -5

-    has_differences = has_tool_changes or has_minor_output_change
+    has_differences = has_tool_changes or has_output_change

     if score_dropped:
-        # Score dropped significantly - this is a REGRESSION
+        # Score dropped significantly - REGRESSION
         overall_severity = DiffStatus.REGRESSION
-    elif has_output_drift:
-        # Output changed significantly but score stable - DRIFT
-        overall_severity = DiffStatus.DRIFT
-    elif has_tool_changes and not has_minor_output_change:
-        # Only tools changed, output similar - minor CHANGED
-        overall_severity = DiffStatus.CHANGED
-    elif has_minor_output_change:
-        # Small output change - DRIFT (less severe)
-        overall_severity = DiffStatus.DRIFT
+    elif has_tool_changes:
+        # Tools changed - TOOLS_CHANGED (behavior shifted)
+        overall_severity = DiffStatus.TOOLS_CHANGED
+    elif has_output_change:
+        # Output changed but same tools - OUTPUT_CHANGED
+        overall_severity = DiffStatus.OUTPUT_CHANGED
     else:
-        # No significant differences - STABLE
-        overall_severity = DiffStatus.STABLE
+        # No significant differences - PASSED
+        overall_severity = DiffStatus.PASSED

     return TraceDiff(
         test_name=golden.metadata.test_name,
```
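Note this hunk changes precedence, not just names: the old ladder classified a large output change (similarity < 0.80) as DRIFT even when tools also changed, while the new ladder checks tool changes first. A minimal sketch of the new decision order, using the thresholds from the hunk (the standalone `classify` function is illustrative; the real code sets `overall_severity` inline):

```python
from enum import Enum


class DiffStatus(Enum):
    PASSED = "passed"
    TOOLS_CHANGED = "tools_changed"
    OUTPUT_CHANGED = "output_changed"
    REGRESSION = "regression"


def classify(score_diff: float, similarity: float, has_tool_changes: bool) -> DiffStatus:
    """Mirror of the status ladder in compare(): REGRESSION wins,
    then TOOLS_CHANGED, then OUTPUT_CHANGED, else PASSED."""
    if score_diff < -5:          # score dropped significantly
        return DiffStatus.REGRESSION
    if has_tool_changes:         # any tool diff means behavior shifted
        return DiffStatus.TOOLS_CHANGED
    if similarity < 0.95:        # output changed beyond the noise floor
        return DiffStatus.OUTPUT_CHANGED
    return DiffStatus.PASSED


print(classify(0, 0.99, False).name)   # PASSED
print(classify(0, 0.99, True).name)    # TOOLS_CHANGED
print(classify(0, 0.78, False).name)   # OUTPUT_CHANGED
print(classify(-10, 0.99, True).name)  # REGRESSION
```

Under the old logic the third call would have been DRIFT and the second STABLE-or-CHANGED depending on similarity; the new ladder makes every tool change at least a review-level status.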
```diff
@@ -196,7 +200,7 @@ def _compare_tools(
                 position=g_start + i,
                 golden_tool=g,
                 actual_tool=a,
-                severity=DiffStatus.CHANGED,
+                severity=DiffStatus.TOOLS_CHANGED,
                 message=f"Tool changed: '{g}' -> '{a}' at step {g_start + i + 1}",
             )
         )
```
```diff
@@ -210,7 +214,7 @@ def _compare_tools(
                 position=g_start + i,
                 golden_tool=g,
                 actual_tool=None,
-                severity=DiffStatus.DRIFT,  # Missing tool is drift
+                severity=DiffStatus.TOOLS_CHANGED,  # Missing tool = behavior shifted
                 message=f"Tool removed: '{g}' was at step {g_start + i + 1}",
             )
         )
```
```diff
@@ -224,7 +228,7 @@ def _compare_tools(
                 position=a_start + i,
                 golden_tool=None,
                 actual_tool=a,
-                severity=DiffStatus.STABLE,  # Added tools are often OK
+                severity=DiffStatus.TOOLS_CHANGED,  # Added tool = behavior shifted
                 message=f"Tool added: '{a}' at step {a_start + i + 1}",
             )
         )
```
```diff
@@ -253,9 +257,9 @@ def _compare_outputs(

     # Determine severity (used internally, overall status determined in compare())
     if similarity >= 0.95:
-        severity = DiffStatus.STABLE
+        severity = DiffStatus.PASSED
     elif similarity >= 0.8:
-        severity = DiffStatus.DRIFT
+        severity = DiffStatus.OUTPUT_CHANGED
     else:
         severity = DiffStatus.REGRESSION
```