|
4 | 4 | <p align="center"> |
5 | 5 | <img src="assets/logo.png" alt="EvalView" width="350"> |
6 | 6 | <br> |
7 | | - <strong>Regression testing for AI agents.</strong><br> |
8 | | - AI agent testing for CI/CD: generate tests, snapshot behavior, detect regressions, and block broken tool-calling agents before production. |
| 7 | + <strong>Regression guardrails for agents.</strong><br> |
| 8 | + Generate tests, snapshot behavior, and catch silent regressions in CI before they hit production. |
9 | 9 | </p> |
10 | 10 |
|
11 | 11 | <p align="center"> |
|
23 | 23 |
|
24 | 24 | --- |
25 | 25 |
|
26 | | -EvalView is an **AI agent testing** and **LLM agent regression testing** tool for teams shipping tool-calling agents. It helps you: |
| 26 | +EvalView is a **regression guardrail** for teams shipping agents, especially tool-calling agents. It helps you: |
27 | 27 | - generate your first agent test suite from a URL or traffic logs |
28 | 28 | - snapshot a golden baseline for agent behavior |
29 | | -- detect tool-call, sequence, output, cost, and latency regressions |
| 29 | +- detect tool-call, sequence, output, cost, latency, and multi-turn regressions |
30 | 30 | - run AI agent tests in CI/CD before shipping changes |
31 | 31 |
|
32 | 32 | Use EvalView when you need: |
@@ -75,6 +75,24 @@ The first two layers alone catch most regressions — fully offline, zero cost. |
75 | 75 |
|
76 | 76 | **Your data stays local.** Nothing is sent to EvalView servers — all processing happens on your machine. |
77 | 77 |
|
| 78 | +### Multi-turn regressions are first-class |
| 79 | + |
| 80 | +EvalView does not stop at single prompt/output checks. It can catch regressions where an agent skips a clarification question, asks the wrong follow-up, or takes the wrong tool path on turn two. |
| 81 | + |
| 82 | +```yaml |
| 83 | +tests: |
| 84 | + - name: refund_flow_requires_clarification |
| 85 | + conversation: |
| 86 | + - user: "I want a refund" |
| 87 | + expected: |
| 88 | + assistant_contains: ["order number"] |
| 89 | + - user: "Order 4812" |
| 90 | + expected: |
| 91 | + tools_called: ["lookup_order", "check_policy"] |
| 92 | +``` |
| 93 | +
|
| 94 | +That matters because many real agent failures happen after the first turn, when the agent has to remember context, ask a clarifying question, or decide whether to act. |
| 95 | +
|
78 | 96 | ### The workflow |
79 | 97 |
|
80 | 98 | ```bash |
|
0 commit comments