You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
<strong>Regression guardrails for agents.</strong><br>
8
-
Generate tests, snapshot behavior, and catch silent regressions in CI before they hit production.
8
+
Generate tests, snapshot tool use and multi-turn behavior, and catch silent regressions in CI before they hit production.
9
9
</p>
10
10
11
11
<palign="center">
@@ -36,6 +36,8 @@ Use EvalView when you need:
36
36
-**golden baseline testing for agents**
37
37
-**MCP server regression testing**
38
38
39
+
Normal tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder class of failures: the agent still returns `200`, but it stops asking the clarification question, takes the wrong tool path on turn two, or silently changes output quality after a model or prompt update.
40
+
39
41
<palign="center">
40
42
<imgsrc="assets/hero.jpg"alt="EvalView — multi-turn execution trace with sequence diagram"width="860">
41
43
<br>
@@ -55,6 +57,26 @@ EvalView sends test queries to your agent's API and records everything: which to
55
57
Score: 85 → 55 Output similarity: 35%
56
58
```
57
59
60
+
### Multi-turn regressions are first-class
61
+
62
+
Many real failures are not single-turn failures. They happen when an agent should clarify, remember context, or act on a follow-up.
63
+
64
+
```yaml
65
+
name: refund-needs-order-number
66
+
turns:
67
+
- query: "I want a refund"
68
+
expected:
69
+
output:
70
+
contains: ["order number"]
71
+
- query: "Order 4812"
72
+
expected:
73
+
tools: ["lookup_order", "check_policy"]
74
+
thresholds:
75
+
min_score: 70
76
+
```
77
+
78
+
If the agent stops asking for the order number, skips straight to the wrong action, or takes a different tool path on the follow-up turn, EvalView flags it.
79
+
58
80
**Four scoring layers, each one optional:**
59
81
60
82
| Layer | What it checks | Needs API key? | Cost |
@@ -75,24 +97,6 @@ The first two layers alone catch most regressions — fully offline, zero cost.
75
97
76
98
**Your data stays local.** Nothing is sent to EvalView servers — all processing happens on your machine.
77
99
78
-
### Multi-turn regressions are first-class
79
-
80
-
EvalView does not stop at single prompt/output checks. It can catch regressions where an agent skips a clarification question, asks the wrong follow-up, or takes the wrong tool path on turn two.
81
-
82
-
```yaml
83
-
tests:
84
-
- name: refund_flow_requires_clarification
85
-
conversation:
86
-
- user: "I want a refund"
87
-
expected:
88
-
assistant_contains: ["order number"]
89
-
- user: "Order 4812"
90
-
expected:
91
-
tools_called: ["lookup_order", "check_policy"]
92
-
```
93
-
94
-
That matters because many real agent failures happen after the first turn, when the agent has to remember context, ask a clarifying question, or decide whether to act.
95
-
96
100
### The workflow
97
101
98
102
```bash
@@ -120,6 +124,8 @@ evalview snapshot
120
124
evalview check
121
125
```
122
126
127
+
That starter flow can cover single-turn checks, clarification turns, and multi-turn follow-ups against the same baseline.
0 commit comments