Commit 189222a

feat(generate): ship draft suite generation, approval gating, and CI review flow
1 parent 209f6c6 commit 189222a

16 files changed, +2071 -131 lines changed

README.md

Lines changed: 43 additions & 7 deletions
@@ -65,9 +65,9 @@ The first two layers alone catch most regressions — fully offline, zero cost.
 ### The workflow

 ```bash
-evalview capture --agent http://localhost:8000/invoke # 1. Record real interactions
-evalview snapshot # 2. Save as baseline
-evalview check # 3. Catch regressions
+evalview generate --agent http://localhost:8000 # 1. Draft a regression suite
+evalview snapshot tests/generated --approve-generated # 2. Approve + baseline
+evalview check tests/generated # 3. Catch regressions
 evalview monitor # 4. Watch continuously (+ Slack alerts)
 # ✅ All clean — or ❌ REGRESSION: score 85 → 71
 ```
@@ -76,7 +76,9 @@ evalview monitor # 4. Watch continuously

 Choose the shortest path for your use case:

-- New project: `evalview capture --agent ...` → `evalview snapshot` → `evalview check`
+- New project, no traffic yet: `evalview generate --agent ...` → `evalview snapshot --approve-generated` → `evalview check`
+- Existing traffic or staging logs: `evalview generate --from-log traffic.jsonl`
+- Production-shaped tests from real usage: `evalview capture --agent ...` → `evalview snapshot` → `evalview check`
 - Existing tests, no baselines yet: `evalview snapshot`
 - CI gate for regressions: [Golden Traces](docs/GOLDEN_TRACES.md) and [CI/CD Integration](docs/CI_CD.md)
 - Framework-specific setup: [Framework Support](docs/FRAMEWORK_SUPPORT.md)
@@ -245,7 +247,23 @@ evalview check --semantic-diff
 pip install evalview
 ```

-### Step 1 — Capture real interactions as tests
+### Step 1 — Generate or capture tests
+
+If you have no test suite yet, start with generation:
+
+```bash
+evalview generate --agent http://localhost:8000
+# Writes draft YAML tests to tests/generated/
+# Also writes tests/generated/generated.report.json for CI review
+```
+
+If you already have logs from staging or production:
+
+```bash
+evalview generate --from-log traffic.jsonl
+```
+
+If you want tests based on real user flows instead of planned probes:

 ```bash
 evalview capture --agent http://localhost:8000/invoke
@@ -254,9 +272,19 @@ evalview capture --agent http://localhost:8000/invoke
 # Tests are saved to tests/test-cases/ automatically
 ```

-> **Why capture first?** Tests from real usage catch real regressions. Auto-generated tests from guessed queries score poorly and give you false confidence.
+> **When to use which?**
+> `generate` is the fastest path from zero to a draft suite.
+> `capture` is the highest-signal path when you already have real usage to replay.
+
+### Step 2 — Review and save as your baseline

-### Step 2 — Save as your baseline
+Generated tests are draft-only until you approve them:
+
+```bash
+evalview snapshot tests/generated --approve-generated
+```
+
+Captured or hand-written tests snapshot normally:

 ```bash
 export OPENAI_API_KEY='your-key' # for LLM-as-judge scoring
@@ -269,6 +297,14 @@ evalview snapshot
 evalview check # run this after every change
 ```

+### Review generated suites in CI
+
+```bash
+evalview ci comment --results tests/generated/generated.report.json --dry-run
+```
+
+That review comment summarizes discovered tools, generated behavior paths, coverage gaps, and the approval workflow before baselining.
+
 ### No agent yet? Try the demo

 ```bash

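The README entry points above differ only in the first command: live probing with `--agent`, or import with `--from-log`. A minimal sketch of a wrapper that picks between them, assuming a hypothetical helper name `draft_suite` and a hypothetical `EVALVIEW` override variable (neither is part of EvalView); only the two `generate` invocations come from the diff itself.

```bash
# Hypothetical helper, not part of EvalView.
# EVALVIEW is an assumed override so the call can be previewed with EVALVIEW=echo.
draft_suite() {
  ev="${EVALVIEW:-evalview}"
  case "$1" in
    agent) "$ev" generate --agent "$2" ;;     # no traffic yet: probe a live agent
    log)   "$ev" generate --from-log "$2" ;;  # existing staging or production logs
    *)     echo "usage: draft_suite agent <url> | draft_suite log <path>" >&2
           return 2 ;;
  esac
}
```

For example, `draft_suite log traffic.jsonl` would draft tests from a log. The helper deliberately stops before `evalview snapshot tests/generated --approve-generated`, so the drafts can be reviewed first.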
docs/CI_CD.md

Lines changed: 28 additions & 0 deletions
@@ -6,6 +6,34 @@

 EvalView is CLI-first. You can run it locally or add to CI.

+## Review Generated Suites in PRs
+
+If you use `evalview generate`, every run writes a machine-readable suite report:
+
+```bash
+tests/generated/generated.report.json
+```
+
+Turn that into a PR comment:
+
+```bash
+evalview ci comment --results tests/generated/generated.report.json
+```
+
+The generated-suite comment includes:
+- discovered tools
+- draft behavior paths
+- coverage gaps
+- approval instructions for `snapshot --approve-generated`
+
+Recommended flow:
+
+```bash
+evalview generate --agent http://localhost:8000
+evalview ci comment --results tests/generated/generated.report.json
+evalview snapshot tests/generated --approve-generated
+```
+
 ---

 ## GitHub Action (Recommended)

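The recommended flow in docs/CI_CD.md can be split so that approval stays a separate, explicit step after the PR comment has been read. This is a sketch under stated assumptions: `propose_suite`, `approve_suite`, the `REPORT` variable, and the `EVALVIEW` override are hypothetical names introduced here, and the check for an existing report file is an illustrative guard, not EvalView behavior. The three `evalview` invocations are the ones shown in the diff.

```bash
# Hypothetical helpers, not part of EvalView.
REPORT="tests/generated/generated.report.json"

# Step 1: draft the suite and post (or preview) the review comment.
propose_suite() {
  ev="${EVALVIEW:-evalview}"
  "$ev" generate --agent "$1" || return
  "$ev" ci comment --results "$REPORT"
}

# Step 2: run only after a human has reviewed the drafts.
approve_suite() {
  ev="${EVALVIEW:-evalview}"
  # Assumed guard: refuse to baseline if no generation report is present.
  if [ ! -f "$REPORT" ]; then
    echo "no report at $REPORT; run propose_suite first" >&2
    return 1
  fi
  "$ev" snapshot tests/generated --approve-generated
}
```

Keeping the two functions apart means a CI job can call `propose_suite` on every PR, while `approve_suite` runs only when a reviewer triggers it.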
docs/CLI_REFERENCE.md

Lines changed: 39 additions & 0 deletions
@@ -110,6 +110,7 @@ Options:
   -t, --test TEXT        Snapshot only this specific test
   -n, --notes TEXT       Notes about this snapshot
   --variant TEXT         Save as named variant (max 5 per test)
+  --approve-generated    Approve generated draft tests before snapshotting
 ```

 ### Examples
@@ -118,6 +119,7 @@ Options:
 evalview snapshot # Snapshot all passing tests
 evalview snapshot --test "my-test" # Snapshot one test
 evalview snapshot --variant v2 # Save alternate acceptable behavior
+evalview snapshot tests/generated --approve-generated
 ```

 ---
@@ -150,6 +152,43 @@ evalview check --dry-run # Preview plan, no API calls
 evalview check --budget 0.50 # Cap spend at $0.50
 ```

+## `evalview generate`
+
+Generate a draft regression suite from a live agent or existing traffic logs.
+
+```bash
+evalview generate [OPTIONS]
+
+Options:
+  --agent URL                 Agent endpoint URL
+  --adapter TEXT              Adapter type (default: config or http)
+  --budget N                  Maximum probe runs / imported entries
+  --out DIR                   Output directory (default: tests/generated)
+  --seed FILE                 Newline-delimited seed prompts
+  --from-log PATH             Generate from a log file instead of live probing
+  --log-format FORMAT         auto|jsonl|openai|evalview
+  --include-tools TEXT        Comma-separated tool names to focus on
+  --exclude-tools TEXT        Comma-separated tool names to avoid
+  --allow-live-side-effects   Allow side-effecting prompts
+  --timeout FLOAT             Probe timeout in seconds
+  --dry-run                   Preview without writing files
+```
+
+### Examples
+
+```bash
+evalview generate --agent http://localhost:8000
+evalview generate --from-log traffic.jsonl
+evalview generate --agent http://localhost:8000 --include-tools search,calendar
+evalview generate --dry-run
+```
+
+Generated suites are draft-only until approved:
+
+```bash
+evalview snapshot tests/generated --approve-generated
+```
+
 ---

 ## `evalview expand`

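The `--seed` option above takes a file of newline-delimited prompts. A small sketch of preparing such a file follows; the helper name `write_seeds`, the file name `seeds.txt`, and the example prompts are assumptions made here for illustration. Combining `--seed` with `--include-tools` and `--dry-run` in one call is also an assumption, since the diff lists these options but does not show them together.

```bash
# Hypothetical helper, not part of EvalView:
# write one seed prompt per line to the file named by the first argument.
write_seeds() {
  seed_file="$1"
  shift
  printf '%s\n' "$@" > "$seed_file"
}

# Illustrative usage (prompts and tool names are made up):
#   write_seeds seeds.txt "Find flights to Berlin" "What is on my calendar today?"
#   evalview generate --agent http://localhost:8000 --seed seeds.txt \
#     --include-tools search,calendar --dry-run
```

Previewing with `--dry-run` first fits the documented "Preview without writing files" behavior, so nothing lands in `tests/generated` until the seeds look right.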
docs/README.md

Lines changed: 2 additions & 1 deletion
@@ -15,6 +15,7 @@ If you're new:
 | I want to… | Read this first | Then |
 |------------|-----------------|------|
 | Get EvalView running quickly | [Getting Started](GETTING_STARTED.md) | [CLI Reference](CLI_REFERENCE.md) |
+| Go from zero tests to a draft suite | [Test Generation](TEST_GENERATION.md) | [CI/CD Integration](CI_CD.md) |
 | Understand regression detection | [Golden Traces](GOLDEN_TRACES.md) | [Evaluation Metrics](EVALUATION_METRICS.md) |
 | Test a specific framework | [Framework Support](FRAMEWORK_SUPPORT.md) | the matching quick start below |
 | Set up CI/CD | [CI/CD Integration](CI_CD.md) | [Golden Traces](GOLDEN_TRACES.md) |
@@ -41,7 +42,7 @@ If you're new:
 | [Suite Types](SUITE_TYPES.md) | Separate capability tests from regression tests |
 | [Behavior Coverage](BEHAVIOR_COVERAGE.md) | Track gaps in the behaviors you test |
 | [Cost Tracking](COST_TRACKING.md) | Understand token and dollar usage |
-| [Test Generation](TEST_GENERATION.md) | Expand a seed test into broader coverage |
+| [Test Generation](TEST_GENERATION.md) | Generate a draft suite from an agent or logs |
 | [Trace Specification](TRACE_SPEC.md) | Execution trace format used across adapters |

 ## Frameworks
