Releases: hidai25/eval-view
v0.5.3
HTML Report Redesign
Overview tab — Compact KPI strip replaces 6-card hero grid. Removed duplicate Agent Model/Token Usage cards and Distribution donut. Full-width score chart. No-judge notice when hallucination/safety checks are skipped.
Execution Trace tab — Adaptive collapse: with ≤4 tests, all are expanded; with 5+, only the first. Larger chevron buttons.
Diffs tab — Collapsible items (passed items collapsed, changed items expanded). Removed duplicate tool tags. Lazy-rendered trajectory diagrams behind a toggle. Baseline→current score display (86.0 → 87.5, +1.5). Tooltips on lexical/semantic similarity.
Timeline tab — KPI summary strip. Side-by-side latency + cost charts. Color-coded bars by test.
All tabs — Larger Mermaid diagram fonts. Removed SVG max-height cap.
v0.5.2
What's New in v0.5.2
Cold-Start Test Generation (evalview generate)
- Production-grade test generation from live agent probing — no manual YAML writing needed
- Interactive probe budget and model selection
- Multi-turn conversation tests generated as single cohesive test cases
- Domain-aware draft generation with coherence filtering
- `--synth-model` flag to override the synthesis model (see the sketch after this list)
- Real-time elapsed timer during probe runs
- Delta reporting: shows changes since last generation
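A minimal invocation sketch — `--synth-model` is the flag documented above; the model name passed to it is just a placeholder, not a recommendation:

```bash
evalview generate                               # interactive: prompts for probe budget and model
evalview generate --synth-model gpt-5.4-mini    # override the synthesis model
```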
Improved Reports
- Model and token usage displayed in HTML reports
- Judge cost tracking surfaced in check reports
- Per-query model shown in trace cost breakdown
- Cleaner baseline metadata and timeline in check reports
- Turn-level details with clickable chevrons in multi-turn traces
Better Onboarding (evalview init)
- Remembers the active test suite for plain `snapshot` and `check` (see the sketch below)
- Auto-approves generated drafts with scoped snapshot guidance
- Detects local agents on `/execute` and `/health` endpoints
- Refreshes stale config when a live agent is detected
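A sketch of the resulting flow, using only the commands named in these notes (the comments describe behavior claimed above, not verified output):

```bash
evalview init        # detects a local agent on /execute or /health, writes config
evalview snapshot    # plain invocation — uses the remembered test suite
evalview check       # same suite, checked against the snapshot baseline
```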
Check Command Improvements
- Shows last baseline snapshot timestamp
- Auto-generates local HTML report on failures
- Streamlined regression demo flow
Model Support
- GPT-5 family model support (`gpt-5.4`, `gpt-5.4-mini`)
- Interactive model selection from available providers
Multi-Turn & Monitoring
- Multi-turn golden baselines with per-turn tool sequences
- Cost/latency spike alerts in monitor mode
- Batch edge-case expansion for test coverage
Bug Fixes
- Fix multi-turn filter — a different output counts as meaningful regardless of tool usage
- Fix probe progress for skipped follow-ups
- Predictable timing — one discovery probe, with multi-turn probes counted against the budget
- Always show agent model in run output
- Eliminate duplicate multi-turn tests
- Silence Ollama JSON fallback warnings in normal runs
Docs
- Trimmed README from 1420 to 274 lines — details moved to dedicated docs
- Comparison docs and SEO content added
v0.5.1
What's New
Added
- `evalview generate` — draft test suite generation from agent probing or log imports, with approval gating and CI review flow
- Approval workflow — generated tests require explicit approval before becoming baselines
- CI review comments — `evalview ci comment` posts generation reports on PRs (see the sketch below)
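A sketch of the intended review loop — both commands are named above; running `ci comment` from a CI job is an assumption about where it is meant to be used:

```bash
evalview generate      # locally or in CI: write draft tests, gated on approval
evalview ci comment    # in CI: post the generation report as a PR comment
```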
Fixed
- Python 3.9 compatibility: replaced `datetime.UTC` with `timezone.utc`
- Mypy type errors in the generate command and test generation module
- Codebase refactor and cleanup across 71 files
Full Changelog: v0.5.0...v0.5.1
v0.5.0 — Production Monitoring
What's New
Production Monitoring (evalview monitor)
- Continuous regression detection — runs `evalview check` in a loop with a configurable interval (default: 5 min)
- Slack alerts — webhook notifications on new regressions, recovery notifications when resolved
- Smart dedup — only alerts on NEW failures, no re-alerts on persistent issues
- JSONL history export — `--history monitor.jsonl` appends cycle data for trend analysis and dashboards
- Graceful shutdown — Ctrl+C stops cleanly with a cost summary
- Config support — CLI flags, `config.yaml`, or the `EVALVIEW_SLACK_WEBHOOK` env var
```bash
evalview monitor                            # Check every 5 min
evalview monitor --interval 60              # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl    # Save trends
```

Community Contributions
- CSV export — `evalview check --csv results.csv` (@muhammadrashid4587)
- Timeout flag — `evalview check --timeout 60` (@zamadye) — both flags combined in the sketch below
- Better errors — human-friendly connection failure messages (@passionworkeer)
- JSONL history — `--history` flag for monitor (@clawtom)
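A sketch combining the two new `check` flags — that they compose in a single invocation is an assumption, and the file name is a placeholder:

```bash
evalview check --csv results.csv --timeout 60   # CSV export plus the 60s timeout flag
```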
Bug Fixes & Refactoring
- Fixed severity comparison bug (was using string matching instead of enum comparison)
- Fixed JSONL history pass count (was using fail_on filter instead of actual counts)
- Extracted a shared `_parse_fail_statuses` utility for consistent `fail_on` parsing
- Eliminated redundant config loading in the monitor loop
Deployment
```bash
# Quick background run
nohup evalview monitor --slack-webhook https://... &

# Docker
docker run -d -v $(pwd):/app -w /app evalview monitor --slack-webhook https://...
```

Full Changelog: v0.4.1...v0.5.0
v0.4.1
What's New
Mistral Adapter
- Direct Mistral API support via `pip install evalview[mistral]`
- Lazy import — no dependency unless you use it
PII Evaluator
- Opt-in detection for emails, phones, SSNs, credit cards, addresses
- Luhn algorithm validation for credit cards to reduce false positives
- Enable with `checks: { pii: true }` in test YAML (see the sketch below)
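A minimal sketch of an opted-in test case — only the `checks.pii` key is documented above; the other fields are illustrative, patterned on the `turns:` example in v0.4.0 below:

```yaml
name: support-reply-pii                  # illustrative test name
query: "Email me a copy of my invoice"   # illustrative single-turn query
checks:
  pii: true   # opt-in: flags emails, phones, SSNs, credit cards, addresses
```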
Multi-Turn HTML Reports
- Mermaid sequence diagrams showing conversation turns with tool calls
- Per-turn query and tool breakdown in the Execution Trace tab
Security
- GitHub Action: replaced `eval $CMD` with bash arrays, moved inputs to env vars
- Mermaid diagrams: fixed autoescape breaking arrows, sanitized user content
README
- New hero section with logo, sequence diagram screenshot, data flow diagram
- "Your data stays local" privacy explanation
- Updated model version examples to Claude 4.5/4.6
Full Changelog: v0.4.0...v0.4.1
v0.4.0 — Multi-turn testing, A/B comparison, Cloud sync
What's new in 0.4.0
Multi-turn conversation testing
Test stateful, multi-step conversations with the new `turns:` YAML field. Each turn gets the accumulated conversation history injected automatically.
```yaml
name: flight-booking-conversation
turns:
  - query: "I want to fly from NYC to Paris next Friday"
    expected:
      tools: [search_flights]
  - query: "Book the cheapest economy option"
    expected:
      tools: [book_flight]
      output:
        contains: ["confirmed", "Paris"]
  - query: "Send me a confirmation email"
    expected:
      tools: [send_email]
expected:
  tools: [search_flights, book_flight, send_email]
thresholds:
  min_score: 80
```

A/B endpoint comparison
Run the same test suite against two endpoints and get a per-test verdict table.
```bash
evalview compare \
  --v1 http://prod.internal/invoke --label-v1 "gpt-4o (prod)" \
  --v2 http://staging.internal/invoke --label-v2 "claude-sonnet (staging)" \
  --tests tests/
```

Cloud baseline sync
```bash
evalview login       # OAuth sign-in
evalview snapshot    # baselines auto-sync to cloud
evalview check       # teammates pull your baselines automatically
```

Other highlights
- `evalview capture` — HTTP proxy records real agent traffic as test YAMLs
- `evalview install-hooks` — inject regression checks into git pre-push
- Silent model update detection — alerts when a provider swaps the model behind the same API name
- Gradual drift detection — OLS regression over 10-check window
- Semantic diff — `--semantic-diff` scores by meaning, not character similarity (usage sketch after this list)
- Auto-open HTML report after every `evalview run`
- `evalview init` now auto-detects your agent endpoint and generates starter tests
- Test quality gating — low-quality generated tests are skipped, not silently polluting scores
- mypy clean — 0 errors across 109 source files
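A usage sketch for the new commands above — the notes don't say which subcommand takes `--semantic-diff`, so attaching it to `check` here is an assumption:

```bash
evalview capture                  # proxy: record live agent traffic as test YAMLs
evalview install-hooks            # wire a regression check into git pre-push
evalview check --semantic-diff    # assumed placement: score diffs by meaning
```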
Community contributions
- Pydantic field validation for `TestCase` (#54 by @illbeurs)
- Edge tests for `CostEvaluator` and `LatencyEvaluator` (#55 by @illbeurs)
- `health_check()` on `OllamaAdapter` (#57 by @gauravxthakur)
- `ConsoleReporter` docstrings (#56 by @gauravxthakur)
Full changelog: CHANGELOG.md
v0.3.2 — Fix nested claude auth for MCP users
What's fixed
claude-code adapter: auth failure in MCP context
The adapter was failing immediately (~3-4s) with "Invalid API key" when invoked through the MCP chain. Root cause: Claude Code sets `ANTHROPIC_API_KEY` to a session-scoped token in its subprocess environment, which the inner `claude --print` inherited and the Anthropic API rejected.
Fix: Strip `ANTHROPIC_API_KEY` from the adapter's env so the inner `claude` falls back to `~/.claude.json` credentials (stored by `claude auth login`).
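The same workaround reproduced by hand, as a sketch (standard `env -u`; the prompt string is a placeholder):

```bash
# Run the inner claude without the session-scoped key so it falls back
# to the credentials stored by `claude auth login`
env -u ANTHROPIC_API_KEY claude --print "ping"
```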
custom adapter: works for OAuth users (no API key needed)
The demo `runner.py` used the Anthropic SDK directly, which requires `ANTHROPIC_API_KEY`. Claude Code OAuth users don't have this env var set.
Fix: Rewrote the runner to use a `claude --print` subprocess (same auth path as the claude-code adapter).
MCP server: skill test timeout raised to 600s
Multi-test suites (10 tests × ~15s each) were hitting the previous 120s timeout.
Other improvements
- Non-interactive mode for `generate-tests` (`--auto` / no TTY)
- Better first-snapshot and first-check celebration panels with CI integration steps
- 60s asyncio timeout on LLM calls in test generator
- Actionable hints when skill dependencies (e.g. mcporter) are missing
v0.3.0 — Claude Code MCP + Skills Testing + Telemetry
What's New in 0.3
🤖 Claude Code MCP Integration
EvalView now runs as an MCP server inside Claude Code — test your agent without leaving the conversation.
```bash
claude mcp add --transport stdio evalview -- evalview mcp serve
cp CLAUDE.md.example CLAUDE.md
```

7 MCP tools available:
| Tool | What it does |
|---|---|
| `create_test` | Generate test cases from natural language |
| `run_snapshot` | Capture golden baseline |
| `run_check` | Detect regressions inline |
| `list_tests` | Show all baselines |
| `validate_skill` | Validate SKILL.md structure |
| `generate_skill_tests` | Auto-generate skill test suite |
| `run_skill_test` | Run Phase 1 (deterministic) + Phase 2 (rubric) |
📊 Telemetry Improvements
- Users now show as `EvalView-3f8a2b` instead of raw UUIDs in PostHog
- Session duration tracking (`session_duration_ms`)
- Set `EVALVIEW_DEV=1` to tag your own events for filtering
🐕 Dogfood Regression Testing
EvalView now tests itself using its own evaluation logic on every CI run.
Bug Fixes
- Fixed PIPESTATUS CI bug (regression checks now correctly fail CI)
- Fixed deprecated `asyncio.get_event_loop()` → `get_running_loop()`
- Fixed silent failures in `--json` mode
- ANSI escape stripping improved in MCP output
Upgrade
```bash
pip install --upgrade evalview
```

v0.2.9 — Clean MCP output
Bug Fix
- Strip ANSI escape codes from MCP tool output so `list_tests` and `run_snapshot` return clean text to Claude Code
Upgrade
```bash
pip install --upgrade evalview
```
v0.2.8 — snapshot/check end-to-end fixes
Bug Fixes
- Fixed `adapter.run()` → `adapter.execute()` in snapshot/check code paths (the method didn't exist)
- Fixed `Evaluator.evaluate()` called without `await` — the coroutine never ran
- Fixed `_create_adapter` not passing `allow_private_urls` to HTTPAdapter (localhost was blocked)
- Version now read dynamically from package metadata — no more hardcoded strings
Result
`evalview snapshot` and `evalview check` now work end-to-end against real agents. Tested against the mock agent at 100/100.
Upgrade
```bash
pip install --upgrade evalview
```