Releases: hidai25/eval-view

v0.5.3

18 Mar 22:10

HTML Report Redesign

Overview tab — Compact KPI strip replaces 6-card hero grid. Removed duplicate Agent Model/Token Usage cards and Distribution donut. Full-width score chart. No-judge notice when hallucination/safety checks are skipped.

Execution Trace tab — Adaptive collapse: with ≤4 tests, all are expanded; with 5 or more, only the first. Larger chevron buttons.

Diffs tab — Collapsible items (passed collapsed, changes expanded). Removed duplicate tool tags. Lazy-rendered trajectory diagrams behind toggle. Baseline→current score display (86.0 → 87.5 +1.5). Tooltips on lexical/semantic similarity.

Timeline tab — KPI summary strip. Side-by-side latency + cost charts. Color-coded bars by test.

All tabs — Larger Mermaid diagram fonts. Removed SVG max-height cap.

v0.5.2

17 Mar 08:57

What's New in v0.5.2

Cold-Start Test Generation (evalview generate)

  • Production-grade test generation from live agent probing — no manual YAML writing needed
  • Interactive probe budget and model selection
  • Multi-turn conversation tests generated as single cohesive test cases
  • Domain-aware draft generation with coherence filtering
  • --synth-model flag to override the synthesis model
  • Real-time elapsed timer during probe runs
  • Delta reporting: shows changes since last generation

Improved Reports

  • Model and token usage displayed in HTML reports
  • Judge cost tracking surfaced in check reports
  • Per-query model shown in trace cost breakdown
  • Cleaner baseline metadata and timeline in check reports
  • Turn-level details with clickable chevrons in multi-turn traces

Better Onboarding (evalview init)

  • Remembers active test suite for plain snapshot and check
  • Auto-approves generated drafts with scoped snapshot guidance
  • Detects local agents on /execute and /health endpoints
  • Refreshes stale config when a live agent is detected
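The endpoint detection in `evalview init` can be pictured as a quick probe of the conventional paths. This is an illustrative sketch, not evalview's actual code; the function names are hypothetical:

```python
import urllib.request

def candidate_endpoints(base: str) -> list[str]:
    # The two paths init checks on a local agent, per the release notes.
    return [f"{base}/execute", f"{base}/health"]

def is_live(url: str, timeout: float = 1.0) -> bool:
    # Treat any response below 500 as "an agent is listening here".
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except OSError:
        return False

print(candidate_endpoints("http://localhost:8000"))
# ['http://localhost:8000/execute', 'http://localhost:8000/health']
```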

Check Command Improvements

  • Shows last baseline snapshot timestamp
  • Auto-generates local HTML report on failures
  • Streamlined regression demo flow

Model Support

  • GPT-5 family model support (gpt-5.4, gpt-5.4-mini)
  • Interactive model selection from available providers

Multi-Turn & Monitoring

  • Multi-turn golden baselines with per-turn tool sequences
  • Cost/latency spike alerts in monitor mode
  • Batch edge-case expansion for test coverage

Bug Fixes

  • Fix multi-turn filter — different output is meaningful regardless of tools
  • Fix probe progress for skipped follow-ups
  • Predictable timing — 1 discovery, multi-turn counts against budget
  • Always show agent model in run output
  • Eliminate duplicate multi-turn tests
  • Silence Ollama JSON fallback warnings in normal runs

Docs

  • Trimmed README from 1420 to 274 lines — details moved to dedicated docs
  • Comparison docs and SEO content added

v0.5.1

13 Mar 20:13

What's New

Added

  • evalview generate — draft test suite generation from agent probing or log imports, with approval gating and CI review flow
  • Approval workflow — generated tests require explicit approval before becoming baselines
  • CI review comments — evalview ci comment posts generation reports on PRs

Fixed

  • Python 3.9 compatibility: replaced datetime.UTC with timezone.utc
  • Mypy type errors in generate command and test generation module
  • Codebase refactor and cleanup across 71 files
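The Python 3.9 fix comes down to one spelling change: `datetime.UTC` is an alias that only exists from Python 3.11, while `timezone.utc` refers to the same object and has been available since Python 3.2. A minimal sketch:

```python
from datetime import datetime, timezone

# datetime.UTC (3.11+) and timezone.utc (3.2+) are the same tzinfo object,
# so timezone.utc is the 3.9-compatible spelling.
now = datetime.now(timezone.utc)
print(now.tzinfo)  # UTC
```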

Full Changelog: v0.5.0...v0.5.1

v0.5.0 — Production Monitoring

12 Mar 12:05

What's New

Production Monitoring (evalview monitor)

  • Continuous regression detection — runs evalview check in a loop with configurable interval (default: 5 min)
  • Slack alerts — webhook notifications on new regressions, recovery notifications when resolved
  • Smart dedup — only alerts on NEW failures, no re-alerts on persistent issues
  • JSONL history export — --history monitor.jsonl appends cycle data for trend analysis and dashboards
  • Graceful shutdown — Ctrl+C stops cleanly with cost summary
  • Config support — CLI flags, config.yaml, or EVALVIEW_SLACK_WEBHOOK env var
evalview monitor                                         # Check every 5 min
evalview monitor --interval 60                           # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl                 # Save trends
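The smart dedup described above reduces to comparing the failing set of the current cycle against the previous one. A minimal sketch under assumed names (not evalview's internals):

```python
def new_failures(previous: set, current: set) -> tuple:
    """Return (newly failing, recovered) test IDs between two monitor cycles."""
    return current - previous, previous - current

prev = {"booking", "refund"}
curr = {"refund", "search"}
fresh, recovered = new_failures(prev, curr)
print(sorted(fresh))      # ['search']  -> send alert
print(sorted(recovered))  # ['booking'] -> send recovery notification
# 'refund' fails in both cycles, so no re-alert is sent.
```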

Community Contributions

  • CSV export — evalview check --csv results.csv (@muhammadrashid4587)
  • Timeout flag — evalview check --timeout 60 (@zamadye)
  • Better errors — human-friendly connection failure messages (@passionworkeer)
  • JSONL history — --history flag for monitor (@clawtom)

Bug Fixes & Refactoring

  • Fixed severity comparison bug (was using string matching instead of enum comparison)
  • Fixed JSONL history pass count (was using fail_on filter instead of actual counts)
  • Extracted shared _parse_fail_statuses utility for consistent fail_on parsing
  • Eliminated redundant config loading in monitor loop
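A shared parser like the extracted `_parse_fail_statuses` typically just normalizes the comma-separated `fail_on` value once, so every caller agrees on casing and whitespace. This is a hypothetical sketch, including the allowed status names:

```python
def parse_fail_statuses(fail_on: str) -> set:
    # Hypothetical: normalize a comma-separated fail_on value into a
    # canonical set, tolerating spaces and mixed case.
    allowed = {"fail", "error", "warn"}
    statuses = {s.strip().lower() for s in fail_on.split(",") if s.strip()}
    unknown = statuses - allowed
    if unknown:
        raise ValueError(f"unknown fail_on statuses: {sorted(unknown)}")
    return statuses

print(sorted(parse_fail_statuses("Fail, ERROR")))  # ['error', 'fail']
```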

Deployment

# Quick background run
nohup evalview monitor --slack-webhook https://... &

# Docker
docker run -d -v $(pwd):/app -w /app evalview monitor --slack-webhook https://...

Full Changelog: v0.4.1...v0.5.0

v0.4.1

09 Mar 09:47


What's New

Mistral Adapter

  • Direct Mistral API support via pip install evalview[mistral]
  • Lazy import — no dependency unless you use it

PII Evaluator

  • Opt-in detection for emails, phones, SSNs, credit cards, addresses
  • Luhn algorithm validation for credit cards to reduce false positives
  • Enable with checks: { pii: true } in test YAML
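The Luhn check works by doubling every second digit from the right and requiring the digit sum to be divisible by 10, which rejects most random digit strings that merely look like card numbers. A standard implementation sketch (not necessarily evalview's exact code):

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: a candidate card number only counts as PII if the
    checksum passes, filtering out most random 13-19 digit strings."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9       # equivalent to summing the two digits
        total += d
    return total % 10 == 0

print(luhn_valid("4532015112830366"))  # True  (checksum passes)
print(luhn_valid("4532015112830367"))  # False (last digit off by one)
```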

Multi-Turn HTML Reports

  • Mermaid sequence diagrams showing conversation turns with tool calls
  • Per-turn query and tool breakdown in the Execution Trace tab

Security

  • GitHub Action: replaced eval $CMD with bash arrays, moved inputs to env vars
  • Mermaid diagrams: fixed autoescape breaking arrows, sanitized user content

README

  • New hero section with logo, sequence diagram screenshot, data flow diagram
  • "Your data stays local" privacy explanation
  • Updated model version examples to Claude 4.5/4.6

Full Changelog: v0.4.0...v0.4.1

v0.4.0 — Multi-turn testing, A/B comparison, Cloud sync

05 Mar 10:55


What's new in 0.4.0

Multi-turn conversation testing

Test stateful, multi-step conversations with the new turns: YAML field. Each turn gets the accumulated conversation history injected automatically.

name: flight-booking-conversation
turns:
  - query: "I want to fly from NYC to Paris next Friday"
    expected:
      tools: [search_flights]
  - query: "Book the cheapest economy option"
    expected:
      tools: [book_flight]
      output:
        contains: ["confirmed", "Paris"]
  - query: "Send me a confirmation email"
    expected:
      tools: [send_email]
expected:
  tools: [search_flights, book_flight, send_email]
thresholds:
  min_score: 80

A/B endpoint comparison

Run the same test suite against two endpoints and get a per-test verdict table.

evalview compare \
  --v1 http://prod.internal/invoke --label-v1 "gpt-4o (prod)" \
  --v2 http://staging.internal/invoke --label-v2 "claude-sonnet (staging)" \
  --tests tests/

Cloud baseline sync

evalview login      # OAuth sign-in
evalview snapshot   # baselines auto-sync to cloud
evalview check      # teammates pull your baselines automatically

Other highlights

  • evalview capture — HTTP proxy records real agent traffic as test YAMLs
  • evalview install-hooks — inject regression checks into git pre-push
  • Silent model update detection — alerts when provider swaps model behind same API name
  • Gradual drift detection — OLS regression over 10-check window
  • Semantic diff — --semantic-diff scores by meaning, not character similarity
  • Auto-open HTML report after every evalview run
  • evalview init now auto-detects your agent endpoint and generates starter tests
  • Test quality gating — low-quality generated tests are skipped, not silently polluting scores
  • mypy clean — 0 errors across 109 source files
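The gradual drift detection above fits a line to recent scores; a sustained negative slope flags decay that no single check would catch. An illustrative OLS sketch under assumed data, not evalview's implementation:

```python
def drift_slope(scores: list) -> float:
    """Ordinary-least-squares slope of score vs. check index over a window."""
    n = len(scores)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Last 10 checks, each half a point lower than the one before.
window = [88.0, 87.5, 87.0, 86.5, 86.0, 85.5, 85.0, 84.5, 84.0, 83.5]
print(drift_slope(window))  # -0.5 (half a point lost per check)
```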

Full changelog: CHANGELOG.md

v0.3.2 — Fix nested claude auth for MCP users

27 Feb 11:10


What's fixed

claude-code adapter: auth failure in MCP context

The adapter was failing immediately (~3-4s) with "Invalid API key" when invoked through the MCP chain. Root cause: Claude Code sets ANTHROPIC_API_KEY to a session-scoped token in its subprocess environment, which the inner claude --print inherited and the Anthropic API rejected.

Fix: Strip ANTHROPIC_API_KEY from the adapter's env so the inner claude falls back to ~/.claude.json credentials (stored by claude auth login).
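The shape of the fix is simple: launch the subprocess with a copy of the environment that omits the poisoned variable. A minimal sketch, assuming a `subprocess` launch:

```python
import os
import subprocess

# Copy the parent environment but drop the session-scoped key, so the
# inner `claude --print` falls back to ~/.claude.json credentials.
env = {k: v for k, v in os.environ.items() if k != "ANTHROPIC_API_KEY"}
# subprocess.run(["claude", "--print", prompt], env=env, check=True)
print("ANTHROPIC_API_KEY" in env)  # False
```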

custom adapter: works for OAuth users (no API key needed)

The demo runner.py used the Anthropic SDK directly, which requires ANTHROPIC_API_KEY. Claude Code OAuth users don't have this env var set.

Fix: Rewrote runner to use claude --print subprocess (same auth path as the claude-code adapter).

MCP server: skill test timeout raised to 600s

Multi-test suites (10 tests × ~15s each) were hitting the previous 120s timeout.

Other improvements

  • Non-interactive mode for generate-tests (--auto / no TTY)
  • Better first-snapshot and first-check celebration panels with CI integration steps
  • 60s asyncio timeout on LLM calls in test generator
  • Actionable hints when skill dependencies (e.g. mcporter) are missing

v0.3.0 — Claude Code MCP + Skills Testing + Telemetry

20 Feb 10:41


What's New in 0.3

🤖 Claude Code MCP Integration

EvalView now runs as an MCP server inside Claude Code — test your agent without leaving the conversation.

claude mcp add --transport stdio evalview -- evalview mcp serve
cp CLAUDE.md.example CLAUDE.md

7 MCP tools available:

| Tool | What it does |
| --- | --- |
| create_test | Generate test cases from natural language |
| run_snapshot | Capture golden baseline |
| run_check | Detect regressions inline |
| list_tests | Show all baselines |
| validate_skill | Validate SKILL.md structure |
| generate_skill_tests | Auto-generate skill test suite |
| run_skill_test | Run Phase 1 (deterministic) + Phase 2 (rubric) |

📊 Telemetry Improvements

  • Users now show as EvalView-3f8a2b instead of raw UUIDs in PostHog
  • Session duration tracking (session_duration_ms)
  • Set EVALVIEW_DEV=1 to tag your own events for filtering

🐕 Dogfood Regression Testing

EvalView now tests itself using its own evaluation logic on every CI run.

Bug Fixes

  • Fixed PIPESTATUS CI bug (regression checks now correctly fail CI)
  • Fixed deprecated asyncio.get_event_loop() → get_running_loop()
  • Fixed silent failures in --json mode
  • ANSI escape stripping improved in MCP output
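The PIPESTATUS bug is the classic shell pitfall: a pipeline's exit status is the last stage's, so a failing check piped through another command exits 0 and CI stays green. Demonstrated here via Python driving bash (assumes bash is available):

```python
import subprocess

# Without pipefail, the pipe's exit status comes from `cat`, not `false`.
plain = subprocess.run(["bash", "-c", "false | cat"])
print(plain.returncode)  # 0 — the failure is swallowed

# `set -o pipefail` (or checking ${PIPESTATUS[0]}) surfaces the failure.
strict = subprocess.run(["bash", "-c", "set -o pipefail; false | cat"])
print(strict.returncode)  # 1 — CI fails as intended
```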

Upgrade

pip install --upgrade evalview

v0.2.9 — Clean MCP output

19 Feb 06:39


Bug Fix

  • Strip ANSI escape codes from MCP tool output so list_tests and run_snapshot return clean text to Claude Code
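ANSI stripping usually means removing CSI sequences, which start with `ESC [` and end at the first letter byte. A common regex sketch (evalview's exact pattern may differ):

```python
import re

# CSI sequences: ESC '[' followed by parameter digits/semicolons and a
# final letter (e.g. \x1b[32m for green, \x1b[0m for reset).
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def strip_ansi(text: str) -> str:
    return ANSI_RE.sub("", text)

colored = "\x1b[32mPASS\x1b[0m flight-booking"
print(strip_ansi(colored))  # PASS flight-booking
```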

Upgrade

```bash
pip install --upgrade evalview
```

v0.2.8 — snapshot/check end-to-end fixes

19 Feb 06:33


Bug Fixes

  • Fixed adapter.run() → adapter.execute() in snapshot/check code paths (run() didn't exist)
  • Fixed Evaluator.evaluate() called without await — coroutine was never running
  • Fixed _create_adapter not passing allow_private_urls to HTTPAdapter (localhost blocked)
  • Version now read dynamically from package metadata — no more hardcoded strings
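The missing-`await` bug is easy to reproduce: calling an async function returns a coroutine object without running any of its body. A minimal sketch with a stand-in `evaluate`:

```python
import asyncio

async def evaluate() -> int:
    return 100

async def main() -> None:
    broken = evaluate()              # coroutine object — nothing ran yet
    print(type(broken).__name__)     # coroutine
    broken.close()                   # avoid the "never awaited" warning
    score = await evaluate()         # the fix: await actually runs it
    print(score)                     # 100

asyncio.run(main())
```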

Result

evalview snapshot and evalview check now work end-to-end against real agents. Tested against the mock agent at 100/100.

Upgrade

```bash
pip install --upgrade evalview
```