Releases: hidai25/eval-view

v0.5.3

18 Mar 22:10

HTML Report Redesign

Overview tab — Compact KPI strip replaces 6-card hero grid. Removed duplicate Agent Model/Token Usage cards and Distribution donut. Full-width score chart. No-judge notice when hallucination/safety checks are skipped.

Execution Trace tab — Adaptive collapse: with ≤4 tests, all are expanded; with 5 or more, only the first. Larger chevron buttons.

Diffs tab — Collapsible items (passed collapsed, changes expanded). Removed duplicate tool tags. Lazy-rendered trajectory diagrams behind toggle. Baseline→current score display (86.0 → 87.5 +1.5). Tooltips on lexical/semantic similarity.

Timeline tab — KPI summary strip. Side-by-side latency + cost charts. Color-coded bars by test.

All tabs — Larger Mermaid diagram fonts. Removed SVG max-height cap.

v0.5.2

17 Mar 08:57

What's New in v0.5.2

Cold-Start Test Generation (evalview generate)

  • Production-grade test generation from live agent probing — no manual YAML writing needed
  • Interactive probe budget and model selection
  • Multi-turn conversation tests generated as single cohesive test cases
  • Domain-aware draft generation with coherence filtering
  • --synth-model flag to override the synthesis model
  • Real-time elapsed timer during probe runs
  • Delta reporting: shows changes since last generation

Improved Reports

  • Model and token usage displayed in HTML reports
  • Judge cost tracking surfaced in check reports
  • Per-query model shown in trace cost breakdown
  • Cleaner baseline metadata and timeline in check reports
  • Turn-level details with clickable chevrons in multi-turn traces

Better Onboarding (evalview init)

  • Remembers active test suite for plain snapshot and check
  • Auto-approves generated drafts with scoped snapshot guidance
  • Detects local agents on /execute and /health endpoints
  • Refreshes stale config when a live agent is detected
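The endpoint detection in `evalview init` can be pictured as a quick probe of the conventional paths. This is an illustrative sketch, not evalview's actual code; the function names are hypothetical:

```python
import urllib.request

def candidate_endpoints(base: str) -> list[str]:
    # The two paths init checks on a local agent, per the release notes.
    return [f"{base}/execute", f"{base}/health"]

def is_live(url: str, timeout: float = 1.0) -> bool:
    # Treat any response below 500 as "an agent is listening here".
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except OSError:
        return False

print(candidate_endpoints("http://localhost:8000"))
# ['http://localhost:8000/execute', 'http://localhost:8000/health']
```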

Check Command Improvements

  • Shows last baseline snapshot timestamp
  • Auto-generates local HTML report on failures
  • Streamlined regression demo flow

Model Support

  • GPT-5 family model support (gpt-5.4, gpt-5.4-mini)
  • Interactive model selection from available providers

Multi-Turn & Monitoring

  • Multi-turn golden baselines with per-turn tool sequences
  • Cost/latency spike alerts in monitor mode
  • Batch edge-case expansion for test coverage

Bug Fixes

  • Fix multi-turn filter — different output is meaningful regardless of tools
  • Fix probe progress for skipped follow-ups
  • Predictable timing — 1 discovery, multi-turn counts against budget
  • Always show agent model in run output
  • Eliminate duplicate multi-turn tests
  • Silence Ollama JSON fallback warnings in normal runs

Docs

  • Trimmed README from 1420 to 274 lines — details moved to dedicated docs
  • Comparison docs and SEO content added

v0.5.1

13 Mar 20:13

What's New

Added

  • evalview generate — draft test suite generation from agent probing or log imports, with approval gating and CI review flow
  • Approval workflow — generated tests require explicit approval before becoming baselines
  • CI review comments — evalview ci comment posts generation reports on PRs

Fixed

  • Python 3.9 compatibility: replaced datetime.UTC with timezone.utc
  • Mypy type errors in generate command and test generation module
  • Codebase refactor and cleanup across 71 files
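The Python 3.9 fix comes down to one spelling change: `datetime.UTC` is an alias that only exists from Python 3.11, while `timezone.utc` refers to the same object and has been available since Python 3.2. A minimal sketch:

```python
from datetime import datetime, timezone

# datetime.UTC (3.11+) and timezone.utc (3.2+) are the same tzinfo object,
# so timezone.utc is the 3.9-compatible spelling.
now = datetime.now(timezone.utc)
print(now.tzinfo)  # UTC
```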

Full Changelog: v0.5.0...v0.5.1

v0.5.0 — Production Monitoring

12 Mar 12:05

What's New

Production Monitoring (evalview monitor)

  • Continuous regression detection — runs evalview check in a loop with configurable interval (default: 5 min)
  • Slack alerts — webhook notifications on new regressions, recovery notifications when resolved
  • Smart dedup — only alerts on NEW failures, no re-alerts on persistent issues
  • JSONL history export — --history monitor.jsonl appends cycle data for trend analysis and dashboards
  • Graceful shutdown — Ctrl+C stops cleanly with cost summary
  • Config support — CLI flags, config.yaml, or EVALVIEW_SLACK_WEBHOOK env var
evalview monitor                                         # Check every 5 min
evalview monitor --interval 60                           # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl                 # Save trends
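The smart dedup described above reduces to comparing the failing set of the current cycle against the previous one. A minimal sketch under assumed names (not evalview's internals):

```python
def new_failures(previous: set, current: set) -> tuple:
    """Return (newly failing, recovered) test IDs between two monitor cycles."""
    return current - previous, previous - current

prev = {"booking", "refund"}
curr = {"refund", "search"}
fresh, recovered = new_failures(prev, curr)
print(sorted(fresh))      # ['search']  -> send alert
print(sorted(recovered))  # ['booking'] -> send recovery notification
# 'refund' fails in both cycles, so no re-alert is sent.
```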

Community Contributions

  • CSV export — evalview check --csv results.csv (@muhammadrashid4587)
  • Timeout flag — evalview check --timeout 60 (@zamadye)
  • Better errors — human-friendly connection failure messages (@passionworkeer)
  • JSONL history — --history flag for monitor (@clawtom)

Bug Fixes & Refactoring

  • Fixed severity comparison bug (was using string matching instead of enum comparison)
  • Fixed JSONL history pass count (was using fail_on filter instead of actual counts)
  • Extracted shared _parse_fail_statuses utility for consistent fail_on parsing
  • Eliminated redundant config loading in monitor loop
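A shared parser like the extracted `_parse_fail_statuses` typically just normalizes the comma-separated `fail_on` value once, so every caller agrees on casing and whitespace. This is a hypothetical sketch, including the allowed status names:

```python
def parse_fail_statuses(fail_on: str) -> set:
    # Hypothetical: normalize a comma-separated fail_on value into a
    # canonical set, tolerating spaces and mixed case.
    allowed = {"fail", "error", "warn"}
    statuses = {s.strip().lower() for s in fail_on.split(",") if s.strip()}
    unknown = statuses - allowed
    if unknown:
        raise ValueError(f"unknown fail_on statuses: {sorted(unknown)}")
    return statuses

print(sorted(parse_fail_statuses("Fail, ERROR")))  # ['error', 'fail']
```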

Deployment

# Quick background run
nohup evalview monitor --slack-webhook https://... &

# Docker
docker run -d -v $(pwd):/app -w /app evalview monitor --slack-webhook https://...

Full Changelog: v0.4.1...v0.5.0

v0.4.1

09 Mar 09:47


What's New

Mistral Adapter

  • Direct Mistral API support via pip install evalview[mistral]
  • Lazy import — no dependency unless you use it

PII Evaluator

  • Opt-in detection for emails, phones, SSNs, credit cards, addresses
  • Luhn algorithm validation for credit cards to reduce false positives
  • Enable with checks: { pii: true } in test YAML
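The Luhn check works by doubling every second digit from the right and requiring the digit sum to be divisible by 10, which rejects most random digit strings that merely look like card numbers. A standard implementation sketch (not necessarily evalview's exact code):

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: a candidate card number only counts as PII if the
    checksum passes, filtering out most random 13-19 digit strings."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9       # equivalent to summing the two digits
        total += d
    return total % 10 == 0

print(luhn_valid("4532015112830366"))  # True  (checksum passes)
print(luhn_valid("4532015112830367"))  # False (last digit off by one)
```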

Multi-Turn HTML Reports

  • Mermaid sequence diagrams showing conversation turns with tool calls
  • Per-turn query and tool breakdown in the Execution Trace tab

Security

  • GitHub Action: replaced eval $CMD with bash arrays, moved inputs to env vars
  • Mermaid diagrams: fixed autoescape breaking arrows, sanitized user content

README

  • New hero section with logo, sequence diagram screenshot, data flow diagram
  • "Your data stays local" privacy explanation
  • Updated model version examples to Claude 4.5/4.6

Full Changelog: v0.4.0...v0.4.1

v0.4.0 — Multi-turn testing, A/B comparison, Cloud sync

05 Mar 10:55


What's new in 0.4.0

Multi-turn conversation testing

Test stateful, multi-step conversations with the new turns: YAML field. Each turn gets the accumulated conversation history injected automatically.

name: flight-booking-conversation
turns:
  - query: "I want to fly from NYC to Paris next Friday"
    expected:
      tools: [search_flights]
  - query: "Book the cheapest economy option"
    expected:
      tools: [book_flight]
      output:
        contains: ["confirmed", "Paris"]
  - query: "Send me a confirmation email"
    expected:
      tools: [send_email]
expected:
  tools: [search_flights, book_flight, send_email]
thresholds:
  min_score: 80

A/B endpoint comparison

Run the same test suite against two endpoints and get a per-test verdict table.

evalview compare \
  --v1 http://prod.internal/invoke --label-v1 "gpt-4o (prod)" \
  --v2 http://staging.internal/invoke --label-v2 "claude-sonnet (staging)" \
  --tests tests/

Cloud baseline sync

evalview login      # OAuth sign-in
evalview snapshot   # baselines auto-sync to cloud
evalview check      # teammates pull your baselines automatically

Other highlights

  • evalview capture — HTTP proxy records real agent traffic as test YAMLs
  • evalview install-hooks — inject regression checks into git pre-push
  • Silent model update detection — alerts when provider swaps model behind same API name
  • Gradual drift detection — OLS regression over 10-check window
  • Semantic diff — --semantic-diff scores by meaning, not character similarity
  • Auto-open HTML report after every evalview run
  • evalview init now auto-detects your agent endpoint and generates starter tests
  • Test quality gating — low-quality generated tests are skipped, not silently polluting scores
  • mypy clean — 0 errors across 109 source files
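The gradual drift detection above fits a line to recent scores; a sustained negative slope flags decay that no single check would catch. An illustrative OLS sketch under assumed data, not evalview's implementation:

```python
def drift_slope(scores: list) -> float:
    """Ordinary-least-squares slope of score vs. check index over a window."""
    n = len(scores)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Last 10 checks, each half a point lower than the one before.
window = [88.0, 87.5, 87.0, 86.5, 86.0, 85.5, 85.0, 84.5, 84.0, 83.5]
print(drift_slope(window))  # -0.5 (half a point lost per check)
```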

Full changelog: CHANGELOG.md

v0.3.2 — Fix nested claude auth for MCP users

27 Feb 11:10


What's fixed

claude-code adapter: auth failure in MCP context

The adapter was failing immediately (~3-4s) with "Invalid API key" when invoked through the MCP chain. Root cause: Claude Code sets ANTHROPIC_API_KEY to a session-scoped token in its subprocess environment, which the inner claude --print inherited and the Anthropic API rejected.

Fix: Strip ANTHROPIC_API_KEY from the adapter's env so the inner claude falls back to ~/.claude.json credentials (stored by claude auth login).
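The shape of the fix is simple: launch the subprocess with a copy of the environment that omits the poisoned variable. A minimal sketch, assuming a `subprocess` launch:

```python
import os
import subprocess

# Copy the parent environment but drop the session-scoped key, so the
# inner `claude --print` falls back to ~/.claude.json credentials.
env = {k: v for k, v in os.environ.items() if k != "ANTHROPIC_API_KEY"}
# subprocess.run(["claude", "--print", prompt], env=env, check=True)
print("ANTHROPIC_API_KEY" in env)  # False
```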

custom adapter: works for OAuth users (no API key needed)

The demo runner.py used the Anthropic SDK directly, which requires ANTHROPIC_API_KEY. Claude Code OAuth users don't have this env var set.

Fix: Rewrote runner to use claude --print subprocess (same auth path as the claude-code adapter).

MCP server: skill test timeout raised to 600s

Multi-test suites (10 tests × ~15s each) were hitting the previous 120s timeout.

Other improvements

  • Non-interactive mode for generate-tests (--auto / no TTY)
  • Better first-snapshot and first-check celebration panels with CI integration steps
  • 60s asyncio timeout on LLM calls in test generator
  • Actionable hints when skill dependencies (e.g. mcporter) are missing

v0.3.0 — Claude Code MCP + Skills Testing + Telemetry

20 Feb 10:41


What's New in 0.3

🤖 Claude Code MCP Integration

EvalView now runs as an MCP server inside Claude Code — test your agent without leaving the conversation.

claude mcp add --transport stdio evalview -- evalview mcp serve
cp CLAUDE.md.example CLAUDE.md

7 MCP tools available:

| Tool | What it does |
| --- | --- |
| create_test | Generate test cases from natural language |
| run_snapshot | Capture golden baseline |
| run_check | Detect regressions inline |
| list_tests | Show all baselines |
| validate_skill | Validate SKILL.md structure |
| generate_skill_tests | Auto-generate skill test suite |
| run_skill_test | Run Phase 1 (deterministic) + Phase 2 (rubric) |

📊 Telemetry Improvements

  • Users now show as EvalView-3f8a2b instead of raw UUIDs in PostHog
  • Session duration tracking (session_duration_ms)
  • Set EVALVIEW_DEV=1 to tag your own events for filtering

🐕 Dogfood Regression Testing

EvalView now tests itself using its own evaluation logic on every CI run.

Bug Fixes

  • Fixed PIPESTATUS CI bug (regression checks now correctly fail CI)
  • Fixed deprecated asyncio.get_event_loop() → get_running_loop()
  • Fixed silent failures in --json mode
  • ANSI escape stripping improved in MCP output
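The PIPESTATUS bug is the classic shell pitfall: a pipeline's exit status is the last stage's, so a failing check piped through another command exits 0 and CI stays green. Demonstrated here via Python driving bash (assumes bash is available):

```python
import subprocess

# Without pipefail, the pipe's exit status comes from `cat`, not `false`.
plain = subprocess.run(["bash", "-c", "false | cat"])
print(plain.returncode)  # 0 — the failure is swallowed

# `set -o pipefail` (or checking ${PIPESTATUS[0]}) surfaces the failure.
strict = subprocess.run(["bash", "-c", "set -o pipefail; false | cat"])
print(strict.returncode)  # 1 — CI fails as intended
```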

Upgrade

pip install --upgrade evalview

v0.2.9 — Clean MCP output

19 Feb 06:39


Bug Fix

  • Strip ANSI escape codes from MCP tool output so list_tests and run_snapshot return clean text to Claude Code
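ANSI stripping usually means removing CSI sequences, which start with `ESC [` and end at the first letter byte. A common regex sketch (evalview's exact pattern may differ):

```python
import re

# CSI sequences: ESC '[' followed by parameter digits/semicolons and a
# final letter (e.g. \x1b[32m for green, \x1b[0m for reset).
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def strip_ansi(text: str) -> str:
    return ANSI_RE.sub("", text)

colored = "\x1b[32mPASS\x1b[0m flight-booking"
print(strip_ansi(colored))  # PASS flight-booking
```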

Upgrade

```bash
pip install --upgrade evalview
```

v0.2.8 — snapshot/check end-to-end fixes

19 Feb 06:33


Bug Fixes

  • Fixed adapter.run() → adapter.execute() in snapshot/check code paths (run() didn't exist)
  • Fixed Evaluator.evaluate() called without await — coroutine was never running
  • Fixed _create_adapter not passing allow_private_urls to HTTPAdapter (localhost blocked)
  • Version now read dynamically from package metadata — no more hardcoded strings
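The missing-`await` bug is easy to reproduce: calling an async function returns a coroutine object without running any of its body. A minimal sketch with a stand-in `evaluate`:

```python
import asyncio

async def evaluate() -> int:
    return 100

async def main() -> None:
    broken = evaluate()              # coroutine object — nothing ran yet
    print(type(broken).__name__)     # coroutine
    broken.close()                   # avoid the "never awaited" warning
    score = await evaluate()         # the fix: await actually runs it
    print(score)                     # 100

asyncio.run(main())
```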

Result

evalview snapshot and evalview check now work end-to-end against real agents. Tested against the mock agent at 100/100.

Upgrade

```bash
pip install --upgrade evalview
```