@@ -42,10 +42,10 @@ EvalView is a **testing framework for AI agents**.
 
 It lets you:
 
-- 🧪 **Write tests in YAML** that describe inputs, expected tools, and acceptance thresholds
-- 🔁 **Turn real conversations into regression suites** (record → generate tests → re-run on every change)
-- 🚦 **Gate deployments in CI** on behavior, tool calls, cost, and latency
-- 🧩 Plug into **LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HTTP agents**, and more
+- **Write tests in YAML** that describe inputs, expected tools, and acceptance thresholds
+- **Turn real conversations into regression suites** (record → generate tests → re-run on every change)
+- **Gate deployments in CI** on behavior, tool calls, cost, and latency
+- Plug into **LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HTTP agents**, and more
 
 Think: _"pytest / Playwright mindset, but for multi-step agents and tool-calling workflows."_
 
@@ -122,10 +122,10 @@ evalview quickstart
 
 You'll see a full run with:
 
-- ✅ A demo agent spinning up
-- ✅ A test case created for you
-- ✅ A config file wired up
-- 📊 A scored test: tools used, output quality, cost, latency
+- A demo agent spinning up
+- A test case created for you
+- A config file wired up
+- A scored test: tools used, output quality, cost, latency
 
 ### Run examples directly (no config needed)
 
@@ -259,10 +259,10 @@ Database config is optional – EvalView only uses it if you enable it in config
 
 ## Why EvalView?
 
-- 🔓 **Fully Open Source** – Apache 2.0 licensed, runs entirely on your infra, no SaaS lock-in
-- 🔌 **Framework-agnostic** – Works with LangGraph, CrewAI, OpenAI, Anthropic, or any HTTP API
-- 🚀 **Production-ready** – Parallel execution, CI/CD integration, configurable thresholds
-- 🧩 **Extensible** – Custom adapters, evaluators, and reporters for your stack
+- **Fully Open Source** – Apache 2.0 licensed, runs entirely on your infra, no SaaS lock-in
+- **Framework-agnostic** – Works with LangGraph, CrewAI, OpenAI, Anthropic, or any HTTP API
+- **Production-ready** – Parallel execution, CI/CD integration, configurable thresholds
+- **Extensible** – Custom adapters, evaluators, and reporters for your stack
 
 ---
 
@@ -357,7 +357,7 @@ $ evalview run
 
 ---
 
-## 🚀 Generate 1000 Tests from 1
+## Generate 1000 Tests from 1
 
 **Problem:** Writing tests manually is slow. You need volume to catch regressions.
 
@@ -387,9 +387,9 @@ evalview record --interactive
 ```
 
 EvalView captures:
-- ✅ Query → Tools called → Output
-- ✅ Auto-generates test YAML
-- ✅ Adds reasonable thresholds
+- Query → Tools called → Output
+- Auto-generates test YAML
+- Adds reasonable thresholds
 
 **Result:** Go from 5 manual tests → 500 comprehensive tests in minutes.
 
@@ -411,40 +411,40 @@ evalview run
 ```
 
 Supports 7+ frameworks with automatic detection:
-✅ LangGraph • ✅ CrewAI • ✅ OpenAI Assistants • ✅ Anthropic Claude • ✅ AutoGen • ✅ Dify • ✅ Custom APIs
+LangGraph • CrewAI • OpenAI Assistants • Anthropic Claude • AutoGen • Dify • Custom APIs
 
 ---
 
-## ☁️ EvalView Cloud (Coming Soon)
+## EvalView Cloud (Coming Soon)
 
 We're building a hosted version:
 
-- 📊 **Dashboard** - Visual test history, trends, and pass/fail rates
-- 👥 **Teams** - Share results and collaborate on fixes
-- 🔔 **Alerts** - Slack/Discord notifications on failures
-- 📈 **Regression detection** - Automatic alerts when performance degrades
-- ⚡ **Parallel runs** - Run hundreds of tests in seconds
+- **Dashboard** - Visual test history, trends, and pass/fail rates
+- **Teams** - Share results and collaborate on fixes
+- **Alerts** - Slack/Discord notifications on failures
+- **Regression detection** - Automatic alerts when performance degrades
+- **Parallel runs** - Run hundreds of tests in seconds
 
-👉 **[Join the waitlist](https://form.typeform.com/to/EQO2uqSa)** - be first to get access
+**[Join the waitlist](https://form.typeform.com/to/EQO2uqSa)** - be first to get access
 
 ---
 
 ## Features
 
-- 🚀 **Test Expansion** - Generate 100+ test variations from a single seed test
-- 🎥 **Test Recording** - Auto-generate tests from live agent interactions
-- ✅ **YAML-based test cases** - Write readable, maintainable test definitions
-- ⚡ **Parallel execution** - Run tests concurrently (8x faster by default)
-- 📊 **Multiple evaluation metrics** - Tool accuracy, sequence correctness, output quality, cost, and latency
-- 🤖 **LLM-as-judge** - Automated output quality assessment
-- 💰 **Cost tracking** - Automatic cost calculation based on token usage
-- 🔌 **Universal adapters** - Works with any HTTP or streaming API
-- 🎨 **Rich console output** - Beautiful, informative test results
-- 📁 **JSON & HTML reports** - Interactive HTML reports with Plotly charts
-- 🔄 **Retry logic** - Automatic retries with exponential backoff for flaky tests
-- 👀 **Watch mode** - Re-run tests automatically on file changes
-- ⚖️ **Configurable weights** - Customize scoring weights globally or per-test
-- 📊 **Statistical mode** - Run tests N times, get variance metrics and flakiness scores
+- **Test Expansion** - Generate 100+ test variations from a single seed test
+- **Test Recording** - Auto-generate tests from live agent interactions
+- **YAML-based test cases** - Write readable, maintainable test definitions
+- **Parallel execution** - Run tests concurrently (8x faster by default)
+- **Multiple evaluation metrics** - Tool accuracy, sequence correctness, output quality, cost, and latency
+- **LLM-as-judge** - Automated output quality assessment
+- **Cost tracking** - Automatic cost calculation based on token usage
+- **Universal adapters** - Works with any HTTP or streaming API
+- **Rich console output** - Beautiful, informative test results
+- **JSON & HTML reports** - Interactive HTML reports with Plotly charts
+- **Retry logic** - Automatic retries with exponential backoff for flaky tests
+- **Watch mode** - Re-run tests automatically on file changes
+- **Configurable weights** - Customize scoring weights globally or per-test
+- **Statistical mode** - Run tests N times, get variance metrics and flakiness scores
 
 ---
 