Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.
Command-line interface for the agent-eval-harness ecosystem. Provides 7 subcommands for full evaluation runs, on-the-fly LLM judging, baseline comparison, CI gate checking, golden trajectory management, multi-format reporting, and an MCP server in stdio mode.
## Installation

```bash
npm install @reaatech/agent-eval-harness-cli
# or
npm install -g @reaatech/agent-eval-harness-cli
```

## Features

- 7 subcommands — `eval`, `judge`, `compare`, `gate`, `golden`, `report`, `serve`
- Full evaluation pipeline — load trajectories from files or directories, run multi-metric evaluation, output results as JSON or CSV
- On-the-fly judging — evaluate faithfulness, relevance, tool correctness, or overall quality with a single command
- CI gate checking — evaluate gate presets (standard, strict, lenient) against results with exit codes for pipeline integration
- Golden trajectory management — list, create, update, and validate golden reference trajectories
- Multi-format reporting — JSON, HTML, Markdown, and PDF output for evaluation results
- MCP server — stdio-mode MCP server exposing all 13 eval tools to AI coding agents
## Quick Start

```bash
# Install globally
npm install -g @reaatech/agent-eval-harness-cli

# Run evaluation on a directory of JSONL trajectories
agent-eval-harness eval trajectories/ --config eval-config.yaml --output results/

# Judge a single response on faithfulness
agent-eval-harness judge faithfulness \
  --context "The user's account is associated with email john@example.com" \
  --response "I've sent the password reset to john@example.com"

# Compare two evaluation runs
agent-eval-harness compare results/baseline.json results/candidate.json --format markdown

# Check CI regression gates
agent-eval-harness gate results/results.json --preset standard --exit-code

# List golden trajectories
agent-eval-harness golden --list

# Generate HTML report
agent-eval-harness report results/results.json --format html --output report.html

# Start MCP server
agent-eval-harness serve
```

## Usage

```
agent-eval-harness [global-options] <command> [command-options]
```
### Global Options

| Flag | Type | Default | Description |
|---|---|---|---|
| `-v, --verbose` | boolean | `false` | Enable verbose output |
| `-c, --config <path>` | string | `eval-config.yaml` | Path to configuration file |
| `-o, --output <path>` | string | `results` | Output directory for results |
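The schema of `eval-config.yaml` is not documented in this README. The fragment below is a purely hypothetical sketch — every key name is an assumption, chosen to mirror the documented `eval` flags — of what such a file might contain:

```yaml
# Hypothetical eval-config.yaml — key names are illustrative, not the real schema.
metrics:
  - faithfulness
  - tool_correctness
judge:
  model: claude-opus
budget: "10.00"
format: json
```

Consult the package's own documentation for the actual configuration schema.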
### eval

Run full evaluation on trajectory files or directories.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-g, --golden <path>` | string | — | Path to golden trajectory for comparison |
| `-m, --metrics <metrics>` | string | — | Comma-separated list of metrics to evaluate |
| `--judge-model <model>` | string | `claude-opus` | Model to use for LLM judge |
| `--no-judge` | boolean | `false` | Disable LLM judge evaluation |
| `--budget <budget>` | string | `10.00` | Cost budget limit (USD) |
| `-f, --format <format>` | string | `json` | Output format (`json`, `junit`, `csv`) |
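Trajectory files are JSONL, one trajectory per line. The real record schema is defined by `@reaatech/agent-eval-harness-types` and is not reproduced in this README; the record below is a hypothetical illustration only (all field names are assumptions) of the kind of data a trajectory captures:

```typescript
// Hypothetical trajectory record — field names are illustrative only;
// see @reaatech/agent-eval-harness-types for the real schema.
const trajectory = {
  id: "traj-001",
  intent: "Reset the user's password",
  steps: [
    { type: "tool_call", name: "lookup_account", input: { email: "john@example.com" } },
    { type: "response", text: "I've sent the password reset to john@example.com" },
  ],
};

// One JSONL line per trajectory:
const line = JSON.stringify(trajectory);
console.log(line);
```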
### judge

Run LLM judge on a specific evaluation aspect.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-t, --trajectory <path>` | string | — | Path to trajectory file |
| `--context <text>` | string | — | Context for faithfulness evaluation |
| `--response <text>` | string | — | Response to evaluate |
| `--intent <text>` | string | — | User intent for relevance evaluation |
| `--model <model>` | string | `claude-opus` | Model to use for judging |
| `--calibrated` | boolean | `false` | Use calibrated scores |

Valid aspects: `faithfulness`, `relevance`, `tool_correctness`, `overall`
### compare

Compare two evaluation runs.

| Flag | Type | Default | Description |
|---|---|---|---|
| `--statistical` | boolean | `false` | Run statistical significance tests |
| `-f, --format <format>` | string | `json` | Output format (`json`, `markdown`, `table`) |
### gate

Check regression gates against evaluation results.

| Flag | Type | Default | Description |
|---|---|---|---|
| `--gates <path>` | string | `gates.yaml` | Path to gate configuration file |
| `--preset <preset>` | string | `standard` | Gate preset (`standard`, `strict`, `lenient`) |
| `--exit-code` | boolean | `true` | Return CI-compatible exit code |
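The format of `gates.yaml` is not shown in this README. As a hypothetical sketch only — key names are assumptions, with threshold values borrowed from the `standard` preset documented later in this README — a custom gate file might look like:

```yaml
# Hypothetical gates.yaml — key names are illustrative, not the real schema.
gates:
  overall_quality: ">= 0.80"
  cost_usd: "<= 0.05"
  latency_p99_ms: "<= 5000"
  tool_correctness: ">= 0.90"
  faithfulness: ">= 0.80"
```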
### golden

Manage golden reference trajectories.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-l, --list` | boolean | `false` | List all golden trajectories |
| `-c, --create <path>` | string | — | Create new golden trajectory from file |
| `-u, --update <id>` | string | — | Update existing golden trajectory |
| `-d, --delete <id>` | string | — | Delete golden trajectory |
| `--validate <path>` | string | — | Validate golden trajectory quality |
| `--dir <path>` | string | `golden` | Golden trajectories directory |
### report

Generate evaluation reports.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-f, --format <format>` | string | `markdown` | Output format (`html`, `markdown`, `json`, `pdf`) |
| `-o, --output <path>` | string | — | Output file path |
| `--template <path>` | string | — | Custom report template |
| `--include-raw` | boolean | `false` | Include raw trajectory data in report |
### serve

Start the MCP server.

| Flag | Type | Default | Description |
|---|---|---|---|
| `-p, --port <port>` | string | `3000` | Server port |
| `--host <host>` | string | `localhost` | Server host |
| `--transport <transport>` | string | `http` | Transport type (`http`, `stdio`) |
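With the `stdio` transport, an MCP client launches the server as a subprocess. A typical client configuration entry might look like the following — the `mcpServers` key is the common MCP client convention (e.g. Claude Desktop); adjust the shape for your specific client:

```json
{
  "mcpServers": {
    "agent-eval-harness": {
      "command": "agent-eval-harness",
      "args": ["serve", "--transport", "stdio"]
    }
  }
}
```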
## Library Usage

Command functions and output helpers are available as library exports:

```typescript
import {
  evalCommand,
  judgeCommand,
  compareCommand,
  gateCommand,
  goldenCommand,
  reportCommand,
  cliOut,
  cliError,
  cliWarn,
} from "@reaatech/agent-eval-harness-cli";
```

### Types

| Type | Description |
|---|---|
| `EvalOptions` | Options interface for `evalCommand` |
| `JudgeOptions` | Options interface for `judgeCommand` |
| `CompareOptions` | Options interface for `compareCommand` |
| `GateOptions` | Options interface for `gateCommand` |
| `GoldenOptions` | Options interface for `goldenCommand` |
| `ReportOptions` | Options interface for `reportCommand` |
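The option interfaces themselves are not reproduced in this README. As a hypothetical sketch only — field names below are guesses derived from the documented `eval` flags, not the package's real `EvalOptions` — the mapping from flags to options presumably looks something like:

```typescript
// Hypothetical shape — the real EvalOptions ships with the package.
interface EvalOptionsSketch {
  golden?: string;                    // -g, --golden <path>
  metrics?: string;                   // -m, --metrics (comma-separated)
  judgeModel?: string;                // --judge-model, default "claude-opus"
  judge?: boolean;                    // --no-judge sets this to false
  budget?: string;                    // --budget, default "10.00"
  format?: "json" | "junit" | "csv";  // -f, --format, default "json"
}

const opts: EvalOptionsSketch = {
  metrics: "faithfulness,tool_correctness",
  judge: false,
  format: "csv",
};

const metricList = (opts.metrics ?? "").split(",");
console.log(metricList);
```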
## Docker

```bash
# Build the image
docker build -t agent-eval-harness .

# Run evaluation with mounted volumes
docker run -v ./trajectories:/app/trajectories \
  -v ./results:/app/results \
  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
  agent-eval-harness eval trajectories/ --output results/

# Start MCP server in stdio mode
docker run -i agent-eval-harness serve
```

## CI Integration

Use the `gate` subcommand in CI workflows to block regressions:
```yaml
# .github/workflows/eval.yml
name: Agent Evaluation

on:
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run evaluation suite
        run: |
          npx @reaatech/agent-eval-harness-cli eval trajectories/ \
            --config eval-config.yaml \
            --output results/

      - name: Run regression gates
        run: |
          npx @reaatech/agent-eval-harness-cli gate results/results.json \
            --preset standard \
            --exit-code

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
```

The `--exit-code` flag causes the command to exit with code 1 when any gate fails, failing the CI step.
## Gate Presets

Gate presets provide ready-made thresholds:

| Preset | Overall Quality | Cost Limit | Latency P99 | Tool Correctness | Faithfulness |
|---|---|---|---|---|---|
| `standard` | >= 0.80 | <= $0.05 | <= 5000ms | >= 0.90 | >= 0.80 |
| `strict` | >= 0.90 | <= $0.02 | <= 2000ms | >= 0.95 | >= 0.90 |
| `lenient` | >= 0.60 | <= $0.10 | <= 10000ms | >= 0.70 | >= 0.60 |
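The preset thresholds above translate directly into pass/fail checks. The sketch below is illustrative only — not the package's actual gate implementation, and the metric field names are assumptions — showing how the `standard` preset maps a run's metrics to failures and a CI exit code:

```typescript
// Illustrative sketch of a preset gate check; thresholds come from the
// `standard` row of the preset table. Field names are assumptions.
interface RunMetrics {
  overallQuality: number;
  costUsd: number;
  latencyP99Ms: number;
  toolCorrectness: number;
  faithfulness: number;
}

function checkStandardGates(m: RunMetrics): string[] {
  const failures: string[] = [];
  if (m.overallQuality < 0.80) failures.push("overall_quality");
  if (m.costUsd > 0.05) failures.push("cost");
  if (m.latencyP99Ms > 5000) failures.push("latency_p99");
  if (m.toolCorrectness < 0.90) failures.push("tool_correctness");
  if (m.faithfulness < 0.80) failures.push("faithfulness");
  return failures;
}

const failures = checkStandardGates({
  overallQuality: 0.85,
  costUsd: 0.07, // over the $0.05 standard limit
  latencyP99Ms: 3200,
  toolCorrectness: 0.92,
  faithfulness: 0.81,
});

// Under --exit-code semantics, any failure yields exit code 1.
const exitCode = failures.length > 0 ? 1 : 0;
console.log(failures, exitCode);
```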
## Related Packages

| Package | Description |
|---|---|
| @reaatech/agent-eval-harness-types | Shared domain types and schemas |
| @reaatech/agent-eval-harness-trajectory | Trajectory evaluation |
| @reaatech/agent-eval-harness-tool-use | Tool-use validation |
| @reaatech/agent-eval-harness-cost | Cost tracking |
| @reaatech/agent-eval-harness-latency | Latency monitoring |
| @reaatech/agent-eval-harness-judge | LLM-as-judge |
| @reaatech/agent-eval-harness-golden | Golden trajectories |
| @reaatech/agent-eval-harness-suite | Suite runner |
| @reaatech/agent-eval-harness-gate | CI gates |
| @reaatech/agent-eval-harness-mcp-server | MCP server |
| @reaatech/agent-eval-harness-observability | Observability |