# @reaatech/agent-eval-harness-observability


> **Status:** Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

OpenTelemetry tracing, metrics collection, structured logging, and in-memory dashboards for agent evaluation pipelines. Provides 7 pre-configured OTel instruments, Pino-based structured logging with automatic PII redaction, and a 24-hour dashboard manager with trend analysis and alerting.

## Installation

```sh
npm install @reaatech/agent-eval-harness-observability
# or
pnpm add @reaatech/agent-eval-harness-observability
```

## Feature Overview

- **OTel tracing** — automatic span generation for the `eval.run`, `trajectory.load`, `judge.evaluate`, and `gate.check` pipeline stages
- **7 pre-configured metrics** — `runs.total`, `trajectories.evaluated`, `judge.calls`, `judge.cost`, `gates.result`, `cost.per_task`, `latency.p99`
- **Pino structured logging** — JSON logs with automatic PII redaction (emails, phones, SSNs, API keys, tokens)
- **Tracing decorators** — `withTracing()` wrapper for adding custom spans with automatic context propagation
- **Dashboard manager** — in-memory 24-hour data retention with quality, cost, latency, and pass-rate panels and 4 alert types
- **Multiple exporters** — OTLP gRPC, Zipkin, and Console for local development

## Quick Start

```ts
import {
  getLogger,
  getTracingManager,
  getMetricsManager,
  getDashboardManager,
} from "@reaatech/agent-eval-harness-observability";

// Structured logging with automatic PII redaction
const logger = getLogger();
logger.info({ runId: "eval-123", trajectories: 50 }, "Evaluation started");
logger.error({ err: new Error("Connection lost") }, "Judge API call failed");

// Metrics recording
const metrics = getMetricsManager();
metrics.recordRun("success", 1);
metrics.recordTrajectories("production", 50);
metrics.recordJudgeCall("claude-opus", "success");
metrics.recordJudgeCost("claude-opus", 0.0234);
metrics.recordGateResult("overall-quality", true);
metrics.recordLatencyP99("evaluation", 3200);

// Dashboard with trend analysis and alerting
const dashboard = getDashboardManager();
dashboard.recordRun({
  overallMetrics: { overallScore: 0.87, avgCostPerTask: 0.05, latencyP99: 3200 },
  summary: { totalTrajectories: 50, passRate: 92 },
  metricBreakdown: { faithfulness: { avgScore: 0.85 } },
});

console.log(`Quality trend: ${dashboard.getSummary().trends.score}`);
console.log(`Active alerts: ${dashboard.getAlerts().length}`);
```

## API Reference

### Logger

```ts
import { getLogger, createChildLogger, setGlobalRunId, getGlobalRunId } from "@reaatech/agent-eval-harness-observability";
```

| Export | Description |
| --- | --- |
| `getLogger(config?)` | Returns the singleton `Logger` instance, configured lazily |
| `createChildLogger(bindings)` | Creates a child logger with additional context fields |
| `setGlobalRunId(runId)` | Sets the run ID for log correlation |
| `getGlobalRunId()` | Returns the current global run ID, or `null` |

#### LoggerConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `level` | `string` | `"info"` | Minimum log level (`trace`, `debug`, `info`, `warn`, `error`, `fatal`) |
| `format` | `"json" \| "pretty"` | `"pretty"` (dev), `"json"` (prod) | Log output format |
| `includeRunId` | `boolean` | `true` | Include run ID on every log line |
| `piiPatterns` | `RegExp[]` | emails, phones, SSNs, API keys, tokens | PII redaction patterns |
| `redactFields` | `string[]` | `password`, `secret`, `token`, `apiKey`, `api_key`, `authorization` | Field-level redaction |
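To illustrate how `redactFields` and `piiPatterns` interact, here is a minimal sketch of field- and pattern-based redaction. This is not the package's implementation — the replacement string, field matching, and regexes are assumptions for illustration:

```ts
// Sketch only: field names in REDACT_FIELDS are masked wholesale; string
// values are additionally scrubbed against PII regexes.
const REDACT_FIELDS = new Set(["password", "secret", "token", "apikey", "api_key", "authorization"]);
const PII_PATTERNS: RegExp[] = [
  /[\w.+-]+@[\w-]+\.[\w.]+/g, // emails
  /\b\d{3}-\d{2}-\d{4}\b/g,   // SSNs (dashed form)
];

function redact(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    if (REDACT_FIELDS.has(key.toLowerCase())) {
      out[key] = "[REDACTED]"; // field-level redaction
    } else if (typeof value === "string") {
      // pattern-level redaction inside string values
      out[key] = PII_PATTERNS.reduce((s, re) => s.replace(re, "[REDACTED]"), value);
    } else {
      out[key] = value;
    }
  }
  return out;
}
```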

#### Logger Instance Methods

| Method | Description |
| --- | --- |
| `trace(msg, ...args)` | Log at trace level |
| `debug(msg, ...args)` | Log at debug level |
| `info(msg, ...args)` | Log at info level |
| `warn(msg, ...args)` | Log at warn level |
| `error(msg, ...args)` | Log at error level |
| `fatal(msg, ...args)` | Log at fatal level |
| `child(bindings)` | Create child logger with additional context |
| `logEvalRunStart(runId, trajectoryCount, config)` | Log evaluation run start |
| `logEvalRunEnd(runId, metrics, duration)` | Log evaluation run completion |
| `logGateEvaluation(gateName, passed, reason)` | Log gate result |
| `logCost(runId, cost, breakdown)` | Log cost tracking |
| `logError(error, context?)` | Log error with optional context |

### Metrics

```ts
import { getMetricsManager, recordMetric, incrementCounter } from "@reaatech/agent-eval-harness-observability";
```

#### MetricsConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `serviceName` | `string` | `"agent-eval-harness"` | Service name for OTel resource |
| `enabled` | `boolean` | `true` | Enable metrics collection |
| `exporter` | `"otlp" \| "prometheus" \| "console" \| "none"` | `"console"` | Metrics exporter type |
| `otlpEndpoint` | `string` | — | OTLP collector endpoint |
| `prometheusPort` | `number` | — | Prometheus scrape port |
| `exportInterval` | `number` | `60000` | Export interval in milliseconds |

#### MetricsManager Instance Methods

| Method | Description |
| --- | --- |
| `init()` | Initialize metrics and register instruments |
| `recordRun(status, count?)` | Record evaluation run as counter |
| `recordTrajectories(dataset, count?)` | Record trajectories evaluated |
| `recordJudgeCall(model, status)` | Record judge API call |
| `recordJudgeCost(model, cost)` | Record judge cost as histogram |
| `recordCostPerTask(taskType, cost)` | Record cost per task |
| `recordGateResult(gateName, passed)` | Record gate pass/fail (1/0) |
| `recordLatencyP99(component, latencyMs)` | Record P99 latency |
| `recordBatchMetrics(metrics)` | Record multiple metrics in one call |
| `forceFlush()` | Force flush pending metrics |
| `shutdown()` | Shutdown metrics provider |

#### Standalone Helpers

| Export | Description |
| --- | --- |
| `recordMetric(name, value, attributes?)` | Record a metric by name to the current provider |
| `incrementCounter(name, value?, attributes?)` | Increment a counter by name |

#### 7 Pre-Configured OTel Instruments

| Name | Type | Unit | Description |
| --- | --- | --- | --- |
| `agent_eval.runs.total` | Counter | runs | Total evaluation runs |
| `agent_eval.trajectories.evaluated` | Counter | trajectories | Trajectories processed |
| `agent_eval.judge.calls` | Counter | calls | LLM judge API calls |
| `agent_eval.judge.cost` | Histogram | USD | Judge cost per run |
| `agent_eval.gates.result` | Histogram | boolean | Gate pass/fail (1/0) |
| `agent_eval.cost.per_task` | Histogram | USD | Cost per task |
| `agent_eval.latency.p99` | Histogram | ms | P99 latency per run |

### Tracing

```ts
import { getTracingManager, withTracing, addSpanAttributes } from "@reaatech/agent-eval-harness-observability";
```

#### TracingConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `serviceName` | `string` | `"agent-eval-harness"` | Service name for OTel resource |
| `version` | `string` | `"1.0.0"` | Service version |
| `enabled` | `boolean` | `true` | Enable tracing |
| `exporter` | `"otlp" \| "zipkin" \| "console" \| "none"` | `"console"` | Span exporter type |
| `otlpEndpoint` | `string` | — | OTLP collector endpoint |
| `zipkinEndpoint` | `string` | — | Zipkin collector endpoint |
| `sampleRate` | `number` | `1.0` | Sampling rate (0–1) |
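The effect of `sampleRate` can be sketched as a head-sampling decision: each new trace is kept with probability `sampleRate`. This is illustrative only — OTel's actual ratio-based sampler derives the decision deterministically from the trace ID rather than from `Math.random()`:

```ts
// Sketch of a head-sampling decision driven by sampleRate (0–1).
// The `random` parameter is injectable so the behavior is deterministic in tests.
function shouldSample(sampleRate: number, random: () => number = Math.random): boolean {
  if (sampleRate >= 1) return true;  // default: sample everything
  if (sampleRate <= 0) return false; // tracing effectively off
  return random() < sampleRate;
}
```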

#### TracingManager Instance Methods

| Method | Description |
| --- | --- |
| `init()` | Initialize tracing provider and register exporters |
| `startEvalRunSpan(runId, config)` | Create span for evaluation run |
| `startTrajectoryLoadSpan(path, format)` | Create span for trajectory loading |
| `startJudgeSpan(model, metric)` | Create span for judge evaluation |
| `startGateSpan(gateCount)` | Create span for gate checking |
| `endSpan(span, result?, error?)` | End span with optional result or error |
| `getCurrentContext()` | Get current OTel context |
| `injectContext(headers)` | Inject context into carrier headers |
| `extractContext(headers)` | Extract context from carrier headers |
| `shutdown()` | Shutdown tracing provider |
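`injectContext()` and `extractContext()` carry trace context between services via carrier headers; in OTel this is conventionally the W3C `traceparent` header. The sketch below shows that header's shape with hypothetical types — it is not the package's implementation, which delegates to the OTel propagation API:

```ts
// Illustrative roundtrip of a W3C traceparent header:
//   version "00" - 32 hex trace ID - 16 hex span ID - 2 hex flags
interface SpanContextLike { traceId: string; spanId: string; sampled: boolean; }

function inject(ctx: SpanContextLike, headers: Record<string, string>): void {
  headers["traceparent"] = `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function extract(headers: Record<string, string>): SpanContextLike | null {
  const raw = headers["traceparent"];
  if (!raw) return null;
  const [version, traceId, spanId, flags] = raw.split("-");
  if (version !== "00" || traceId?.length !== 32 || spanId?.length !== 16) return null;
  return { traceId, spanId, sampled: flags === "01" };
}
```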

#### Standalone Helpers

| Export | Description |
| --- | --- |
| `withTracing(spanName, fn, attributes?)` | Wrap async function with tracing span |
| `addSpanAttributes(attributes)` | Add attributes to current active span |

### Dashboard

```ts
import { getDashboardManager } from "@reaatech/agent-eval-harness-observability";
```

#### DashboardConfig

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `trendHours` | `number` | `24` | Time range for trend data in hours |
| `alertThresholds.qualityScore` | `number` | `0.8` | Alert when overall score drops below |
| `alertThresholds.costPerTask` | `number` | `0.05` | Alert when cost exceeds this value |
| `alertThresholds.latencyP99` | `number` | `5000` | Alert when P99 latency exceeds (ms) |
| `alertThresholds.passRate` | `number` | `0.95` | Alert when pass rate drops below |
| `trendWindow` | `number` | `3` | Number of data points for trend calculation |
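One way to read `trendWindow` is: compare the mean of the last N points against the mean of the N points before them. The sketch below illustrates that idea; the package's exact rule and tolerance (2% here) are assumptions, not its documented algorithm:

```ts
// Sketch: classify a metric series as "up" / "down" / "stable" by comparing
// the means of two adjacent windows of `window` points.
type Trend = "up" | "down" | "stable";

function classifyTrend(points: number[], window = 3, tolerance = 0.02): Trend {
  if (points.length < window * 2) return "stable"; // not enough data yet
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const recent = mean(points.slice(-window));
  const prior = mean(points.slice(-window * 2, -window));
  if (recent > prior * (1 + tolerance)) return "up";
  if (recent < prior * (1 - tolerance)) return "down";
  return "stable";
}
```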

#### DashboardManager Instance Methods

| Method | Description |
| --- | --- |
| `recordRun(results)` | Record evaluation metrics from `AggregatedResults` |
| `getMetrics()` | Get all metric series data |
| `getAlerts()` | Get current alert messages |
| `getTrendData(metric, points?)` | Get trend data for a specific metric |
| `getSummary()` | Get dashboard summary with trends and alerts |
| `generateDashboard()` | Generate full dashboard panels |

#### 4 Alert Types

| Alert | Condition | When |
| --- | --- | --- |
| `quality_drop` | `overallScore < alertThresholds.qualityScore` | Quality falls below threshold |
| `cost_spike` | `avgCostPerTask > alertThresholds.costPerTask` | Cost exceeds threshold |
| `latency_spike` | `latencyP99 > alertThresholds.latencyP99` | P99 latency exceeds threshold |
| `pass_rate_drop` | `passRate / 100 < alertThresholds.passRate` | Pass rate falls below threshold |
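The four conditions above can be sketched as a pure function over the configured thresholds. The input and output shapes here are illustrative; the package's actual alert objects may carry more fields:

```ts
// Evaluate the four alert conditions for one run.
// Note: passRate is in percent (e.g. 92), while the threshold is a fraction.
interface Thresholds { qualityScore: number; costPerTask: number; latencyP99: number; passRate: number; }
interface RunStats { overallScore: number; avgCostPerTask: number; latencyP99: number; passRate: number; }

function evaluateAlerts(run: RunStats, t: Thresholds): string[] {
  const alerts: string[] = [];
  if (run.overallScore < t.qualityScore) alerts.push("quality_drop");
  if (run.avgCostPerTask > t.costPerTask) alerts.push("cost_spike");
  if (run.latencyP99 > t.latencyP99) alerts.push("latency_spike");
  if (run.passRate / 100 < t.passRate) alerts.push("pass_rate_drop");
  return alerts;
}
```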

#### 4 Dashboard Panels

| Panel | Type | Metrics Tracked |
| --- | --- | --- |
| Quality | chart | `overall_score`, `pass_rate` |
| Performance | chart | `latency_p99`, `cost_per_task` |
| Key Statistics | stat | Current score and pass rate with trend direction |
| Alerts | alert | Active alert messages with values and thresholds |

#### DashboardSummary

```ts
interface DashboardSummary {
  totalRuns: number;
  currentScore: number | null;
  currentPassRate: number | null;
  currentCostPerTask: number | null;
  currentLatencyP99: number | null;
  activeAlerts: number;
  trends: {
    score: "up" | "down" | "stable";
    passRate: "up" | "down" | "stable";
  };
}
```

## Usage Patterns

### Custom Spans with withTracing

Wrap any async operation to automatically create, time, and finalize an OTel span:

```ts
import { withTracing, addSpanAttributes } from "@reaatech/agent-eval-harness-observability";

const result = await withTracing(
  "custom_validation",
  async (span) => {
    // Span is active throughout this block
    addSpanAttributes({ validation_type: "schema", schema_version: "2.1" });

    const isValid = await validateInput(payload);
    return { isValid, timestamp: Date.now() };
  },
  { "custom.attribute": "value" },
);

// Span automatically ends — success status on return, error on throw
```
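Conceptually, a `withTracing()`-style wrapper is a try/catch/finally around the span lifecycle. The sketch below shows that shape with a minimal mock span; the real implementation uses the OTel API and these names are illustrative:

```ts
// Mock span lifecycle: status set on return or throw, span always ended.
interface MockSpan { name: string; status: "ok" | "error" | "unset"; ended: boolean; }

async function withSpan<T>(name: string, fn: (span: MockSpan) => Promise<T>): Promise<T> {
  const span: MockSpan = { name, status: "unset", ended: false };
  try {
    const result = await fn(span); // run the wrapped operation
    span.status = "ok";            // success status on normal return
    return result;
  } catch (err) {
    span.status = "error";         // error status on throw
    throw err;                     // rethrow so callers still see the failure
  } finally {
    span.ended = true;             // the span ends on every path
  }
}
```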

### Dashboards and Alerting

Record evaluation runs to populate the dashboard, then query trends and alerts:

```ts
import { getDashboardManager } from "@reaatech/agent-eval-harness-observability";

const dashboard = getDashboardManager({
  alertThresholds: { qualityScore: 0.85, costPerTask: 0.03, latencyP99: 3000, passRate: 0.90 },
});

// Record a run
dashboard.recordRun({
  overallMetrics: { overallScore: 0.82, avgCostPerTask: 0.04, latencyP99: 4500 },
  summary: { totalTrajectories: 100, passRate: 88 },
  metricBreakdown: {},
});

// Check summary
const summary = dashboard.getSummary();
console.log(`Runs: ${summary.totalRuns}`);
console.log(`Score trend: ${summary.trends.score}`);
console.log(`Active alerts: ${summary.activeAlerts}`);

// Inspect alerts
for (const alert of dashboard.getAlerts()) {
  console.log(`[${alert.level}] ${alert.metric}: ${alert.message}`);
}

// Generate full dashboard panels (chart, stat, alert)
const panels = dashboard.generateDashboard();
for (const panel of panels) {
  console.log(`${panel.title} (${panel.type}): ${panel.metrics.length} metrics`);
}
```

### Structured Logging with Context

```ts
import { getLogger, createChildLogger, setGlobalRunId } from "@reaatech/agent-eval-harness-observability";

const logger = getLogger();
setGlobalRunId("eval-run-42");

// All subsequent log lines include run_id: "eval-run-42"
logger.info({ taskType: "password-reset" }, "Starting evaluation");

// Create per-component child loggers
const judgeLogger = createChildLogger({ component: "judge" });
judgeLogger.info({ model: "claude-opus", metric: "faithfulness" }, "Judge evaluating");

// Errors carry stack traces
try {
  await doWork();
} catch (err) {
  logger.logError(err as Error, { taskId: "task-7" });
}
```

### Metrics Batching

```ts
import { getMetricsManager } from "@reaatech/agent-eval-harness-observability";

const metrics = getMetricsManager();

metrics.recordBatchMetrics({
  runs: { status: "success" },
  trajectories: { dataset: "production" },
  judgeCalls: { model: "claude-opus", status: "success" },
  judgeCost: { model: "claude-opus", cost: 0.0234 },
  costPerTask: { taskType: "password-reset", cost: 0.0045 },
  gateResult: { gateName: "overall-quality", passed: true },
  latencyP99: { component: "evaluation", latencyMs: 3200 },
});
```

## Related Packages

| Package | Description |
| --- | --- |
| `@reaatech/agent-eval-harness-types` | Shared domain types and schemas |
| `@reaatech/agent-eval-harness-trajectory` | Trajectory evaluation |
| `@reaatech/agent-eval-harness-tool-use` | Tool-use validation |
| `@reaatech/agent-eval-harness-cost` | Cost tracking |
| `@reaatech/agent-eval-harness-latency` | Latency monitoring |
| `@reaatech/agent-eval-harness-judge` | LLM-as-judge |
| `@reaatech/agent-eval-harness-golden` | Golden trajectories |
| `@reaatech/agent-eval-harness-suite` | Suite runner |
| `@reaatech/agent-eval-harness-gate` | CI gates |
| `@reaatech/agent-eval-harness-mcp-server` | MCP server |
| `@reaatech/agent-eval-harness-cli` | CLI |

## License

MIT