Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 

README.md

@reaatech/agent-eval-harness-tool-use

npm version License: MIT CI

Status: Pre-1.0 — APIs may change in minor versions. Pin to a specific version in production.

Tool-call validation and result verification for agent trajectories. Validates tool selection against schemas, checks argument compliance, detects hallucinated results, and verifies proper result integration into agent responses.

Installation

npm install @reaatech/agent-eval-harness-tool-use

Feature Overview

  • Tool selection validation — checks that the agent picked the right tool for the task
  • Schema compliance — validates tool arguments against JSON Schema or custom ToolSchema definitions
  • Result verification — detects hallucinated results that don't match actual tool output
  • Integration checking — verifies tool results are properly used in agent responses
  • 13 issue types — structured categorization of tool-use problems from critical (missing tool name) to low (result unused)
  • Trajectory-wide summarization — aggregate result verification across all tool calls

Quick Start

import { validateToolCall, createToolSchema, verifyResult } from '@reaatech/agent-eval-harness-tool-use';
import type { ToolCall, Turn } from '@reaatech/agent-eval-harness-types';

const schema = createToolSchema('send_email', {
  properties: { to: { type: 'string', format: 'email' }, subject: { type: 'string' } },
  required: ['to']
});

const call: ToolCall = { name: 'send_email', arguments: { to: 'user@example.com', subject: 'Hi' }, result: { status: 'sent' } };
const turn: Turn = { turn_id: 2, role: 'agent', content: 'Email sent!', timestamp: '2026-04-15T00:00:00Z', tool_calls: [call] };

const validation = validateToolCall(call, schema);
console.log(`Valid: ${validation.valid}, Score: ${validation.score}`);

const verification = verifyResult(call, turn);
console.log(`Hallucinated: ${verification.hallucinated}, Integrated: ${verification.integrated}`);

API Reference

Validation Functions

Export Signature Description
validateTrajectory (trajectory: Trajectory, toolSchemas?: Record<string, ToolSchema>, options?: ValidateOptions) => ValidationResult[] Validates all tool calls across every agent turn in a trajectory. Returns one ValidationResult per agent turn with tool calls.
validateTurn (turn: Turn, toolSchemas?: Record<string, ToolSchema>, options?: ValidateOptions) => ValidationResult Validates all tool calls in a single turn. Handles missing_tool_name, unknown_tool, deprecated_tool, missing_arguments, missing_result, schema violations, and hallucination detection.
validateToolCall (toolCall: ToolCall, schema?: ToolSchema, options?: ValidateOptions) => ValidationResult Validates a single tool call against an optional schema. Convenience wrapper that creates a synthetic turn internally.

Schema Functions

Export Signature Description
validateSchema (toolCall: ToolCall, schema: ToolSchema) => SchemaValidationResult Deep schema validation of tool arguments against a ToolSchema. Checks required fields, types, enums, formats (email, uri, date, date-time), and nested object/array properties.
createToolSchema (name: string, jsonSchema: Record<string, unknown>, description?: string) => ToolSchema Creates a ToolSchema from a JSON Schema-like definition. Converts properties and required arrays into the internal ToolSchema parameter structure.

Result Verification Functions

Export Signature Description
verifyResult (toolCall: ToolCall, turn: Turn, trajectory?: Trajectory, options?: VerifyOptions) => ResultVerificationResult Verifies a single tool call's result against the agent's response. Checks for hallucination, result integration, contradictions, and missing/empty/error results. Accepts optional full trajectory for cross-turn usage detection.
verifyTurnResults (turn: Turn, trajectory?: Trajectory, options?: VerifyOptions) => ResultVerificationResult[] Runs verifyResult on every tool call in a turn. Returns an array of verification results.
summarizeResultVerification (trajectory: Trajectory, options?: VerifyOptions) => { totalTools, validResults, hallucinatedResults, integratedResults, averageScore, issues } Aggregates result verification across an entire trajectory. Returns counts for total tools, valid results, hallucinated results, integrated results, average score, and all issues.

Types

ToolSchema

interface ToolSchema {
  name: string;
  description?: string;
  parameters: {
    type: 'object';
    properties: Record<string, ParameterSchema>;
    required?: string[];
  };
  deprecated?: boolean;
  replacedBy?: string;
}

interface ParameterSchema {
  type: 'string' | 'number' | 'boolean' | 'object' | 'array';
  description?: string;
  enum?: unknown[];
  format?: string;
  items?: ParameterSchema;
  properties?: Record<string, ParameterSchema>;
}

ValidationResult

interface ValidationResult {
  valid: boolean;          // true if no critical issues
  issues: ToolUseIssue[];  // all detected issues
  suggestions: string[];   // remediation suggestions (e.g., deprecated tool replacement)
  score: number;           // 0.0–1.0 weighted by issue severity
}

interface ToolUseIssue {
  type: ToolUseIssueType;
  severity: 'low' | 'medium' | 'high' | 'critical';
  description: string;
  turnId?: number;
  toolName?: string;
  details?: Record<string, unknown>;
}

ValidateOptions

interface ValidateOptions {
  allowUnknownTools?: boolean;   // default: false — set true to skip unknown tool errors
  validateSchemas?: boolean;     // default: true — enable parameter-level schema checks
  checkResultUsage?: boolean;    // default: true — check for unused tool results
  detectHallucination?: boolean; // default: true — check for fabricated result usage
  strict?: boolean;              // default: false — when true, score drops to 0.0 if any high/critical issue
}

SchemaValidationResult

interface SchemaValidationResult {
  valid: boolean;
  issues: SchemaIssue[];
  score: number;
}

interface SchemaIssue {
  type: string;         // e.g., 'missing_arguments', 'type_error', 'invalid_format', 'required_field_missing'
  severity: 'low' | 'medium' | 'high' | 'critical';
  path: string;         // dot-notation path to the problematic parameter
  message: string;
  expected?: unknown;
  actual?: unknown;
}

ResultVerificationResult

interface ResultVerificationResult {
  valid: boolean;
  issues: ResultIssue[];
  score: number;
  hallucinated: boolean;  // true if hallucination score exceeds threshold
  integrated: boolean;    // true if result values appear in the agent response
}

interface ResultIssue {
  type: ResultIssueType;
  severity: 'low' | 'medium' | 'high' | 'critical';
  description: string;
  turnId?: number;
  toolName?: string;
  details?: Record<string, unknown>;
}

VerifyOptions

interface VerifyOptions {
  checkUsage?: boolean;             // default: true — verify result usage in response
  detectHallucination?: boolean;    // default: true — detect fabricated result content
  checkContradictions?: boolean;    // default: true — catch result/response contradictions
  hallucinationThreshold?: number;  // default: 0.3 — score above this triggers hallucinated flag
}

Enums

ToolUseIssueType (13 values)

Value Severity Description
missing_tool_name critical Tool call has no name field
missing_arguments high Tool call has no arguments field
invalid_arguments medium Argument value not in allowed enum
tool_not_found high Tool name not in provided schemas
tool_misuse medium Tool used incorrectly for the context
missing_result medium Tool was called but no result returned
result_unused low Tool result fields not found in agent response
hallucinated_result high Agent response references data not in the actual tool result
schema_violation high Arguments fail schema-level validation
type_mismatch high Argument type does not match schema (e.g., string for number)
missing_required_param high Required parameter missing from arguments
unknown_tool high/medium Tool name not recognized; severity depends on strict mode
deprecated_tool medium Tool is marked as deprecated; suggestion includes replacement

ResultIssueType (8 values)

Value Severity Description
missing_result medium Tool call has no result object
empty_result low Tool returned an empty result ({})
error_result high Result status is 'error'
hallucinated_content high Response contains fabricated data not in the result
unused_result medium Result values not referenced in agent response
contradicts_response high Result indicates success but response says failure (or vice versa)
incomplete_integration medium Only partial result data used in response
malformed_result high Result structure is unexpected or invalid

Related Packages

Package Description
@reaatech/agent-eval-harness-types Shared domain types and schemas
@reaatech/agent-eval-harness-trajectory Trajectory evaluation
@reaatech/agent-eval-harness-tool-use Tool-use validation
@reaatech/agent-eval-harness-cost Cost tracking
@reaatech/agent-eval-harness-latency Latency monitoring
@reaatech/agent-eval-harness-judge LLM-as-judge
@reaatech/agent-eval-harness-golden Golden trajectories
@reaatech/agent-eval-harness-suite Suite runner
@reaatech/agent-eval-harness-gate CI gates
@reaatech/agent-eval-harness-mcp-server MCP server
@reaatech/agent-eval-harness-cli CLI
@reaatech/agent-eval-harness-observability Observability

License

MIT