| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +bo-eval-server is a WebSocket-based server that evaluates LLM agents using an LLM-as-a-judge approach. It accepts connections from agents, sends them evaluation tasks via RPC calls, collects their responses, and has an LLM judge score the quality of each response. |
| 8 | + |
| 9 | +## Commands |
| 10 | + |
| 11 | +### Development |
| 12 | +- `npm start` - Start the WebSocket server |
| 13 | +- `npm run dev` - Start server with file watching for development |
| 14 | +- `npm run cli` - Start interactive CLI for server management and testing |
| 15 | +- `npm test` - Run example agent client for testing |
| 16 | + |
| 17 | +### Installation |
| 18 | +- `npm install` - Install dependencies |
| 19 | +- Copy `.env.example` to `.env` and configure environment variables |
| 20 | + |
| 21 | +### Environment Variables |
| 22 | +- `OPENAI_API_KEY` - OpenAI API key for LLM judge functionality (required) |
| 23 | +- `PORT` - WebSocket server port (optional; default: 8080) |
| 24 | + |
| 25 | +## Architecture |
| 26 | + |
| 27 | +### Core Components |
| 28 | + |
| 29 | +**WebSocket Server** (`src/server.js`) |
| 30 | +- Accepts connections from LLM agents |
| 31 | +- Manages agent lifecycle (connect, ready, disconnect) |
| 32 | +- Orchestrates evaluation sessions |
| 33 | +- Handles bidirectional RPC communication |
| 34 | + |
| 35 | +**RPC Client** (`src/rpc-client.js`) |
| 36 | +- Implements JSON-RPC 2.0 protocol for server-to-client calls |
| 37 | +- Manages request/response correlation with unique IDs |
| 38 | +- Handles timeouts and error conditions |
| 39 | +- Calls the `Evaluate(task: string) -> string` method on connected agents |
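The request/response correlation described above could be sketched roughly as follows. This is a minimal illustration, not the code in `src/rpc-client.js`: the class name `RpcCorrelator`, the injected `send` function, the 30-second default timeout, and the positional `params` array are all assumptions. Only the `jsonrpc`, `id`, `method`, `params`, `result`, and `error` fields come from JSON-RPC 2.0 itself.

```javascript
// Hypothetical sketch: RpcCorrelator, send, and timeoutMs are illustrative names,
// not taken from src/rpc-client.js.
class RpcCorrelator {
  constructor(send, timeoutMs = 30000) {
    this.send = send;           // function that transmits a string to the agent
    this.timeoutMs = timeoutMs; // assumed default; the real value is configurable
    this.nextId = 1;
    this.pending = new Map();   // id -> { resolve, reject, timer }
  }

  // Issue a JSON-RPC 2.0 request and return a promise for its result.
  call(method, params) {
    const id = this.nextId++;
    const message = JSON.stringify({ jsonrpc: '2.0', id, method, params });
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.pending.delete(id);
        reject(new Error(`RPC ${method} (id ${id}) timed out`));
      }, this.timeoutMs);
      this.pending.set(id, { resolve, reject, timer });
      this.send(message);
    });
  }

  // Feed every incoming message here; returns true if it settled a pending call.
  handleMessage(raw) {
    let msg;
    try {
      msg = JSON.parse(raw);
    } catch {
      return false; // not JSON, so it cannot be a response
    }
    if (msg === null || typeof msg !== 'object') return false;
    const entry = this.pending.get(msg.id);
    if (!entry) return false; // unknown or already-settled id
    clearTimeout(entry.timer);
    this.pending.delete(msg.id);
    if (msg.error) entry.reject(new Error(msg.error.message));
    else entry.resolve(msg.result);
    return true;
  }
}
```

In a setup like this, the server would pass each agent's socket send function as `send` and route every incoming frame through `handleMessage` before treating it as anything else.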
| 40 | + |
| 41 | +**LLM Evaluator** (`src/evaluator.js`) |
| 42 | +- Integrates with OpenAI API for LLM-as-a-judge functionality |
| 43 | +- Evaluates agent responses on multiple criteria (correctness, completeness, clarity, relevance, helpfulness) |
| 44 | +- Returns structured JSON evaluation with scores and reasoning |
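As a rough illustration of what "structured JSON evaluation" might involve, the sketch below validates a judge reply against the five criteria listed above. The reply shape (`scores`, `reasoning`), the 1 to 10 score range, and the function name `parseJudgeReply` are assumptions for the example; the actual format produced by `src/evaluator.js` is not specified here.

```javascript
// Criteria named in this document; the score range and JSON field names are assumed.
const CRITERIA = ['correctness', 'completeness', 'clarity', 'relevance', 'helpfulness'];

// Parse the judge's raw text reply and check that every criterion has a numeric score.
// Assumed (hypothetical) shape: { "scores": { "correctness": 8, ... }, "reasoning": "..." }
function parseJudgeReply(text, minScore = 1, maxScore = 10) {
  const reply = JSON.parse(text); // throws SyntaxError on malformed JSON
  const scores = (reply && reply.scores) || {};
  for (const name of CRITERIA) {
    const value = scores[name];
    if (!Number.isFinite(value) || value < minScore || value > maxScore) {
      throw new Error(`Missing or out-of-range score for "${name}"`);
    }
  }
  const total = CRITERIA.reduce((sum, name) => sum + scores[name], 0);
  return {
    scores,
    reasoning: String((reply && reply.reasoning) || ''),
    average: total / CRITERIA.length,
  };
}
```

Validating the reply before logging it keeps malformed judge output from silently entering the results file.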
| 45 | + |
| 46 | +**Logger** (`src/logger.js`) |
| 47 | +- Structured logging using Winston |
| 48 | +- Separate log files for different event types |
| 49 | +- JSON format for easy parsing and analysis |
| 50 | +- Logs all RPC calls, evaluations, and connection events |
| 51 | + |
| 52 | +### Evaluation Flow |
| 53 | + |
| 54 | +1. Agent connects to WebSocket server |
| 55 | +2. Agent sends "ready" signal |
| 56 | +3. Server calls agent's `Evaluate` method with a task |
| 57 | +4. Agent processes task and returns response |
| 58 | +5. Server sends response to LLM judge for evaluation |
| 59 | +6. Results are logged as JSON with scores and detailed feedback |
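Steps 3 through 6 of the flow above can be sketched as a single function. This is an illustration under stated assumptions: `runEvaluation`, `callAgent`, `judge`, and `appendResult` are hypothetical names injected for the example, and the record fields are invented rather than taken from the repository.

```javascript
// Hypothetical sketch of steps 3-6; callAgent, judge, and appendResult are
// injected stand-ins, not functions from this repository.
async function runEvaluation(task, { callAgent, judge, appendResult }) {
  const startedAt = Date.now();
  // Steps 3-4: call the agent's Evaluate method and wait for its response.
  const response = await callAgent('Evaluate', [task]);
  // Step 5: hand the task and response to the LLM judge.
  const evaluation = await judge(task, response);
  // Step 6: log the result as a single JSON object.
  const record = { task, response, evaluation, durationMs: Date.now() - startedAt };
  appendResult(JSON.stringify(record));
  return record;
}
```

Injecting the three collaborators keeps the orchestration testable without a live WebSocket connection or an OpenAI API key.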
| 60 | + |
| 61 | +### Project Structure |
| 62 | + |
| 63 | +``` |
| 64 | +src/ |
| 65 | +├── server.js # Main WebSocket server and evaluation orchestration |
| 66 | +├── rpc-client.js # JSON-RPC client for calling agent methods |
| 67 | +├── evaluator.js # LLM judge integration (OpenAI) |
| 68 | +├── logger.js # Structured logging and result storage |
| 69 | +├── config.js # Configuration management |
| 70 | +└── cli.js # Interactive CLI for testing and management |
| 71 | + |
| 72 | +logs/ # Log files (created automatically) |
| 73 | +├── combined.log # All log events |
| 74 | +├── error.log # Error events only |
| 75 | +└── evaluations.jsonl # Evaluation results in JSON Lines format |
| 76 | +``` |
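Because `evaluations.jsonl` uses the JSON Lines format (one JSON object per line), its contents can be loaded with a few lines of code. The helper name `parseJsonLines` is hypothetical, and the sketch makes no assumption about which fields each record contains.

```javascript
// Parse JSON Lines text: one JSON value per non-empty line.
function parseJsonLines(text) {
  return text
    .split('\n')
    .map((line) => line.trim())          // also drops a trailing "\r" from CRLF files
    .filter((line) => line.length > 0)   // skip blank lines, e.g. after the final newline
    .map((line) => JSON.parse(line));    // throws SyntaxError on a malformed line
}
```

To use it on the real file, read `logs/evaluations.jsonl` as UTF-8 text (for example with `fs.readFileSync`) and pass the string in.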
| 77 | + |
| 78 | +### Key Features |
| 79 | + |
| 80 | +- **Bidirectional RPC**: Server can call methods on connected clients |
| 81 | +- **LLM-as-a-Judge**: Automated evaluation of agent responses using GPT-4 |
| 82 | +- **Concurrent Evaluations**: Support for multiple agents and parallel evaluations |
| 83 | +- **Structured Logging**: All interactions logged as JSON for analysis |
| 84 | +- **Interactive CLI**: Built-in CLI for testing and server management |
| 85 | +- **Connection Management**: Robust handling of agent connections and disconnections |
| 86 | +- **Timeout Handling**: Configurable timeouts for RPC calls and evaluations |
| 87 | + |
| 88 | +### Agent Protocol |
| 89 | + |
| 90 | +Agents must implement: |
| 91 | +- WebSocket connection to server |
| 92 | +- JSON-RPC 2.0 protocol support |
| 93 | +- `Evaluate(task: string) -> string` method |
| 94 | +- "ready" message to signal availability for evaluations |
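On the agent side, handling an incoming JSON-RPC 2.0 request might look like the sketch below. The function names `handleRpcRequest` and `rpcError` and the `methods` table are hypothetical, and the handler is synchronous for brevity; a real agent's `Evaluate` would likely be asynchronous. The error codes are the standard ones from the JSON-RPC 2.0 specification. The format of the "ready" message is not specified in this document, so it is left out.

```javascript
// Build a JSON-RPC 2.0 error response string.
function rpcError(id, code, message) {
  return JSON.stringify({ jsonrpc: '2.0', id, error: { code, message } });
}

// Hypothetical agent-side dispatcher; `methods` maps method names to functions.
function handleRpcRequest(raw, methods) {
  let request;
  try {
    request = JSON.parse(raw);
  } catch {
    return rpcError(null, -32700, 'Parse error');
  }
  if (request === null || typeof request !== 'object') {
    return rpcError(null, -32600, 'Invalid Request');
  }
  const id = request.id !== undefined ? request.id : null;
  if (!Object.prototype.hasOwnProperty.call(methods, request.method)) {
    return rpcError(id, -32601, 'Method not found');
  }
  // Positional params are spread; a named-params object is passed as one argument.
  const params = Array.isArray(request.params) ? request.params : [request.params];
  try {
    const result = methods[request.method](...params);
    return JSON.stringify({ jsonrpc: '2.0', id, result });
  } catch (err) {
    return rpcError(id, -32603, String((err && err.message) || err));
  }
}
```

An agent would call this for each message received from the server and send the returned string back over the same WebSocket connection.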
| 95 | + |
| 96 | +### Configuration |
| 97 | + |
| 98 | +All configuration is managed through environment variables and `src/config.js`. Key settings: |
| 99 | +- Server port and host |
| 100 | +- OpenAI API configuration |
| 101 | +- RPC timeouts |
| 102 | +- Logging levels and directories |
| 103 | +- Maximum concurrent evaluations |
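A configuration module covering the settings above might be structured like this sketch. Only `OPENAI_API_KEY` and `PORT` (default 8080) come from this document; every other variable name (`HOST`, `RPC_TIMEOUT_MS`, `LOG_LEVEL`, `LOG_DIR`, `MAX_CONCURRENT_EVALUATIONS`) and every other default value is an assumption made for illustration, not the contents of `src/config.js`.

```javascript
// Hypothetical sketch; only OPENAI_API_KEY and PORT (default 8080) come from this
// document. Every other variable name and default below is an assumption.
function loadConfig(env = process.env) {
  if (!env.OPENAI_API_KEY) {
    throw new Error('OPENAI_API_KEY is required for the LLM judge');
  }
  // Parse an integer setting, falling back when it is unset or not a number.
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value, 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };
  return {
    server: { port: toInt(env.PORT, 8080), host: env.HOST || '127.0.0.1' },
    openai: { apiKey: env.OPENAI_API_KEY },
    rpc: { timeoutMs: toInt(env.RPC_TIMEOUT_MS, 30000) },
    logging: { level: env.LOG_LEVEL || 'info', dir: env.LOG_DIR || './logs' },
    maxConcurrentEvaluations: toInt(env.MAX_CONCURRENT_EVALUATIONS, 10),
  };
}
```

Failing fast on a missing API key surfaces a misconfigured `.env` at startup rather than during the first evaluation.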