
Feat/evals improvements #32


Draft · wants to merge 6 commits into base: feature/ui-tracing-enhancements
4 changes: 0 additions & 4 deletions config/gni/devtools_grd_files.gni
@@ -608,10 +608,6 @@ grd_files_bundled_sources = [
"front_end/panels/ai_chat/ui/PromptEditDialog.js",
"front_end/panels/ai_chat/ui/SettingsDialog.js",
"front_end/panels/ai_chat/ui/EvaluationDialog.js",
"front_end/panels/ai_chat/ui/components/TracingConfig.js",
"front_end/panels/ai_chat/ui/components/EvaluationConfig.js",
"front_end/panels/ai_chat/ui/components/VectorDatabaseConfig.js",
"front_end/panels/ai_chat/ui/components/ProviderConfig.js",
"front_end/panels/ai_chat/core/AgentService.js",
"front_end/panels/ai_chat/core/State.js",
"front_end/panels/ai_chat/core/Graph.js",
3 changes: 2 additions & 1 deletion eval-server/.gitignore
@@ -1,2 +1,3 @@
.env
node_modules
node_modules
*.log
266 changes: 219 additions & 47 deletions eval-server/README.md
@@ -1,67 +1,239 @@
# bo-eval-server
# Eval-Server

A WebSocket-based evaluation server for LLM agents using LLM-as-a-judge methodology.
A WebSocket-based evaluation server for LLM agents with multiple language implementations.

## Quick Start
## Overview

This directory contains two implementations of the bo-eval-server that share the same core WebSocket protocol:

- **NodeJS** (`nodejs/`) - Full-featured implementation with YAML evaluations, HTTP API, CLI, and judge system
- **Python** (`python/`) - Minimal library focused on core WebSocket functionality and programmatic evaluation creation

1. **Install dependencies**
```bash
npm install
```
Both implementations provide:
- 🔌 **WebSocket Server** - Real-time agent connections
- 🤖 **Bidirectional RPC** - JSON-RPC 2.0 for calling agent methods
- 📚 **Programmatic API** - Create and manage evaluations in code
- ⚡ **Concurrent Support** - Handle multiple agents simultaneously
- 📊 **Structured Logging** - Comprehensive evaluation tracking

## Quick Start

2. **Configure environment**
```bash
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
```
### NodeJS (Full Featured)

3. **Start the server**
```bash
npm start
```
The NodeJS implementation includes YAML evaluation loading, HTTP API wrapper, CLI tools, and LLM-as-a-judge functionality.

4. **Use interactive CLI** (alternative to step 3)
```bash
npm run cli
```
```bash
cd nodejs/
npm install
npm start
```

## Features
**Key Features:**
- YAML evaluation file loading
- HTTP API wrapper for REST integration
- Interactive CLI for management
- LLM judge system for response evaluation
- Comprehensive documentation and examples

- 🔌 WebSocket server for real-time agent connections
- 🤖 Bidirectional RPC calls to connected agents
- ⚖️ LLM-as-a-judge evaluation using OpenAI GPT-4
- 📊 Structured JSON logging of all evaluations
- 🖥️ Interactive CLI for testing and management
- ⚡ Support for concurrent agent evaluations
See [`nodejs/README.md`](nodejs/README.md) for detailed usage.

## OpenAI Compatible API
### Python (Lightweight Library)

The server provides an OpenAI-compatible `/v1/responses` endpoint for direct API access:
The Python implementation focuses on core WebSocket functionality with programmatic evaluation creation.

```bash
curl -X POST 'http://localhost:8081/v1/responses' \
-H 'Content-Type: application/json' \
-d '{
"input": "What is 2+2?",
"main_model": "gpt-4.1",
"mini_model": "gpt-4.1-nano",
"nano_model": "gpt-4.1-nano",
"provider": "openai"
}'
cd python/
pip install -e .
python examples/basic_server.py
```

**Model Precedence:**
1. **API calls** OR **individual test YAML models** (highest priority)
2. **config.yaml defaults** (fallback when neither API nor test specify models)
**Key Features:**
- Minimal dependencies (websockets, loguru)
- Full async/await support
- Evaluation stack for LIFO queuing
- Type hints throughout
- Clean Pythonic API

See [`python/README.md`](python/README.md) for detailed usage.

## Architecture Comparison

| Feature | NodeJS | Python |
|---------|--------|--------|
| **Core WebSocket Server** | ✅ | ✅ |
| **JSON-RPC 2.0** | ✅ | ✅ |
| **Client Management** | ✅ | ✅ |
| **Programmatic Evaluations** | ✅ | ✅ |
| **Evaluation Stack** | ✅ | ✅ |
| **Structured Logging** | ✅ (Winston) | ✅ (Loguru) |
| **YAML Evaluations** | ✅ | ❌ |
| **HTTP API Wrapper** | ✅ | ❌ |
| **CLI Interface** | ✅ | ❌ |
| **LLM Judge System** | ✅ | ❌ |
| **Type System** | TypeScript | Type Hints |

## Choosing an Implementation

**Choose NodeJS if you need:**
- YAML-based evaluation definitions
- HTTP REST API endpoints
- Interactive CLI for management
- LLM-as-a-judge evaluation
- Comprehensive feature set

**Choose Python if you need:**
- Minimal dependencies
- Pure programmatic approach
- Integration with Python ML pipelines
- Modern async/await patterns
- Lightweight deployment

## Agent Protocol

Your agent needs to:
Both implementations use the same WebSocket protocol:

### 1. Connect to WebSocket
```javascript
// NodeJS
const ws = new WebSocket('ws://localhost:8080');

// Python
import websockets
ws = await websockets.connect('ws://localhost:8080')
```

### 2. Send Registration
```json
{
"type": "register",
"clientId": "your-client-id",
"secretKey": "your-secret-key",
"capabilities": ["chat", "action"]
}
```

### 3. Send Ready Signal
```json
{
"type": "ready"
}
```

### 4. Handle RPC Calls
Both implementations send JSON-RPC 2.0 requests with the `evaluate` method:

```json
{
"jsonrpc": "2.0",
"method": "evaluate",
"params": {
"id": "eval_001",
"name": "Test Evaluation",
"tool": "chat",
"input": {"message": "Hello world"}
},
"id": "unique-call-id"
}
```

Agents should respond with:
```json
{
"jsonrpc": "2.0",
"id": "unique-call-id",
"result": {
"status": "completed",
"output": {"response": "Hello! How can I help you?"}
}
}
```
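
Putting the four steps together, a complete agent loop is small. The sketch below is illustrative only (Python, using the `websockets` package): the URL, message fields, and credentials simply mirror the protocol messages above, and the echo response is a placeholder for real agent logic.

```python
import asyncio
import json

import websockets


async def run_agent():
    # Step 1: connect to the evaluation server (default URL from the protocol above)
    async with websockets.connect('ws://localhost:8080') as ws:
        # Step 2: register this agent (placeholder credentials)
        await ws.send(json.dumps({
            "type": "register",
            "clientId": "your-client-id",
            "secretKey": "your-secret-key",
            "capabilities": ["chat", "action"],
        }))
        # Step 3: signal readiness to receive evaluations
        await ws.send(json.dumps({"type": "ready"}))

        # Step 4: answer JSON-RPC "evaluate" calls as they arrive
        async for raw in ws:
            message = json.loads(raw)
            if message.get("method") == "evaluate":
                params = message["params"]
                # Placeholder logic: echo the evaluation input back to the server
                await ws.send(json.dumps({
                    "jsonrpc": "2.0",
                    "id": message["id"],
                    "result": {
                        "status": "completed",
                        "output": {"response": f"Received: {params['input']}"},
                    },
                }))


asyncio.run(run_agent())
```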

## Examples

### NodeJS Example
```javascript
import { EvalServer } from 'bo-eval-server';

const server = new EvalServer({
authKey: 'secret',
port: 8080
});

server.onConnect(async client => {
const result = await client.evaluate({
id: "test",
name: "Hello World",
tool: "chat",
input: {message: "Hi there!"}
});
console.log(result);
});

await server.start();
```

### Python Example
```python
import asyncio
from bo_eval_server import EvalServer

async def main():
server = EvalServer(
auth_key='secret',
port=8080
)

@server.on_connect
async def handle_client(client):
result = await client.evaluate({
"id": "test",
"name": "Hello World",
"tool": "chat",
"input": {"message": "Hi there!"}
})
print(result)

await server.start()
await server.wait_closed()

asyncio.run(main())
```
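
Since both servers handle multiple connections, evaluations can also be dispatched in parallel. The sketch below is a variation on the Python example above and reuses only the `EvalServer` API shown there; it assumes the connected agent can handle overlapping `evaluate` calls, and uses `asyncio.gather` to run a small batch per client concurrently.

```python
import asyncio
from bo_eval_server import EvalServer

# A small batch of evaluations, using the same fields as the protocol above
EVALS = [
    {"id": "greet", "name": "Greeting", "tool": "chat", "input": {"message": "Hi there!"}},
    {"id": "math", "name": "Arithmetic", "tool": "chat", "input": {"message": "What is 2+2?"}},
]

async def main():
    server = EvalServer(auth_key='secret', port=8080)

    @server.on_connect
    async def handle_client(client):
        # Dispatch all evaluations for this client concurrently
        results = await asyncio.gather(*(client.evaluate(e) for e in EVALS))
        for evaluation, result in zip(EVALS, results):
            print(evaluation["id"], result)

    await server.start()
    await server.wait_closed()

asyncio.run(main())
```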

## Development

Each implementation has its own development setup:

**NodeJS:**
```bash
cd nodejs/
npm install
npm run dev # Watch mode
npm test # Run tests
npm run cli # Interactive CLI
```

**Python:**
```bash
cd python/
pip install -e ".[dev]"
pytest # Run tests
black . # Format code
mypy src/ # Type checking
```

## Contributing

When contributing to either implementation:

1. Maintain API compatibility between versions where possible
2. Update documentation for both implementations when adding shared features
3. Follow the existing code style and patterns
4. Add appropriate tests and examples

## License

1. Connect to the WebSocket server (default: `ws://localhost:8080`)
2. Send a `{"type": "ready"}` message when ready for evaluations
3. Implement the `Evaluate` RPC method that accepts a string task and returns a string response
MIT License - see individual implementation directories for details.

## For more details
---

See [CLAUDE.md](./CLAUDE.md) for comprehensive documentation of the architecture and implementation.
Both implementations provide robust, production-ready evaluation servers for LLM agents with different feature sets optimized for different use cases.
File renamed without changes.