Commit 05e5a47

Feat/external eval server part 2 (#28)
This pull request implements a comprehensive refactoring of the external evaluation server infrastructure (part 2), introducing modular configuration components, enhanced tracing integration, and improved error handling across the system.

**Key changes:**
- Refactored settings dialog into modular components (VectorDatabaseConfig, TracingConfig, EvaluationConfig, ProviderConfig)
- Enhanced evaluation agent with better tracing context management and chat evaluation support
- Streamlined server configuration with model precedence handling and OpenAI Responses API compatibility
1 parent 4e3d99c commit 05e5a47

File tree

112 files changed

+4859
-420
lines changed

config/gni/devtools_grd_files.gni

Lines changed: 6 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -608,6 +608,10 @@ grd_files_bundled_sources = [
608608
"front_end/panels/ai_chat/ui/PromptEditDialog.js",
609609
"front_end/panels/ai_chat/ui/SettingsDialog.js",
610610
"front_end/panels/ai_chat/ui/EvaluationDialog.js",
611+
"front_end/panels/ai_chat/ui/components/TracingConfig.js",
612+
"front_end/panels/ai_chat/ui/components/EvaluationConfig.js",
613+
"front_end/panels/ai_chat/ui/components/VectorDatabaseConfig.js",
614+
"front_end/panels/ai_chat/ui/components/ProviderConfig.js",
611615
"front_end/panels/ai_chat/core/AgentService.js",
612616
"front_end/panels/ai_chat/core/State.js",
613617
"front_end/panels/ai_chat/core/Graph.js",
@@ -650,8 +654,8 @@ grd_files_bundled_sources = [
650654
"front_end/panels/ai_chat/common/page.js",
651655
"front_end/panels/ai_chat/common/WebSocketRPCClient.js",
652656
"front_end/panels/ai_chat/common/EvaluationConfig.js",
653-
"front_end/panels/ai_chat/evaluation/EvaluationProtocol.js",
654-
"front_end/panels/ai_chat/evaluation/EvaluationAgent.js",
657+
"front_end/panels/ai_chat/evaluation/remote/EvaluationProtocol.js",
658+
"front_end/panels/ai_chat/evaluation/remote/EvaluationAgent.js",
655659
"front_end/panels/ai_chat/tracing/TracingProvider.js",
656660
"front_end/panels/ai_chat/tracing/LangfuseProvider.js",
657661
"front_end/panels/ai_chat/tracing/TracingConfig.js",

eval-server/README.md

Lines changed: 20 additions & 0 deletions
@@ -34,6 +34,26 @@ A WebSocket-based evaluation server for LLM agents using LLM-as-a-judge methodol
3434
- 🖥️ Interactive CLI for testing and management
3535
- ⚡ Support for concurrent agent evaluations
3636

37+
## OpenAI Compatible API
38+
39+
The server provides an OpenAI-compatible `/v1/responses` endpoint for direct API access:
40+
41+
```bash
42+
curl -X POST 'http://localhost:8081/v1/responses' \
43+
-H 'Content-Type: application/json' \
44+
-d '{
45+
"input": "What is 2+2?",
46+
"main_model": "gpt-4.1",
47+
"mini_model": "gpt-4.1-nano",
48+
"nano_model": "gpt-4.1-nano",
49+
"provider": "openai"
50+
}'
51+
```
52+
53+
**Model Precedence:**
54+
1. **API calls** OR **individual test YAML models** (highest priority)
55+
2. **config.yaml defaults** (fallback when neither API nor test specify models)
56+
3757
## Agent Protocol
3858

3959
Your agent needs to:
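
Review note on the **Model Precedence** list in the README hunk above: models supplied by an API call or by an individual test YAML take priority, and `config.yaml` defaults apply only when neither specifies models. A minimal sketch of that rule follows. It assumes the fallback is applied per tier (`main_model`, `mini_model`, `nano_model`), which the README does not state, and `resolve_models` is a hypothetical name, not a function from the server.

```python
from typing import Mapping, Optional

# Tier names taken from the curl example in the README hunk.
MODEL_TIERS = ("main_model", "mini_model", "nano_model")


def resolve_models(
    explicit: Optional[Mapping[str, str]],
    config_defaults: Mapping[str, str],
) -> dict:
    """Pick a model per tier: explicit (API call or test YAML) first, then config.yaml.

    Assumption: precedence is resolved tier by tier. The README only says that
    config.yaml is the fallback when neither the API nor the test specifies models.
    """
    explicit = explicit or {}
    resolved = {}
    for tier in MODEL_TIERS:
        # An empty or missing explicit value falls through to the config default.
        value = explicit.get(tier) or config_defaults.get(tier)
        if value is None:
            raise ValueError(f"no model configured for tier '{tier}'")
        resolved[tier] = value
    return resolved


if __name__ == "__main__":
    defaults = {
        "main_model": "gpt-4.1",
        "mini_model": "gpt-4.1-mini",
        "nano_model": "gpt-4.1-nano",
    }
    # The request overrides only the mini tier; the other tiers use the defaults.
    print(resolve_models({"mini_model": "gpt-4.1-nano"}, defaults))
```

If the server instead treats the explicit block as all-or-nothing, the loop would collapse to a single `explicit or config_defaults` choice; the README wording allows either reading.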

eval-server/docs/TRIGGERING_EVALUATIONS.md

Lines changed: 3 additions & 31 deletions
@@ -98,36 +98,8 @@ curl -X POST http://localhost:8081/evaluate \\
9898
}'
9999
```
100100

101-
## Method 3: Automatic Scheduling (YAML Configuration)
102101

103-
Evaluations can be configured to run automatically based on their schedule in the YAML file.
104-
105-
### Schedule Types
106-
107-
#### On-Demand (Manual Only)
108-
```yaml
109-
schedule:
110-
type: "on_demand"
111-
```
112-
Only runs when manually triggered.
113-
114-
#### Periodic (Automatic)
115-
```yaml
116-
schedule:
117-
type: "periodic"
118-
interval: 86400000 # Run every 24 hours (in milliseconds)
119-
```
120-
Runs automatically at the specified interval.
121-
122-
#### One-Time (Automatic)
123-
```yaml
124-
schedule:
125-
type: "once"
126-
run_at: "2024-12-25T09:00:00Z" # Run once at specific time
127-
```
128-
Runs once at the specified time.
129-
130-
## Method 4: Programmatic Integration
102+
## Method 3: Programmatic Integration
131103

132104
You can integrate the evaluation system into your own applications:
133105

@@ -186,7 +158,7 @@ result = trigger_evaluation(
186158
print(json.dumps(result, indent=2))
187159
```
188160

189-
## Method 5: Webhook Integration
161+
## Method 4: Webhook Integration
190162

191163
You can set up webhooks to trigger evaluations from external systems:
192164

@@ -299,7 +271,7 @@ WebSocket connection failed
299271

300272
## Best Practices
301273

302-
1. **Start Simple**: Begin with on-demand evaluations before setting up automation
274+
1. **Start Simple**: Begin with manual evaluations before setting up automation
303275
2. **Monitor Logs**: Always monitor logs when running evaluations
304276
3. **Test Connections**: Use the `status` command to verify everything is connected
305277
4. **Gradual Rollout**: Test individual evaluations before running batch operations

eval-server/docs/YAML_SCHEMA.md

Lines changed: 0 additions & 13 deletions
@@ -73,13 +73,6 @@ Each evaluation in the `evaluations` array follows this structure:
7373
summary:
7474
type: "string"
7575

76-
# Scheduling configuration
77-
schedule:
78-
type: "on_demand" # on_demand|periodic|once
79-
# For periodic:
80-
interval: 3600000 # Interval in milliseconds
81-
# For once:
82-
run_at: "2024-01-01T00:00:00Z" # ISO timestamp
8376

8477
# Validation configuration
8578
validation:
@@ -233,9 +226,6 @@ evaluations:
233226
lastModified:
234227
type: "string"
235228

236-
schedule:
237-
type: "periodic"
238-
interval: 86400000 # Daily
239229

240230
validation:
241231
type: "hybrid"
@@ -275,8 +265,6 @@ evaluations:
275265
include_sources: true
276266
depth: "moderate"
277267

278-
schedule:
279-
type: "on_demand"
280268

281269
validation:
282270
type: "llm-judge"
@@ -301,7 +289,6 @@ evaluations:
301289
3. **Tool names**: Must match registered tools in the client
302290
4. **URLs**: Must be valid HTTP/HTTPS URLs
303291
5. **Timeouts**: Must be positive integers (milliseconds)
304-
6. **Schedule intervals**: Must be at least 60000ms (1 minute)
305292
306293
## YAML Best Practices
307294
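
Review note on the validation rules that remain in YAML_SCHEMA.md after this hunk: rule 4 (URLs must be valid HTTP/HTTPS URLs) and rule 5 (timeouts must be positive integers in milliseconds) can be checked mechanically. The sketch below is one way to do that with the standard library. `check_url` and `check_timeout` are hypothetical helper names, and the server's real validator may apply stricter or different checks.

```python
from urllib.parse import urlparse


def check_url(url: str) -> bool:
    """Rule 4: accept only http/https URLs that include a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_timeout(value: object) -> bool:
    """Rule 5: accept only positive integers (milliseconds).

    bool is a subclass of int in Python, so it is excluded explicitly.
    """
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


if __name__ == "__main__":
    print(check_url("https://jqueryui.com/accordion/"))
    print(check_timeout(60000))
```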
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
1+
# Accessibility action test
2+
id: "a11y-001"
3+
name: "Click Using ARIA Label"
4+
description: "Test clicking an element identified primarily by ARIA attributes"
5+
enabled: true
6+
7+
target:
8+
url: "https://www.w3.org/WAI/ARIA/apg/patterns/button/examples/button/"
9+
wait_for: "networkidle"
10+
wait_timeout: 5000
11+
12+
tool: "action_agent"
13+
timeout: 60000
14+
15+
input:
16+
objective: "Click the button with aria-label \"Print Page\""
17+
reasoning: "Testing action selection using accessibility attributes"
18+
19+
validation:
20+
type: "llm-judge"
21+
llm_judge:
22+
model: "gpt-4.1-mini"
23+
temperature: 0.3
24+
criteria:
25+
- "Used accessibility tree to find elements"
26+
- "Correctly identified element by ARIA label"
27+
- "Successfully clicked the target button"
28+
- "Demonstrated understanding of accessibility attributes"
29+
- "No reliance on visual appearance alone"
30+
visual_verification:
31+
enabled: true
32+
capture_before: true
33+
capture_after: true
34+
prompts:
35+
- "Verify the Print Page button was successfully clicked"
36+
- "Check if any print dialog or print preview appeared"
37+
- "Confirm the button showed visual feedback (pressed state)"
38+
- "Ensure the action was performed on the correct accessibility-labeled element"
39+
40+
metadata:
41+
tags: ["action", "accessibility", "aria", "click", "a11y"]
42+
priority: "high"
43+
timeout: 60000
44+
retries: 2
45+
flaky: false
46+
owner: "devtools-team"
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
1+
# Accordion expansion test
2+
id: "accordion-001"
3+
name: "Expand Accordion Section"
4+
description: "Test clicking to expand an accordion panel"
5+
enabled: true
6+
7+
target:
8+
url: "https://jqueryui.com/accordion/"
9+
wait_for: "networkidle"
10+
wait_timeout: 5000
11+
12+
tool: "action_agent"
13+
timeout: 60000
14+
15+
input:
16+
objective: "Click to expand the \"Section 2\" accordion panel"
17+
reasoning: "Testing accordion expand/collapse interaction"
18+
19+
validation:
20+
type: "llm-judge"
21+
llm_judge:
22+
model: "gpt-4.1-mini"
23+
temperature: 0.3
24+
criteria:
25+
- "Located the Section 2 accordion header"
26+
- "Successfully clicked to expand the section"
27+
- "Section 2 content became visible"
28+
- "Other sections collapsed appropriately"
29+
- "Accordion animation completed smoothly"
30+
visual_verification:
31+
enabled: true
32+
capture_before: true
33+
capture_after: true
34+
prompts:
35+
- "Verify Section 2 is now expanded and content visible"
36+
- "Check if other accordion sections collapsed"
37+
- "Confirm the expansion animation completed"
38+
- "Ensure Section 2 header shows expanded state"
39+
40+
metadata:
41+
tags: ["action", "accordion", "expand", "collapse", "ui"]
42+
priority: "high"
43+
timeout: 60000
44+
retries: 2
45+
flaky: false
46+
owner: "devtools-team"
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
1+
# Autocomplete search test
2+
id: "autocomplete-001"
3+
name: "Use Autocomplete Search"
4+
description: "Test typing in autocomplete field and selecting from suggestions"
5+
enabled: true
6+
7+
target:
8+
url: "https://jqueryui.com/autocomplete/"
9+
wait_for: "networkidle"
10+
wait_timeout: 5000
11+
12+
tool: "action_agent"
13+
timeout: 60000
14+
15+
input:
16+
objective: "Type \"Java\" in the autocomplete field and select \"JavaScript\" from suggestions"
17+
reasoning: "Testing autocomplete/typeahead interaction patterns"
18+
19+
validation:
20+
type: "llm-judge"
21+
llm_judge:
22+
model: "gpt-4.1-mini"
23+
temperature: 0.3
24+
criteria:
25+
- "Located the autocomplete input field"
26+
- "Typed \"Java\" to trigger suggestions"
27+
- "Autocomplete dropdown appeared with suggestions"
28+
- "Selected \"JavaScript\" from the suggestion list"
29+
- "Input field shows the selected value"
30+
visual_verification:
31+
enabled: true
32+
capture_before: true
33+
capture_after: true
34+
prompts:
35+
- "Verify \"JavaScript\" appears in the input field"
36+
- "Check if autocomplete suggestions appeared"
37+
- "Confirm the correct suggestion was selected"
38+
- "Ensure dropdown closed after selection"
39+
40+
metadata:
41+
tags: ["action", "autocomplete", "typeahead", "search", "suggestions"]
42+
priority: "high"
43+
timeout: 60000
44+
retries: 2
45+
flaky: false
46+
owner: "devtools-team"
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
1+
# Checkbox/radio button test
2+
id: "checkbox-001"
3+
name: "Toggle Newsletter Checkbox"
4+
description: "Test clicking checkbox elements for form options"
5+
enabled: true
6+
7+
target:
8+
url: "https://www.w3schools.com/html/tryit.asp?filename=tryhtml_checkbox"
9+
wait_for: "networkidle"
10+
wait_timeout: 5000
11+
12+
tool: "action_agent"
13+
timeout: 45000
14+
15+
input:
16+
objective: "Click the checkbox labeled \"I have a bike\" to check it"
17+
reasoning: "Testing interaction with checkbox form elements"
18+
19+
validation:
20+
type: "llm-judge"
21+
llm_judge:
22+
model: "gpt-4.1-mini"
23+
temperature: 0.3
24+
criteria:
25+
- "Identified the correct checkbox among multiple options"
26+
- "Used click action on the checkbox element"
27+
- "Checkbox state changed from unchecked to checked"
28+
- "Handled the iframe structure if present"
29+
- "No errors with form element interaction"
30+
visual_verification:
31+
enabled: true
32+
capture_before: true
33+
capture_after: true
34+
prompts:
35+
- "Compare screenshots to verify the checkbox state changed from unchecked to checked"
36+
- "Confirm the \"I have a bike\" checkbox now shows a checkmark"
37+
- "Verify the checkbox visual indicator (checkmark) is clearly visible"
38+
- "Ensure no other checkboxes were accidentally modified"
39+
40+
metadata:
41+
tags: ["action", "checkbox", "form", "w3schools", "input"]
42+
priority: "high"
43+
timeout: 60000
44+
retries: 2
45+
flaky: false
46+
owner: "devtools-team"
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
1+
# Toggle checkbox test - using HTML form test site
2+
id: "checkbox-002"
3+
name: "Check Extra Cheese Checkbox"
4+
description: "Test checking a specific checkbox using the check method"
5+
enabled: true
6+
7+
target:
8+
url: "https://httpbin.org/forms/post"
9+
wait_for: "networkidle"
10+
wait_timeout: 5000
11+
12+
tool: "action_agent"
13+
timeout: 45000
14+
15+
input:
16+
objective: "Find and check the \"Extra Cheese\" checkbox in the Pizza Toppings section"
17+
reasoning: "Testing checkbox interaction functionality using check method"
18+
hint: "Look for the Extra Cheese checkbox and use the check method to select it"
19+
20+
validation:
21+
type: "llm-judge"
22+
llm_judge:
23+
model: "gpt-4.1-mini"
24+
temperature: 0.3
25+
criteria:
26+
- "Located the Extra Cheese checkbox in the Pizza Toppings section"
27+
- "Used the check method instead of click for better reliability"
28+
- "Checkbox became checked (if it wasn't already)"
29+
- "No errors occurred during checkbox interaction"
30+
- "Form maintained its structure after checkbox selection"
31+
visual_verification:
32+
enabled: true
33+
capture_before: true
34+
capture_after: true
35+
prompts:
36+
- "Verify the Extra Cheese checkbox is now checked (shows checkmark)"
37+
- "Check that the checkbox shows proper visual feedback for checked state"
38+
- "Confirm the form structure remained intact"
39+
- "Ensure the checkbox for Extra Cheese was specifically targeted and checked"
40+
41+
metadata:
42+
tags: ["action", "checkbox", "check", "form", "httpbin"]
43+
priority: "high"
44+
timeout: 45000
45+
retries: 2
46+
flaky: false
47+
owner: "devtools-team"
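
Review note on the test definitions above: each file sets a top-level `timeout` and a separate `metadata.timeout`. In `checkbox-001` the two differ (45000 vs 60000), while in `checkbox-002` they agree (45000). The diff does not say whether the two fields are meant to match, so the following is only a sketch of a consistency check under the assumption that a mismatch is worth flagging. It works on already-parsed definitions (plain dicts), and `timeout_mismatches` is a hypothetical helper.

```python
from typing import Iterable, List, Mapping, Tuple


def timeout_mismatches(definitions: Iterable[Mapping]) -> List[Tuple[str, int, int]]:
    """Return (id, top_level_timeout, metadata_timeout) for definitions where both
    timeouts are present and differ."""
    mismatches = []
    for definition in definitions:
        top = definition.get("timeout")
        meta = definition.get("metadata", {}).get("timeout")
        if top is not None and meta is not None and top != meta:
            mismatches.append((definition.get("id"), top, meta))
    return mismatches


if __name__ == "__main__":
    # Values copied from the checkbox-001 and checkbox-002 definitions in this commit.
    definitions = [
        {"id": "checkbox-001", "timeout": 45000, "metadata": {"timeout": 60000}},
        {"id": "checkbox-002", "timeout": 45000, "metadata": {"timeout": 45000}},
    ]
    print(timeout_mismatches(definitions))  # [('checkbox-001', 45000, 60000)]
```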
