Eval Protocol (EP) is an open solution for reinforcement fine-tuning (RFT) of existing agents — across any language, container, or framework. This quickstart uses it to evaluate and fine-tune VLM browser agents, using Kernel serverless browsers for the environment and Fireworks for VLM inference.
Requires Python 3.10+.
1. **Clone and install**

   ```shell
   git clone https://github.com/kernel/kernel-eval-protocol-quickstart.git
   cd kernel-eval-protocol-quickstart
   python -m venv .venv
   source .venv/bin/activate
   uv pip install -r requirements.txt
   ```
2. **Set API keys**

   Copy `.env.example` to `.env` and fill in the three keys: Kernel (serverless browser), Fireworks (VLM inference), and OpenAI (WebJudge scoring).

   ```shell
   cp .env.example .env
   # Edit .env with your keys
   ```
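Before burning a browser rollout, it helps to confirm the keys are actually loaded. The variable names below are assumptions -- check `.env.example` for the exact names the repo uses:

```python
import os

# Assumed variable names -- confirm against .env.example in the repo.
REQUIRED_KEYS = ["KERNEL_API_KEY", "FIREWORKS_API_KEY", "OPENAI_API_KEY"]


def missing_keys(env=None):
    """Return the required keys that are absent or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]


if __name__ == "__main__":
    missing = missing_keys()
    print("All API keys set." if not missing else f"Missing: {', '.join(missing)}")
```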
3. **Create a browser pool**

   Browsers must stay alive during VLM inference, so use a long inactivity timeout. The default configuration runs up to 16 rollouts in parallel, so a pool of 50 is a good fit.

   ```shell
   kernel pools create eval-browser-pool --size 50 --timeout 1800 --stealth --fill-rate 25
   ```
4. **Start the local monitoring server**

   In a separate terminal, start the Eval Protocol UI so you can monitor runs in real time:

   ```shell
   source .venv/bin/activate
   .venv/bin/ep logs
   ```

   Keep this running -- when you kick off pytest in the next step, open http://localhost:8000 to watch progress, view live results, and explore the pivot/table views that pytest prints to the console.
5. **Run the evaluation**

   ```shell
   pytest test_agent_auth.py -vs
   ```

   By default, the test runs 4 rollouts, with at most 16 rollouts and 16 evaluations in parallel. Use these flags to change the row count or concurrency:

   - More rows: `pytest test_agent_auth.py -vs --ep-max-rows=20`
   - Limit concurrent browser rollouts (e.g. groups of 5): `--ep-max-concurrent-rollouts=5`
   - Limit concurrent WebJudge evaluations: `--ep-max-concurrent-evaluations=5`
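These concurrency caps behave like counting semaphores around each stage. A minimal sketch of the pattern, assuming nothing about EP's internals (the helper names here are made up):

```python
import asyncio


async def run_all(tasks, max_concurrent_rollouts=5):
    # A semaphore caps in-flight rollouts, mirroring what a flag like
    # --ep-max-concurrent-rollouts does (illustrative only, not EP's code).
    sem = asyncio.Semaphore(max_concurrent_rollouts)

    async def one_rollout(task):
        async with sem:              # at most N rollouts at a time
            await asyncio.sleep(0)   # stand-in for the real browser rollout
            return f"done: {task}"

    return await asyncio.gather(*(one_rollout(t) for t in tasks))
```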
The included dataset (`tasks.jsonl`) contains 469 Agent Auth tasks. Each task asks the agent to navigate to a website, find its login or registration page, and identify the required input fields -- without typing credentials or submitting any forms. For each task:

- `KernelBrowserRolloutProcessor` acquires a browser from the pool, navigates to the task URL, runs the VLM agent loop (screenshot → predict → execute → repeat), captures the trajectory, then releases the browser.
- The test function scores each trajectory with WebJudge (LLM-as-judge) against the evaluation rubric in `agent_auth/config.py`.
- Results are reported by pytest / Eval Protocol.
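The screenshot → predict → execute loop can be sketched as follows. The method names on `browser` and `agent` are hypothetical stand-ins; the real logic lives in `core/agent_loop.py` and `kernel_browser_rollout_processor.py`:

```python
def run_rollout(browser, task_url, agent, max_steps=15):
    # Illustrative sketch of the rollout loop described above; the `browser`
    # and `agent` interfaces here are hypothetical, not the repo's actual API.
    browser.navigate(task_url)
    trajectory = []
    for _ in range(max_steps):
        screenshot = browser.screenshot()   # capture current page state
        action = agent.predict(screenshot)  # VLM chooses the next action
        trajectory.append((screenshot, action))
        if action == "done":                # agent signals completion
            break
        browser.execute(action)             # click / type / scroll ...
    return trajectory
```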
```
┌─────────────────────────────────────────────────────────────────┐
│                          Eval Protocol                          │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  @evaluation_test(...)                                   │   │
│  │  async def test_agent_auth(row):                         │   │
│  │      trajectory = get_trajectory(row)                    │   │
│  │      score = webjudge.evaluate(trajectory)               │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                │                                │
│                                ▼                                │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │         KernelBrowserRolloutProcessor                    │   │
│  │  1. Acquire browser from Kernel pool                     │   │
│  │  2. Navigate to initial URL                              │   │
│  │  3. Run agent loop (screenshot → predict → execute)      │   │
│  │  4. Capture trajectory, release browser                  │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
                  ┌─────────────────────────────┐
                  │     Kernel Browser Pool     │
                  │  ┌─────┐  ┌─────┐  ┌─────┐  │
                  │  │ 🌐  │  │ 🌐  │  │ 🌐  │  │
                  │  └─────┘  └─────┘  └─────┘  │
                  └─────────────────────────────┘
```
RFT produces a smaller model trained specifically on the browser-agent actions that work for your tasks, so you can run cheaper inference without losing task performance. Create a reinforcement fine-tuning job from evaluation results:

1. Run `pytest test_agent_auth.py -vs` to generate Eval Protocol results from your task dataset.
2. Eval scoring uses `AGENT_AUTH_EVALUATION_CRITERIA` (via WebJudge) to produce the success/failure signal.
3. Run `ep create rft ...` to build the training dataset from those evaluation results and start an RFT job.
4. After training completes, evaluate the new model again with the same `test_agent_auth.py` flow.

```shell
ep create rft \
  --base-model accounts/fireworks/models/qwen3-vl-8b-instruct \
  --chunk-size 50 \
  --max-context-length 32768 \
  --batch-size 32768 \
  --epochs 4
```

When you change your evaluation code (e.g. `test_agent_auth.py`, prompts, or WebJudge config), upload the updated evaluator so Fireworks uses it for RFT jobs and remote runs:
```shell
ep upload --force -y
```

`--force` overwrites the existing evaluator with the same ID. `-y` runs non-interactively (no prompts).
Local pytest always uses your local code; only Fireworks (e.g. RFT job validation) uses the uploaded version.
After the RFT job completes, Fireworks gives you a new model ID. To evaluate that model instead of the default, set it in `test_agent_auth.py` in the `@evaluation_test` decorator:
```python
completion_params=[
    {"model": "accounts/fireworks/models/your-rft-model-id"},
],
```

Then run the evaluation as usual: `pytest test_agent_auth.py -vs`.
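The list form of `completion_params` suggests that several entries can be evaluated over the same rows in one run, which would make a before/after comparison easy -- worth verifying against the Eval Protocol docs before relying on it:

```python
# Hypothetical side-by-side comparison: one entry per model to evaluate.
completion_params=[
    {"model": "accounts/fireworks/models/qwen3-vl-8b-instruct"},   # base model
    {"model": "accounts/fireworks/models/your-rft-model-id"},      # fine-tuned
],
```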
Use `KernelBrowserRolloutProcessor` with your own dataset and scorer:
```python
from eval_protocol.pytest import evaluation_test
from eval_protocol.models import EvaluateResult
from kernel_browser_rollout_processor import (
    KernelBrowserRolloutProcessor,
    decode_screenshots,
)
from core.reward_models.webjudge import Trajectory, WebJudge
from agent_auth.actions import AGENT_AUTH_ACTIONS
from agent_auth.config import get_agent_auth_system_prompt


@evaluation_test(
    input_dataset=["your_tasks.jsonl"],
    rollout_processor=KernelBrowserRolloutProcessor(
        pool_name="your-pool",
        max_steps=15,
        system_prompt=get_agent_auth_system_prompt(),
        extra_actions=AGENT_AUTH_ACTIONS,
    ),
    completion_params=[{"model": "accounts/fireworks/models/qwen3-vl-30b-a3b-thinking"}],
)
async def test_your_evaluation(row):
    extra = row.execution_metadata.extra
    screenshots = decode_screenshots(extra["screenshots_b64"])
    actions = extra["action_history"]
    messages = row.messages
    score = your_scorer(screenshots, actions)
    row.evaluation_result = EvaluateResult(score=score, reason="...")
    return row
```

Repository layout:

```
kernel-eval-protocol-quickstart/
├── core/
│   ├── agent.py
│   ├── agent_loop.py
│   ├── browser.py
│   ├── actions.py
│   ├── prompts.py
│   ├── tracking.py
│   ├── utils.py
│   └── reward_models/
│       ├── base.py
│       └── webjudge.py
├── agent_auth/
│   ├── actions.py
│   └── config.py
├── kernel_browser_rollout_processor.py
├── test_agent_auth.py
├── tasks.jsonl
├── requirements.txt
├── pytest.ini
└── README.md
```
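The `your_scorer` placeholder in the custom-evaluation example is whatever fits your task. A minimal heuristic sketch -- entirely hypothetical, and assuming each action is a dict with a `"type"` field, which you should check against your trajectory format:

```python
def your_scorer(screenshots, actions, max_steps=15):
    """Hypothetical heuristic scorer: return a score in [0, 1] that rewards
    trajectories ending in a "done" action and taking fewer steps."""
    if not actions:
        return 0.0
    finished = actions[-1].get("type") == "done"   # assumed action schema
    efficiency = max(1.0 - (len(actions) - 1) / max_steps, 0.0)
    return round((0.5 if finished else 0.0) + 0.5 * efficiency, 3)
```

In practice you would replace this with task-specific checks (e.g. whether the identified fields match ground truth) or an LLM judge such as WebJudge.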
- Eval Protocol — Pytest-based LLM evaluation framework
- Fireworks — VLM inference (e.g. Qwen3-VL)
- Kernel — Browser-as-a-service