This repository contains benchmark evaluation infrastructure for OpenHands agents. It provides standardized evaluation pipelines for testing agent capabilities across various real-world tasks.
| Benchmark | Description | Status |
|---|---|---|
| SWE-Bench | Software engineering tasks from GitHub issues | ✅ Active |
| GAIA | General AI assistant tasks requiring multi-step reasoning | ✅ Active |
See the individual benchmark directories for detailed usage instructions.
Before running any benchmarks, you need to set up the environment and ensure the local Agent SDK submodule is initialized.
```bash
make build
```

**📦 Submodule & Environment Setup**
The Benchmarks project uses a local git submodule for the OpenHands Agent SDK.
This ensures your code runs against a specific, reproducible commit.
Run this once after cloning (`make build` already does it for you):
```bash
git submodule update --init --recursive
```

This command will:
- clone the SDK into `vendor/software-agent-sdk/`
- check out the exact commit pinned by this repo
- make it available for local development (`uv sync` will install from the local folder)
If you ever clone this repository again, remember to re-initialize the submodule with the same command.
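To confirm the submodule is initialized and pinned to a commit, a standard git check is enough (a leading `-` in the output means it has not been initialized yet):

```bash
git submodule status vendor/software-agent-sdk
```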
Once the submodule is set up, install dependencies via uv:
```bash
make build
```

This runs `uv sync` and ensures the `openhands-*` packages (SDK, tools, workspace, agent-server) are installed from the local workspace declared in `pyproject.toml`.
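To double-check that the `openhands-*` packages actually ended up in the environment, you can list them (this assumes `uv` is on your `PATH`):

```bash
uv pip list | grep -i openhands
```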
If you want to update to a newer version of the SDK:
```bash
cd vendor/software-agent-sdk
git fetch
git checkout <new_commit_or_branch>
cd ../..
git add vendor/software-agent-sdk
git commit -m "Update software-agent-sdk submodule to <new_commit_sha>"
```

Then re-run:

```bash
make build
```

to rebuild your environment with the new SDK code.
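To confirm that the pin recorded in this repository matches what is checked out in the submodule, two standard git commands can be compared:

```bash
# Commit recorded by the superproject (the pinned SHA)
git ls-tree HEAD vendor/software-agent-sdk
# Commit currently checked out inside the submodule
git -C vendor/software-agent-sdk rev-parse HEAD
```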
All benchmarks require an LLM configuration file. Define your LLM config as a JSON file whose fields follow the model fields of the `LLM` class.
Example (`.llm_config/example.json`):

```json
{
  "model": "litellm_proxy/anthropic/claude-sonnet-4-20250514",
  "base_url": "https://llm-proxy.eval.all-hands.dev",
  "api_key": "YOUR_API_KEY_HERE"
}
```
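If you would rather not paste the key into the file by hand, a small shell sketch like the following can generate it; the `LLM_API_KEY` environment variable and the `.llm_config/example.json` path are illustrative choices, not requirements of the tooling:

```bash
# Write an LLM config using an API key taken from the environment
mkdir -p .llm_config
cat > .llm_config/example.json <<EOF
{
  "model": "litellm_proxy/anthropic/claude-sonnet-4-20250514",
  "base_url": "https://llm-proxy.eval.all-hands.dev",
  "api_key": "${LLM_API_KEY}"
}
EOF
```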
Validate your configuration:

```bash
uv run validate-cfg .llm_config/YOUR_CONFIG_PATH.json
```

After setting up the environment and configuring your LLM, see the individual benchmark directories for specific usage instructions.
Benchmarks support two workspace types for running evaluations:
**Docker workspace**: Uses local Docker containers to run agent evaluations. Images are built locally on demand.
- Pros: No additional setup required, works offline
- Cons: Resource-intensive on local machine, slower for large-scale evaluations
- Use case: Development, testing, small-scale evaluations
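Before a local run, a quick sanity check that a Docker daemon is reachable can save time (plain Docker CLI, nothing repo-specific):

```bash
docker info >/dev/null 2>&1 && echo "Docker daemon reachable" || echo "Docker daemon not reachable"
```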
**Remote workspace**: Uses a remote runtime API to provision containers in a cloud environment, enabling massive parallelization.
- Pros: Scalable to hundreds of parallel workers, no local resource constraints
- Cons: Requires pre-built images and API access
- Use case: Large-scale evaluations, benchmarking runs
How the remote workspace works:

- **Pre-built Agent Images**: Agent-server images must be pre-built for a specific SDK commit (SHA) and pushed to a public container registry (e.g., `ghcr.io/openhands/eval-agent-server`)
- **Runtime API**: The remote workspace connects to a runtime API service (default: `https://runtime.eval.all-hands.dev`) that provisions containers on-demand
- **Image Resolution**: Before starting an evaluation, the system verifies that the required image exists in the registry with the correct tag format: `{IMAGE}:{SDK_SHA}-{CUSTOM_TAG}{SUFFIX}` (see the sketch after this list)
- **Parallel Execution**: Each evaluation instance runs in its own isolated container, allowing for massive parallelization (e.g., 32+ concurrent workers)
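For illustration, the tag for the pinned submodule commit could be checked by hand before launching a run. This is only a sketch: the `CUSTOM_TAG`/`SUFFIX` values below are placeholders, and whether the short commit hash is the right `SDK_SHA` form should be confirmed against your image builds:

```bash
# Sketch: check that an agent-server image exists for the pinned SDK commit
IMAGE="ghcr.io/openhands/eval-agent-server"
SDK_SHA="$(git -C vendor/software-agent-sdk rev-parse --short HEAD)"
TAG="${SDK_SHA}-example-tag"   # placeholder for {CUSTOM_TAG}{SUFFIX}
docker manifest inspect "${IMAGE}:${TAG}" >/dev/null \
  && echo "Found ${IMAGE}:${TAG}" \
  || echo "Missing ${IMAGE}:${TAG}"
```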
Requirements for the remote workspace:

- **Pre-built Images**: Images must be built and pushed to a public registry
  - In this repository, add the `build-swebench` label to a PR to trigger image builds
  - Images are tagged with the SDK SHA from the `vendor/software-agent-sdk` submodule
- **Runtime API Key**: Set the `RUNTIME_API_KEY` environment variable:

  ```bash
  export RUNTIME_API_KEY="your-api-key-here"
  ```

- **Optional Configuration** (see the sketch after this list):
  - `RUNTIME_API_URL`: Override the default API endpoint (default: `https://runtime.eval.all-hands.dev`)
  - `SDK_SHORT_SHA`: Override the SDK SHA used for image selection (default: auto-detected from the submodule)
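Putting the environment together, a minimal shell setup might look like the following; treating the submodule's short commit hash as the value for `SDK_SHORT_SHA` is an assumption based on the auto-detection described above:

```bash
# Required for the remote workspace
export RUNTIME_API_KEY="your-api-key-here"
# Optional overrides (defaults shown in the list above)
export RUNTIME_API_URL="https://runtime.eval.all-hands.dev"
# Assumption: the auto-detected value is the submodule's short commit hash
export SDK_SHORT_SHA="$(git -C vendor/software-agent-sdk rev-parse --short HEAD)"
```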
See individual benchmark READMEs for specific usage examples.
- Original OpenHands: https://github.com/OpenHands/OpenHands/
- Agent SDK: https://github.com/OpenHands/software-agent-sdk
- SWE-Bench: https://www.swebench.com/