This repository contains benchmark evaluation infrastructure for OpenHands agents. It provides standardized evaluation pipelines for testing agent capabilities across various real-world tasks.
| Benchmark | Description | Status |
|---|---|---|
| SWE-Bench | Software engineering tasks from GitHub issues | ✅ Active |
| GAIA | General AI assistant tasks requiring multi-step reasoning | ✅ Active |
See the individual benchmark directories for detailed usage instructions.
Before running any benchmarks, you need to set up the environment and ensure the local Agent SDK submodule is initialized.
```bash
make build
```

**📦 Submodule & Environment Setup**
The Benchmarks project uses a local git submodule for the OpenHands Agent SDK.
This ensures your code runs against a specific, reproducible commit.
Run this once after cloning (`make build` already does it for you):
```bash
git submodule update --init --recursive
```

This command will:
- clone the SDK into `vendor/software-agent-sdk/`
- check out the exact commit pinned by this repo
- make it available for local development (`uv sync` will install from the local folder)
If you ever clone this repository again, remember to re-initialize the submodule with the same command.
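To confirm the submodule is initialized and pinned to a commit, a standard git check is enough (a leading `-` in the output means it has not been initialized yet):

```bash
git submodule status vendor/software-agent-sdk
```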
Once the submodule is set up, install dependencies via uv:
```bash
make build
```

This runs `uv sync` and ensures the `openhands-*` packages (SDK, tools, workspace, agent-server) are installed from the local workspace declared in `pyproject.toml`.
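To double-check that the `openhands-*` packages actually ended up in the environment, you can list them (this assumes `uv` is on your `PATH`):

```bash
uv pip list | grep -i openhands
```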
If you want to update to a newer version of the SDK:
```bash
cd vendor/software-agent-sdk
git fetch
git checkout <new_commit_or_branch>
cd ../..
git add vendor/software-agent-sdk
git commit -m "Update software-agent-sdk submodule to <new_commit_sha>"
```

Then re-run:

```bash
make build
```

to rebuild your environment with the new SDK code.
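To confirm that the pin recorded in this repository matches what is checked out in the submodule, two standard git commands can be compared:

```bash
# Commit recorded by the superproject (the pinned SHA)
git ls-tree HEAD vendor/software-agent-sdk
# Commit currently checked out inside the submodule
git -C vendor/software-agent-sdk rev-parse HEAD
```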
All benchmarks require an LLM configuration file. Define your LLM config as a JSON file whose fields follow the model fields of the `LLM` class.
Example (`.llm_config/example.json`):

```json
{
  "model": "litellm_proxy/anthropic/claude-sonnet-4-20250514",
  "base_url": "https://llm-proxy.eval.all-hands.dev",
  "api_key": "YOUR_API_KEY_HERE"
}
```
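If you would rather not paste the key into the file by hand, a small shell sketch like the following can generate it; the `LLM_API_KEY` environment variable and the `.llm_config/example.json` path are illustrative choices, not requirements of the tooling:

```bash
# Write an LLM config using an API key taken from the environment
mkdir -p .llm_config
cat > .llm_config/example.json <<EOF
{
  "model": "litellm_proxy/anthropic/claude-sonnet-4-20250514",
  "base_url": "https://llm-proxy.eval.all-hands.dev",
  "api_key": "${LLM_API_KEY}"
}
EOF
```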
Validate your configuration:

```bash
uv run validate-cfg .llm_config/YOUR_CONFIG_PATH.json
```

After setting up the environment and configuring your LLM, see the individual benchmark directories for specific usage instructions.
Benchmarks support two workspace types for running evaluations:
**Docker workspace**: Uses local Docker containers to run agent evaluations. Images are built locally on demand.
- Pros: No additional setup required, works offline
- Cons: Resource-intensive on local machine, slower for large-scale evaluations
- Use case: Development, testing, small-scale evaluations
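Before a local run, a quick sanity check that a Docker daemon is reachable can save time (plain Docker CLI, nothing repo-specific):

```bash
docker info >/dev/null 2>&1 && echo "Docker daemon reachable" || echo "Docker daemon not reachable"
```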
**Remote workspace**: Uses a remote runtime API to provision containers in a cloud environment, enabling massive parallelization.
- Pros: Scalable to hundreds of parallel workers, no local resource constraints
- Cons: Requires pre-built images and API access
- Use case: Large-scale evaluations, benchmarking runs
How the remote workspace works:

- **Pre-built Agent Images**: Agent-server images must be pre-built for a specific SDK commit (SHA) and pushed to a public container registry (e.g., `ghcr.io/openhands/eval-agent-server`)
- **Runtime API**: The remote workspace connects to a runtime API service (default: `https://runtime.eval.all-hands.dev`) that provisions containers on-demand
- **Image Resolution**: Before starting an evaluation, the system verifies that the required image exists in the registry with the correct tag format: `{IMAGE}:{SDK_SHA}-{CUSTOM_TAG}{SUFFIX}` (see the sketch after this list)
- **Parallel Execution**: Each evaluation instance runs in its own isolated container, allowing for massive parallelization (e.g., 32+ concurrent workers)
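For illustration, the tag for the pinned submodule commit could be checked by hand before launching a run. This is only a sketch: the `CUSTOM_TAG`/`SUFFIX` values below are placeholders, and whether the short commit hash is the right `SDK_SHA` form should be confirmed against your image builds:

```bash
# Sketch: check that an agent-server image exists for the pinned SDK commit
IMAGE="ghcr.io/openhands/eval-agent-server"
SDK_SHA="$(git -C vendor/software-agent-sdk rev-parse --short HEAD)"
TAG="${SDK_SHA}-example-tag"   # placeholder for {CUSTOM_TAG}{SUFFIX}
docker manifest inspect "${IMAGE}:${TAG}" >/dev/null \
  && echo "Found ${IMAGE}:${TAG}" \
  || echo "Missing ${IMAGE}:${TAG}"
```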
Requirements for the remote workspace:

- **Pre-built Images**: Images must be built and pushed to a public registry
  - In this repository, add the `build-swebench` label to a PR to trigger image builds
  - Images are tagged with the SDK SHA from the `vendor/software-agent-sdk` submodule
- **Runtime API Key**: Set the `RUNTIME_API_KEY` environment variable:

  ```bash
  export RUNTIME_API_KEY="your-api-key-here"
  ```

- **Optional Configuration** (see the sketch after this list):
  - `RUNTIME_API_URL`: Override the default API endpoint (default: `https://runtime.eval.all-hands.dev`)
  - `SDK_SHORT_SHA`: Override the SDK SHA used for image selection (default: auto-detected from the submodule)
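Putting the environment together, a minimal shell setup might look like the following; treating the submodule's short commit hash as the value for `SDK_SHORT_SHA` is an assumption based on the auto-detection described above:

```bash
# Required for the remote workspace
export RUNTIME_API_KEY="your-api-key-here"
# Optional overrides (defaults shown in the list above)
export RUNTIME_API_URL="https://runtime.eval.all-hands.dev"
# Assumption: the auto-detected value is the submodule's short commit hash
export SDK_SHORT_SHA="$(git -C vendor/software-agent-sdk rev-parse --short HEAD)"
```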
See individual benchmark READMEs for specific usage examples.
- Original OpenHands: https://github.com/OpenHands/OpenHands/
- Agent SDK: https://github.com/OpenHands/software-agent-sdk
- SWE-Bench: https://www.swebench.com/