Thank you for your interest in contributing to openbench! We welcome contributions from the community and are grateful for your support in making language model evaluation more accessible and reliable.
Prerequisites:

- Python 3.10+
- UV package manager
- Git
```bash
# Clone and set up
git clone https://github.com/groq/openbench.git
cd openbench
uv venv && uv sync --dev
source .venv/bin/activate

# CRITICAL: Install pre-commit hooks (CI will fail without this!)
pre-commit install

# Run tests to verify setup
pytest
```

Remember to run `pre-commit install` after `uv sync --dev`. CI will fail if you skip this step!
Some benchmarks require additional dependencies that are not included in the core package. Use UV dependency groups to install them when needed:
```bash
# Install core dependencies only (runs most benchmarks)
uv sync

# Install specific benchmark dependencies
uv sync --group scicode          # For SciCode benchmark
uv sync --group jsonschemabench  # For JSONSchemaBench

# Install everything, including dev tools and optional benchmark groups
uv sync --all-groups
```

Each PR must address a single concern. This helps us:
- Review changes more effectively
- Maintain a clean git history
- Easily revert changes if needed
- Understand the purpose of each change
Examples of single-concern PRs:
- ✅ Add support for a new benchmark
- ✅ Fix a specific bug in the MMLU scorer
- ✅ Refactor the math canonicalization utility
- ❌ Add new benchmark AND fix unrelated bug
- ❌ Refactor multiple unrelated components
We value clean, modular code with clear boundaries between components:
- Each module should have a single, well-defined purpose
- Avoid tight coupling between components
- Use dependency injection where appropriate
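As a small, hypothetical sketch of what dependency injection means here (the `Evaluator` class and `exact_match` function below are illustrative, not actual openbench code): the scorer is passed in rather than hard-coded, so the two components stay loosely coupled and either can be swapped or tested in isolation.

```python
# Illustrative dependency-injection sketch: the scorer is injected into
# the evaluator instead of being constructed inside it.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Evaluator:
    scorer: Callable[[str, str], bool]  # injected dependency

    def score(self, answer: str, target: str) -> bool:
        return self.scorer(answer, target)


def exact_match(answer: str, target: str) -> bool:
    """One possible scorer; any callable with this shape can be injected."""
    return answer.strip() == target.strip()


evaluator = Evaluator(scorer=exact_match)
print(evaluator.score(" 4 ", "4"))  # True
```

Swapping in a different scorer (fuzzy matching, numeric tolerance) then requires no change to `Evaluator` itself.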
We use Conventional Commits for all commit messages. This provides a clear, structured way to communicate changes.
Format: `<type>(<scope>): <subject>`

Supported types:

- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
- `style`: Code style changes (formatting, missing semicolons, etc.)
- `refactor`: Code refactoring without changing functionality
- `perf`: Performance improvements
- `test`: Adding or updating tests
- `build`: Build system or dependency changes
- `ci`: CI/CD configuration changes
- `chore`: Other changes that don't modify src or test files
Examples:

```
feat(mmlu): add support for MMLU-Pro benchmark
fix(scorer): handle edge case in math canonicalization
docs(readme): update installation instructions
refactor(humaneval): extract common sandbox logic
test(gpqa): add unit tests for diamond scorer
perf(eval): optimize parallel sample processing
```

The scope should indicate the component or area affected:

- Benchmark names: `mmlu`, `humaneval`, `gpqa`, etc.
- Components: `cli`, `scorer`, `solver`, `common`
- Infrastructure: `docker`, `ci`, `deps`
For complex changes, add a body to explain:
- What changed and why
- Any breaking changes
- Related issues
Example:

```
feat(aime): add support for AIME 2025 problems

- Add dataset loader for AIME 2025
- Update math scorer to handle new problem formats
- Include official solutions for verification

Closes #123
```
- Check existing issues and PRs to avoid duplicates
- For significant changes, open an issue first to discuss
- Fork the repository and create a feature branch
1. **Create a feature branch**

   ```bash
   git checkout -b feat/add-new-benchmark
   ```

2. **Make your changes**

   - Follow the existing code style
   - Add tests for new functionality
   - Update documentation as needed

3. **Test your changes**

   ```bash
   # Run all tests
   pytest

   # Run integration tests (requires API keys)
   pytest -m integration

   # Run pre-commit hooks (REQUIRED)
   pre-commit run --all-files

   # Test your specific changes
   bench eval <your-benchmark> --limit 5
   ```

4. **Commit with conventional commits**

   ```bash
   git commit -m "feat(benchmark): add support for XYZ benchmark"
   ```

5. **Push to your fork**

   ```bash
   git push origin feat/add-new-benchmark
   ```

6. **Create a Pull Request**

   - Use a clear, descriptive title following conventional commit format
   - Fill out the PR template completely
   - Link any related issues
   - Ensure all CI checks pass

7. **Use the right PR title format.** Since we use squash and merge, your PR title becomes the commit message, so it must follow conventional commit format:

   - ✅ `feat(mmlu): add MMLU-Pro support`
   - ✅ `fix(cli): handle missing API key gracefully`
   - ❌ `Updated MMLU benchmark`
   - ❌ `Various fixes`
- A maintainer will review your PR
- Address any feedback or requested changes
- Once approved, we'll squash and merge your PR
- Your contribution will be part of the next release!
To add a new benchmark:

- Create a new evaluation task in `src/openbench/evals/`
- Add a dataset loader in `src/openbench/datasets/` if needed
- Add a custom scorer in `src/openbench/scorers/` if needed
- Add a custom metric in `src/openbench/metrics/` if needed
- Add a custom solver in `src/openbench/solvers/` if needed
- Register benchmark metadata in `src/openbench/config.py`
- Import your task in `src/openbench/_registry.py`
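The registration step can be sketched with a simple registry pattern. This is illustrative only: the names below are made up, and the actual internals of `_registry.py` may differ (in openbench, the constructor would build an inspect-ai task rather than a dict).

```python
# Illustrative sketch of a name -> task-constructor registry.
from typing import Callable, Dict

TASK_REGISTRY: Dict[str, Callable[[], object]] = {}


def register_task(name: str) -> Callable:
    """Decorator that records a task constructor under a benchmark name."""
    def decorator(fn: Callable[[], object]) -> Callable[[], object]:
        TASK_REGISTRY[name] = fn
        return fn
    return decorator


@register_task("my_benchmark")
def my_benchmark() -> object:
    # In openbench this would build and return the benchmark's task object.
    return {"name": "my_benchmark"}


print(TASK_REGISTRY["my_benchmark"]())  # {'name': 'my_benchmark'}
```

The registry lets the CLI look up any benchmark by name without importing every eval module by hand.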
Example File Structure:
```
openbench/
├── src/openbench/
│   ├── _cli/          # CLI commands
│   │   ├── list.py
│   │   ├── describe.py
│   │   ├── eval.py
│   │   └── view.py
│   ├── datasets/      # Data loaders
│   │   ├── mmlu.py
│   │   ├── humaneval.py
│   │   └── ...
│   ├── evals/         # Benchmark tasks
│   │   ├── mmlu.py
│   │   ├── humaneval.py
│   │   └── ...
│   ├── metrics/       # Custom metrics
│   │   ├── mmlu.py
│   │   ├── humaneval.py
│   │   └── ...
│   ├── scorers/       # Scoring functions
│   │   ├── mmlu.py
│   │   ├── humaneval.py
│   │   └── ...
│   ├── solvers/       # Solver functions
│   │   ├── mmlu.py
│   │   ├── humaneval.py
│   │   └── ...
│   ├── utils/         # Utilities
│   ├── _registry.py   # Task registry
│   └── config.py      # Configuration
├── tests/             # Test suite
├── pyproject.toml     # Package config
└── README.md
```
openbench benchmarks share common infrastructure, which determines how dependencies are grouped:
- Core dependencies (inspect-ai, datasets, scipy, numpy): Required by multiple benchmarks
- Optional dependencies: Specific to individual benchmarks (e.g., scicode, jsonschema)
- Most benchmarks can run with just core dependencies
To add a new model provider:

- Create a provider file in `src/openbench/model/_providers/`
- Follow existing provider patterns (see `ai21.py`, `cerebras.py`, etc.)
- Add environment variable documentation
- Test with multiple benchmarks
- Update the provider table in README.md
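Environment-variable handling is usually the first thing a provider needs. A minimal sketch, assuming a made-up `MYPROVIDER_API_KEY` variable (real providers should follow the patterns in `ai21.py` and friends rather than this function):

```python
# Minimal sketch: read the provider's API key from the environment and
# fail with an actionable message when it is missing.
# MYPROVIDER_API_KEY is a hypothetical variable name.
import os


def get_api_key(env_var: str = "MYPROVIDER_API_KEY") -> str:
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before running evals"
        )
    return key
```

Failing fast with the variable name in the message is what makes the CLI error easy to act on.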
- **UV**: Package manager (not pip). Use `uv add "package>=version"` for dependencies (except `inspect-ai`, which should remain pinned)
- **Ruff**: Linting and formatting; replaces Black, isort, and flake8
- **MyPy**: Type checking; required for all new code
- **Pre-commit**: Automated code quality checks; must pass before commits
- **Pytest**: Testing framework with integration test markers
- Follow PEP 8 with a line length of 88 characters (Black default)
- Use type hints for all function signatures
- Write docstrings for all public functions and classes
- Prefer composition over inheritance
- Keep functions small and focused
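A function following these conventions might look like this (a hypothetical helper, not taken from the codebase): typed signature, docstring, and a single small responsibility.

```python
def accuracy(correct: int, total: int) -> float:
    """Return the fraction of correctly answered samples.

    Args:
        correct: Number of correctly answered samples.
        total: Total number of samples scored.

    Raises:
        ValueError: If total is not positive.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


print(accuracy(1, 2))  # 0.5
```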
- Write unit tests for all new functionality
- Include integration tests for new benchmarks
- Aim for high test coverage
- Test edge cases and error conditions
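A unit test for a new helper might look like this, in pytest style (`normalize` is a hypothetical example; real scorers live in `src/openbench/scorers/`). Note the dedicated test for the empty-string edge case.

```python
# Hypothetical helper plus pytest-style tests covering an edge case.
def normalize(answer: str) -> str:
    """Lowercase and strip whitespace before comparison."""
    return answer.strip().lower()


def test_normalize_basic():
    assert normalize("  ANSWER ") == "answer"


def test_normalize_empty_string():
    # Edge case: empty input should stay empty, not raise.
    assert normalize("") == ""
```

Pytest discovers any `test_*` function automatically, so these run with a plain `pytest` invocation.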
We have structured issue templates to help you report problems effectively:
Use our bug report template which includes:
- openbench version and environment details
- Exact command that failed
- Expected vs actual behavior
- Error logs and reproduction steps
Use our feature request template for:
- New benchmarks/evaluations
- New model providers
- CLI enhancements
- Performance improvements
- API/SDK features
- Integration requests
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to:
- Be respectful and inclusive
- Welcome newcomers and help them get started
- Focus on constructive criticism
- Respect differing viewpoints and experiences
By contributing to openbench, you agree that your contributions will be licensed under the MIT License.