Thank you for your interest in contributing to openbench! We welcome contributions from the community and are grateful for your support in making language model evaluation more accessible and reliable.
Prerequisites:

- Python 3.10+
- UV package manager
- Git
```bash
# Clone and set up
git clone https://github.com/groq/openbench.git
cd openbench
uv venv && uv sync --dev
source .venv/bin/activate

# CRITICAL: Install pre-commit hooks (CI will fail without this!)
pre-commit install

# Run tests to verify setup
pytest
```

Remember to run `pre-commit install` after `uv sync --dev`. CI will fail if you skip this step!
Some benchmarks require additional dependencies that are not included in the core package. Use UV dependency groups to install them when needed:
```bash
# Install core dependencies only (runs most benchmarks)
uv sync

# Install specific benchmark dependencies
uv sync --group scicode          # For SciCode benchmark
uv sync --group jsonschemabench  # For JSONSchemaBench

# Install everything, including dev tools and optional benchmark groups
uv sync --all-groups
```

Each PR must address a single concern. This helps us:
- Review changes more effectively
- Maintain a clean git history
- Easily revert changes if needed
- Understand the purpose of each change
Examples of single-concern PRs:
- ✅ Add support for a new benchmark
- ✅ Fix a specific bug in the MMLU scorer
- ✅ Refactor the math canonicalization utility
- ❌ Add new benchmark AND fix unrelated bug
- ❌ Refactor multiple unrelated components
We value clean, modular code with clear boundaries between components:
- Each module should have a single, well-defined purpose
- Avoid tight coupling between components
- Use dependency injection where appropriate
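As a small, hypothetical sketch of what dependency injection means here (the `Evaluator` class and `exact_match` function below are illustrative, not actual openbench code): the scorer is passed in rather than hard-coded, so the two components stay loosely coupled and either can be swapped or tested in isolation.

```python
# Illustrative dependency-injection sketch: the scorer is injected into
# the evaluator instead of being constructed inside it.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Evaluator:
    scorer: Callable[[str, str], bool]  # injected dependency

    def score(self, answer: str, target: str) -> bool:
        return self.scorer(answer, target)


def exact_match(answer: str, target: str) -> bool:
    """One possible scorer; any callable with this shape can be injected."""
    return answer.strip() == target.strip()


evaluator = Evaluator(scorer=exact_match)
print(evaluator.score(" 4 ", "4"))  # True
```

Swapping in a different scorer (fuzzy matching, numeric tolerance) then requires no change to `Evaluator` itself.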
We use Conventional Commits for all commit messages. This provides a clear, structured way to communicate changes.
Format: `<type>(<scope>): <subject>`

Supported types:

- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
- `style`: Code style changes (formatting, missing semicolons, etc.)
- `refactor`: Code refactoring without changing functionality
- `perf`: Performance improvements
- `test`: Adding or updating tests
- `build`: Build system or dependency changes
- `ci`: CI/CD configuration changes
- `chore`: Other changes that don't modify src or test files
Examples:

```
feat(mmlu): add support for MMLU-Pro benchmark
fix(scorer): handle edge case in math canonicalization
docs(readme): update installation instructions
refactor(humaneval): extract common sandbox logic
test(gpqa): add unit tests for diamond scorer
perf(eval): optimize parallel sample processing
```

The scope should indicate the component or area affected:

- Benchmark names: `mmlu`, `humaneval`, `gpqa`, etc.
- Components: `cli`, `scorer`, `solver`, `common`
- Infrastructure: `docker`, `ci`, `deps`
For complex changes, add a body to explain:
- What changed and why
- Any breaking changes
- Related issues
Example:

```
feat(aime): add support for AIME 2025 problems

- Add dataset loader for AIME 2025
- Update math scorer to handle new problem formats
- Include official solutions for verification

Closes #123
```
- Check existing issues and PRs to avoid duplicates
- For significant changes, open an issue first to discuss
- Fork the repository and create a feature branch
1. **Create a feature branch**

   ```bash
   git checkout -b feat/add-new-benchmark
   ```

2. **Make your changes**

   - Follow the existing code style
   - Add tests for new functionality
   - Update documentation as needed

3. **Test your changes**

   ```bash
   # Run all tests
   pytest

   # Run integration tests (requires API keys)
   pytest -m integration

   # Run pre-commit hooks (REQUIRED)
   pre-commit run --all-files

   # Test your specific changes
   bench eval <your-benchmark> --limit 5
   ```

4. **Commit with conventional commits**

   ```bash
   git commit -m "feat(benchmark): add support for XYZ benchmark"
   ```

5. **Push to your fork**

   ```bash
   git push origin feat/add-new-benchmark
   ```

6. **Create a Pull Request**

   - Use a clear, descriptive title following conventional commit format
   - Fill out the PR template completely
   - Link any related issues
   - Ensure all CI checks pass

7. **Use the right PR title format.** Since we use squash and merge, your PR title becomes the commit message, so it must follow conventional commit format:

   - ✅ `feat(mmlu): add MMLU-Pro support`
   - ✅ `fix(cli): handle missing API key gracefully`
   - ❌ `Updated MMLU benchmark`
   - ❌ `Various fixes`
- A maintainer will review your PR
- Address any feedback or requested changes
- Once approved, we'll squash and merge your PR
- Your contribution will be part of the next release!
To add a new benchmark:

- Create a new evaluation task in `src/openbench/evals/`
- Add a dataset loader in `src/openbench/datasets/` if needed
- Add a custom scorer in `src/openbench/scorers/` if needed
- Add a custom metric in `src/openbench/metrics/` if needed
- Add a custom solver in `src/openbench/solvers/` if needed
- Register benchmark metadata in `src/openbench/config.py`
- Import your task in `src/openbench/_registry.py`
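The registration step can be sketched with a simple registry pattern. This is illustrative only: the names below are made up, and the actual internals of `_registry.py` may differ (in openbench, the constructor would build an inspect-ai task rather than a dict).

```python
# Illustrative sketch of a name -> task-constructor registry.
from typing import Callable, Dict

TASK_REGISTRY: Dict[str, Callable[[], object]] = {}


def register_task(name: str) -> Callable:
    """Decorator that records a task constructor under a benchmark name."""
    def decorator(fn: Callable[[], object]) -> Callable[[], object]:
        TASK_REGISTRY[name] = fn
        return fn
    return decorator


@register_task("my_benchmark")
def my_benchmark() -> object:
    # In openbench this would build and return the benchmark's task object.
    return {"name": "my_benchmark"}


print(TASK_REGISTRY["my_benchmark"]())  # {'name': 'my_benchmark'}
```

The registry lets the CLI look up any benchmark by name without importing every eval module by hand.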
Example File Structure:
```
openbench/
├── src/openbench/
│   ├── _cli/          # CLI commands
│   │   ├── list.py
│   │   ├── describe.py
│   │   ├── eval.py
│   │   └── view.py
│   ├── datasets/      # Data loaders
│   │   ├── mmlu.py
│   │   ├── humaneval.py
│   │   └── ...
│   ├── evals/         # Benchmark tasks
│   │   ├── mmlu.py
│   │   ├── humaneval.py
│   │   └── ...
│   ├── metrics/       # Custom metrics
│   │   ├── mmlu.py
│   │   ├── humaneval.py
│   │   └── ...
│   ├── scorers/       # Scoring functions
│   │   ├── mmlu.py
│   │   ├── humaneval.py
│   │   └── ...
│   ├── solvers/       # Solver functions
│   │   ├── mmlu.py
│   │   ├── humaneval.py
│   │   └── ...
│   ├── utils/         # Utilities
│   ├── _registry.py   # Task registry
│   └── config.py      # Configuration
├── tests/             # Test suite
├── pyproject.toml     # Package config
└── README.md
```
openbench benchmarks share common infrastructure, which determines how dependencies are grouped:
- Core dependencies (inspect-ai, datasets, scipy, numpy): Required by multiple benchmarks
- Optional dependencies: Specific to individual benchmarks (e.g., scicode, jsonschema)
- Most benchmarks can run with just core dependencies
To add a new model provider:

- Create a provider file in `src/openbench/model/_providers/`
- Follow existing provider patterns (see `ai21.py`, `cerebras.py`, etc.)
- Add environment variable documentation
- Test with multiple benchmarks
- Update the provider table in README.md
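Environment-variable handling is usually the first thing a provider needs. A minimal sketch, assuming a made-up `MYPROVIDER_API_KEY` variable (real providers should follow the patterns in `ai21.py` and friends rather than this function):

```python
# Minimal sketch: read the provider's API key from the environment and
# fail with an actionable message when it is missing.
# MYPROVIDER_API_KEY is a hypothetical variable name.
import os


def get_api_key(env_var: str = "MYPROVIDER_API_KEY") -> str:
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before running evals"
        )
    return key
```

Failing fast with the variable name in the message is what makes the CLI error easy to act on.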
- **UV**: Package manager (not pip). Use `uv add "package>=version"` for dependencies (except `inspect-ai`, which should remain pinned)
- **Ruff**: Linting and formatting; replaces Black, isort, and flake8
- **MyPy**: Type checking; required for all new code
- **Pre-commit**: Automated code quality checks; must pass before commits
- **Pytest**: Testing framework with integration test markers
- Follow PEP 8 with a line length of 88 characters (Black default)
- Use type hints for all function signatures
- Write docstrings for all public functions and classes
- Prefer composition over inheritance
- Keep functions small and focused
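A function following these conventions might look like this (a hypothetical helper, not taken from the codebase): typed signature, docstring, and a single small responsibility.

```python
def accuracy(correct: int, total: int) -> float:
    """Return the fraction of correctly answered samples.

    Args:
        correct: Number of correctly answered samples.
        total: Total number of samples scored.

    Raises:
        ValueError: If total is not positive.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


print(accuracy(1, 2))  # 0.5
```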
- Write unit tests for all new functionality
- Include integration tests for new benchmarks
- Aim for high test coverage
- Test edge cases and error conditions
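A unit test for a new helper might look like this, in pytest style (`normalize` is a hypothetical example; real scorers live in `src/openbench/scorers/`). Note the dedicated test for the empty-string edge case.

```python
# Hypothetical helper plus pytest-style tests covering an edge case.
def normalize(answer: str) -> str:
    """Lowercase and strip whitespace before comparison."""
    return answer.strip().lower()


def test_normalize_basic():
    assert normalize("  ANSWER ") == "answer"


def test_normalize_empty_string():
    # Edge case: empty input should stay empty, not raise.
    assert normalize("") == ""
```

Pytest discovers any `test_*` function automatically, so these run with a plain `pytest` invocation.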
We have structured issue templates to help you report problems effectively:
Use our bug report template which includes:
- openbench version and environment details
- Exact command that failed
- Expected vs actual behavior
- Error logs and reproduction steps
Use our feature request template for:
- New benchmarks/evaluations
- New model providers
- CLI enhancements
- Performance improvements
- API/SDK features
- Integration requests
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to:
- Be respectful and inclusive
- Welcome newcomers and help them get started
- Focus on constructive criticism
- Respect differing viewpoints and experiences
By contributing to openbench, you agree that your contributions will be licensed under the MIT License.