Benchmark Zoo is the definitive collection on every benchmark framework out there. We implemented the same simple dummy benchmark in 42 different perf test frameworks, load testers, even unit testing frameworks.
The outputs of each benchmarking tool are stored in tests/data.
Finally, we used those outputs to generate a parser to go with each of the benchmark frameworks. Therefore if you need to extract benchmark results from a log file or from a separate artifact file, then the goal is that this repository should always have the parser you need.
And you don't even need to tell benchzoo which parser to use.
Hand benchzoo.sniff(content) any benchmark output — all 42
supported frameworks are autodetected from content alone, via a
four-tier signature matcher (JSON top-level keys → XML root element →
CSV header row → distinctive text substring):
asv, benchmark-ips, benchmark-js, benchmarkdotnet, benchmarktools-jl, cargo-bench, catch2, clickbench, custom-csv, custom-json, dotnet-test, gatling, go-test-bench, google-benchmark, hey, hyperfine, jmeter, jmh, junit-go, junit-standard, k6, lighthouse, locust, memtier, mitata, mocha, perf-stat, pgbench, phpbench, playwright, pytest-benchmark, redis-benchmark, sysbench, time, tinybench, vegeta, vitest-bench, wrk, wrk2.
junit-standard covers jest-junit, Maven Surefire / vanilla Java
JUnit, CTest and Catch2's junit reporter — all four emit structurally
indistinguishable <testsuite> XML, and a single shared parser reads
testcase name + time verbatim. gotestsum's junit XML has the same
shape but is distinguished by its <property name="go.version" ...>
fingerprint, which routes to junit-go's Test-prefix-stripping
parser.
Hard invariant: sniff never returns a wrong framework. When content
is genuinely ambiguous it returns None and you fall through to
explicit selection or the LLM parsers below.
As a catch-all, there are also two LLM-backed parsers that extract benchmark results from arbitrary input by prompting an LLM with benchzoo's output schema plus a few worked examples from the corpus:
llm_anthropic— calls the Anthropic API directly. Highest accuracy; requiresANTHROPIC_API_KEYandpip install anthropic.llm_local— calls a local Ollama instance. Default model isqwen2.5-coder:3b; also works with other small coder- or instruction-tuned models (Llama 3.2, Phi-3.5, Gemma 2). Offline; deterministic withtemperature=0.
Both are experimental and non-deterministic by nature. Not appropriate for production change-detection pipelines; great for triage, one-off proprietary formats, or bootstrapping a new deterministic parser.
42 frameworks end-to-end (sample benchmark + workflow +
real CI-captured fixture + parser + ground-truth tests); 370+
passing parser tests. 32 frameworks parse plain stdout / CI-log
output (the *_text parsers) — so results printed to a job log with
no uploaded artifact still ingest. The library is pip install -e .-able;
the corpus runs on every push.
Pre-1.0: parser API may still move; not yet on PyPI.
pip install -e .import benchzoo
content = open("output.json").read()
# 1. Explicit — best when you have out-of-band info (e.g. a CI
# artifact name, a config setting) that tells you which
# framework produced `content`.
results = benchzoo.find_parser("hyperfine", "json").parse(content)
# 2. Sniff-based — best-effort content detection. Returns the
# framework name (string) or None. Never returns a wrong name:
# when the input is ambiguous, it's None and the caller decides
# what to do next.
framework = benchzoo.sniff(content) # e.g. "hyperfine" or None
if framework:
results = benchzoo.find_parser(framework).parse(content)
else:
# 3. Catch-all — hand off to the LLM fallback parsers for
# formats benchzoo doesn't have a dedicated parser for.
from benchzoo.parsers import llm_local
results = llm_local.parse(content)
# Browse the registry directly if you're wiring up your own
# dispatcher.
for framework, formats in benchzoo.PARSERS.items():
print(framework, list(formats))Every parser has the same contract: parse(content: bytes | str) -> list[dict], returning benchzoo's common output schema described in
docs/design.md.
pip install -e ".[dev]"
pytestLLM parser tests skip by default; set BENCHZOO_RUN_LLM_ANTHROPIC=1
(+ ANTHROPIC_API_KEY) or BENCHZOO_RUN_LLM_LOCAL=1 to enable them.
| Language | Framework | Parser(s) |
|---|---|---|
| Rust | criterion | criterion_estimates, criterion_bencher |
| Rust | cargo bench (libtest) | cargo_bench_libtest (shares criterion_bencher) |
| C++ | Google Benchmark | google_benchmark_json, google_benchmark_csv |
| C++ | Catch2 | catch2_xml, junit_standard, catch2_text |
| Java/JVM | JMH | jmh_json, jmh_csv, jmh_text |
| C# / .NET | BenchmarkDotNet | benchmarkdotnet_json, benchmarkdotnet_csv, benchmarkdotnet_text |
| Go | go test -bench |
go_bench_text, go_bench_json |
| Python | pytest-benchmark | pytest_benchmark_json, junit_pytest, pytest_benchmark_text |
| Python | asv (airspeed velocity) | asv, asv_text |
| JS | benchmark.js | benchmark_js |
| JS | tinybench | tinybench |
| JS | mitata | mitata, mitata_text |
| JS/TS | vitest bench | vitest_bench, vitest_bench_text |
| Julia | BenchmarkTools.jl | benchmarktools_jl, benchmarktools_jl_text |
| PHP | PHPBench | phpbench_xml, phpbench_text |
| Ruby | benchmark-ips | benchmark_ips, benchmark_ips_text |
| Framework | Parser(s) |
|---|---|
| k6 | k6_summary, k6_ndjson |
| wrk | wrk |
| wrk2 | wrk2 |
| hey | hey |
| vegeta | vegeta_json |
| Locust | locust_csv, locust_text |
| JMeter | jmeter_csv, jmeter_text |
| Gatling | gatling_log |
| Framework | Parser(s) |
|---|---|
| pgbench | pgbench |
| sysbench | sysbench |
| redis-benchmark | redis_benchmark_csv, redis_benchmark_text |
| memtier_benchmark | memtier_json, memtier_text |
| ClickBench | clickbench |
| Framework | Parser(s) |
|---|---|
| Lighthouse | lighthouse |
| Framework | Parser(s) |
|---|---|
| mocha (--reporter json) | mocha_json |
| Jest (jest-junit) | junit_standard, junit_jest_text |
| go test (gotestsum) | junit_go, junit_go_text |
| JUnit 5 / Maven Surefire | junit_standard |
| dotnet test (TRX) | dotnet_test_trx |
| CTest (--output-junit) | junit_standard, ctest_text |
| Playwright | playwright_json |
| Framework | Parser(s) |
|---|---|
| hyperfine | hyperfine_json, hyperfine_csv, hyperfine_text |
| Unix time (bash + GNU) | time_builtin, time_gnu |
| perf stat | perf_stat_text |
| custom JSON | custom_bigger_is_better, custom_smaller_is_better |
| custom CSV | custom_csv |
benchzoo.parsers.llm_anthropic and benchzoo.parsers.llm_local (see
intro above) cover formats that don't match any of the built-in
parsers. Their evaluation harness lives at
tests/parsers/test_llm.py — set
BENCHZOO_RUN_LLM_ANTHROPIC=1 (+ ANTHROPIC_API_KEY) or
BENCHZOO_RUN_LLM_LOCAL=1 to run it against the real fixture corpus.
The process is documented in
docs/workflow-conventions.md and the
Definition of Done
enumerates the six surfaces every framework must ship.
The short version: implement the
canonical sample benchmark (four fixed
tests, the simplest of which sleeps for exactly 2.15 seconds), put it
under frameworks/<category>/<name>/, add a .github/workflows/<name>.yml
that captures the output as an artifact, write the parser against the
real captured fixture (grep for 2.15 in the output — that's test 1's
wall time), and add ground-truth tests.
To iterate on a parser without round-tripping through GitHub CI,
run the workflow locally via nektos/act.
See docs/act.md for the install, image selection,
and full invocation recipe.
- Design:
docs/design.md— architecture, data model, parser contract. - Parser targets:
docs/parser-targets.md— what's in scope, what's shipped, what's long-tailed. - Canonical sample benchmark:
docs/sample-benchmark.md— the four fixed tests every framework implements. - Workflow conventions:
docs/workflow-conventions.md— CI conventions for the corpus. - Running workflows locally:
docs/act.md—nektos/actinstall, image selection, running a single workflow, simulatingscheduletriggers, limitations, and secrets handling.