Warning
🚧 Do Not Use: History Will Be Rewritten 🚧
This repo is undergoing major restructuring as we selectively open-source internal tools built at Idea Crafters LLC. Git history will be force-pushed and rewritten multiple times. Do not fork, clone, or depend on this repo in any capacity until we tag a stable release.
ben is a general-purpose benchmarking tool. It answers "which approach is better, and by how much?" for any measurable task: tools, implementations, deps, LLM calls, agents.
go install hop.top/ben/cmd/ben@latest
ben depends on hop.top/kit at kit/v0.4.0-alpha.3, pinned in go.mod
with no local override. For local development against unreleased kit
revisions, use a replace directive in go.mod (a commented-out
example sits near the bottom of the file):
// replace hop.top/kit => ../kit
Uncomment it, point it at your kit checkout, and run go mod tidy.
# Inline run: compare two CLI tools on a task
ben run --task "Find HTTP handlers" --candidates xray,grep --metric latency_ms,quality_score \
--scorer weighted:latency_ms=0.3,quality_score=0.7 --input repo=.
# Suite file: run a named, repeatable benchmark
ben run --suite .ben/suites/codebase-indexing.yaml
# Compare two historical runs
ben compare 01HX...abc 01HX...def
# List last 10 runs for a suite
ben list --suite codebase-indexing --last 10
# Show one run by id
ben show 01HX...abc

| Command | Description |
|---|---|
| ben run | Run benchmark suite or inline task against candidates |
| ben list | List recent runs from local storage |
| ben show <run-id> | Show details of one run |
| ben compare <run-a> <run-b> | Diff two run results side-by-side |
| ben suite list | List known suites (global + project-local) |
| ben suite show <name> | Show suite spec details |
| ben registry push <run-id> | Push a run to the shared registry |
| ben registry pull | Pull community baselines for a suite |
| ben config path / paths | Inspect ben config file precedence |
| ben spec | Emit machine-readable capability manifest |

| Adapter | How ben runs the candidate |
|---|---|
| cli | Spawns a shell command; captures stdout/stderr, exit code, latency |
| llm | Calls an LLM via API; captures tokens, cost, output |
| eva | Wraps eva run as a ben candidate for standard eval suites |
| binary | Any ben-adapter-* binary on PATH; communicates via stdio JSON protocol |

| Metric | Source | Description |
|---|---|---|
| latency_ms | built-in | Wall-clock execution time in milliseconds |
| exit_code | built-in | Process exit code (cli adapter) |
| output_size | built-in | Byte length of stdout output |
| tokens | llm | Total tokens consumed (prompt + completion) |
| cost_usd | llm | Estimated cost in USD |
| quality_score | plugin | 0 to 1 relevance score; requires llm_judge plugin |

| Scorer | Description |
|---|---|
| single:<metric> | Rank by one metric; lowest wins for cost/latency |
| weighted:<m>=<w>,... | Weighted sum across metrics; highest score wins |
| raw | No ranking; emit raw metrics only; winner=null |

Examples:
--scorer single:latency_ms
--scorer weighted:latency_ms=0.3,cost_usd=0.2,quality_score=0.5
--scorer raw
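
The three --scorer forms share one grammar: a strategy name, then an optional argument after the colon. The sketch below parses that grammar as documented here. The Scorer type and parseScorer function are hypothetical names for illustration, not ben's internal API, and the validation rules (for example, rejecting an argument after raw) are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Scorer is a hypothetical in-memory form of a --scorer value.
type Scorer struct {
	Strategy string             // "single", "weighted", or "raw"
	Metric   string             // set only for single:<metric>
	Weights  map[string]float64 // set only for weighted:<m>=<w>,...
}

// parseScorer splits a spec such as "weighted:latency_ms=0.3,quality_score=0.7".
func parseScorer(spec string) (Scorer, error) {
	strategy, rest, hasArg := strings.Cut(spec, ":")
	switch strategy {
	case "raw":
		// Assumption: raw accepts no argument.
		if hasArg {
			return Scorer{}, fmt.Errorf("raw takes no argument: %q", spec)
		}
		return Scorer{Strategy: "raw"}, nil
	case "single":
		if rest == "" {
			return Scorer{}, fmt.Errorf("single needs a metric: %q", spec)
		}
		return Scorer{Strategy: "single", Metric: rest}, nil
	case "weighted":
		weights := map[string]float64{}
		for _, pair := range strings.Split(rest, ",") {
			name, raw, ok := strings.Cut(pair, "=")
			if !ok || name == "" {
				return Scorer{}, fmt.Errorf("bad weight %q in %q", pair, spec)
			}
			w, err := strconv.ParseFloat(raw, 64)
			if err != nil {
				return Scorer{}, fmt.Errorf("bad weight value %q: %w", raw, err)
			}
			weights[name] = w
		}
		return Scorer{Strategy: "weighted", Weights: weights}, nil
	default:
		return Scorer{}, fmt.Errorf("unknown scorer strategy %q", strategy)
	}
}

func main() {
	specs := []string{
		"single:latency_ms",
		"weighted:latency_ms=0.3,cost_usd=0.2,quality_score=0.5",
		"raw",
	}
	for _, spec := range specs {
		s, err := parseScorer(spec)
		fmt.Printf("%s -> %+v (err: %v)\n", spec, s, err)
	}
}
```

How ben normalizes metrics before taking the weighted sum is not described in this README, so the sketch stops at parsing.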
name: codebase-indexing
description: Compare xray vs grep for initial codebase orientation
version: 1
task:
  prompt: "Find all HTTP handler functions in this repo"
  input:
    repo: ./testdata/sample-repo
candidates:
  - name: xray
    adapter: cli
    cmd: "xray explore --search {{input.prompt}} --path {{input.repo}}"
  - name: grep
    adapter: cli
    cmd: "grep -r 'func.*Handler' {{input.repo}}"
metrics:
  - latency_ms
  - quality_score
scorer:
  strategy: weighted
  weights:
    latency_ms: 0.3
    quality_score: 0.7

Binary plugins are auto-discovered as ben-adapter-<name> or ben-reporter-<name> on PATH.
Ben communicates via newline-delimited JSON over stdio: it writes a request JSON object to the
plugin's stdin and reads the response from stdout. Adapter plugins receive
{"action":"run","candidate":{...},"input":{...}} and must respond with
{"metrics":{...},"output":"..."}. Reporter plugins receive {"run":{...}} and write
formatted output to stdout. Naming convention: use the adapter/reporter name as the suffix,
e.g. ben-adapter-docker, ben-reporter-markdown.
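
As a rough illustration of the adapter side of this protocol, here is a minimal sketch of a plugin that could be built as ben-adapter-echo (a hypothetical name). It reads one JSON request per line and writes one JSON response per line. The input_bytes metric, the numeric type of metric values, and the choice to keep candidate and input as raw JSON are all assumptions, since their inner fields are not specified above.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// request mirrors the documented adapter request. candidate and input stay
// as raw JSON because their inner fields are not specified in this README.
type request struct {
	Action    string          `json:"action"`
	Candidate json.RawMessage `json:"candidate"`
	Input     json.RawMessage `json:"input"`
}

// response mirrors the documented adapter response.
// Assumption: metric values are numeric.
type response struct {
	Metrics map[string]float64 `json:"metrics"`
	Output  string             `json:"output"`
}

// handle answers one request line with one response document.
func handle(line []byte) ([]byte, error) {
	var req request
	if err := json.Unmarshal(line, &req); err != nil {
		return nil, fmt.Errorf("decode request: %w", err)
	}
	if req.Action != "run" {
		return nil, fmt.Errorf("unsupported action %q", req.Action)
	}
	// Placeholder work: echo the input back and report its size under a
	// hypothetical metric name.
	return json.Marshal(response{
		Metrics: map[string]float64{"input_bytes": float64(len(req.Input))},
		Output:  string(req.Input),
	})
}

// serve reads newline-delimited requests from r and writes responses to w.
func serve(r io.Reader, w io.Writer) error {
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 1<<20) // allow request lines up to 1 MiB
	for sc.Scan() {
		if len(sc.Bytes()) == 0 {
			continue // skip blank lines
		}
		out, err := handle(sc.Bytes())
		if err != nil {
			return err
		}
		if _, err := fmt.Fprintf(w, "%s\n", out); err != nil {
			return err
		}
	}
	return sc.Err()
}

func main() {
	if err := serve(os.Stdin, os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, "ben-adapter-echo:", err)
		os.Exit(1)
	}
}
```

Errors go to stderr so that stdout carries only protocol messages. How ben expects a plugin to report a failed run is not covered here, so the non-zero exit is a placeholder.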
ben is designed for programmatic use mid-task:
# Machine-readable output; all logs to stderr
ben run --suite my-suite --format json --quiet
# Parse winner directly
ben run ... --format json | jq .winner

- --format json: emits valid JSON to stdout; diagnostics to stderr only
- --quiet: suppresses stderr; clean for pipelines
- Exit 0: successful run (candidate failures are in the result, not exit code)
- Exit 1: ben error (bad config, missing adapter, etc.)
- winner field: primary decision signal for agents; null when scorer is raw
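
A caller that prefers Go over jq could read the winner field along these lines. This is a sketch under assumptions: only the top-level winner key and its null case are taken from this README, winnerOf is a hypothetical helper, and the sample documents in main are invented for illustration (the real shape of winner may differ, which is why it is kept as raw JSON).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// result picks out only the documented winner field; everything else in
// ben's JSON output is ignored by this sketch.
type result struct {
	Winner json.RawMessage `json:"winner"`
}

// winnerOf returns the raw JSON of the winner field and reports whether a
// winner was chosen. A missing field or JSON null both count as "no winner".
func winnerOf(doc []byte) (string, bool, error) {
	var r result
	if err := json.Unmarshal(doc, &r); err != nil {
		return "", false, fmt.Errorf("decode ben output: %w", err)
	}
	if len(r.Winner) == 0 || bytes.Equal(r.Winner, []byte("null")) {
		return "", false, nil
	}
	return string(r.Winner), true, nil
}

func main() {
	// Hypothetical sample documents, not real ben output.
	samples := []string{
		`{"winner":"xray"}`,
		`{"winner":null}`,
	}
	for _, doc := range samples {
		w, ok, err := winnerOf([]byte(doc))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if !ok {
			fmt.Println("no winner (scorer may be raw)")
			continue
		}
		fmt.Println("winner:", w)
	}
}
```

Because exit code 0 covers runs where individual candidates failed, a caller should check the exit code for ben errors first and then inspect the result document.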
Global (cross-project):
~/.local/share/ben/
  runs/      # persisted run results
  registry/  # local registry index + cache
  suites/    # global suite specs
Project-local (detected automatically when .ben/ exists in cwd):
.ben/
  suites/    # project-scoped suite specs
  runs/      # project-scoped run results
Ben prefers project-local storage when .ben/ is present; falls back to global.
Ben loads config from three layers, highest precedence first:
| Layer | Path |
|---|---|
| project | ./.ben/config.yaml |
| user | $XDG_CONFIG_HOME/ben/config.yaml |
| system | /etc/ben/config.yaml |
Run ben config paths --format json to see the active chain. The
-c <path> flag overrides the discovery chain entirely (kit semantics:
-c wins over any previously discovered file).
The project-layer path is caller-context-aware via the KIT_INVOKED_AS
env var (exported by callers like tlc or hop before exec'ing ben):

| KIT_INVOKED_AS | Project config path |
|---|---|
| (unset/standalone) | ./.ben/config.yaml |
| hop | ./.hop/ben.yaml |
| tlc | ./.tlc/ben.yaml |

Only one project-layer entry wins per invocation (kit constraint).
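
The lookup order described above can be sketched as a small function. This is an illustration, not kit's actual resolver: projectConfigPath and configChain are hypothetical names, falling back to the standalone path for an unrecognized KIT_INVOKED_AS value is an assumption, and so is using the XDG default of $HOME/.config when XDG_CONFIG_HOME is unset.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// projectConfigPath maps a KIT_INVOKED_AS value to the documented
// project-layer path. Assumption: unknown callers use the standalone path.
func projectConfigPath(invokedAs string) string {
	switch invokedAs {
	case "hop":
		return "./.hop/ben.yaml"
	case "tlc":
		return "./.tlc/ben.yaml"
	default:
		return "./.ben/config.yaml"
	}
}

// configChain lists candidate config files, highest precedence first:
// project, then user, then system.
func configChain(invokedAs, xdgConfigHome string) []string {
	return []string{
		projectConfigPath(invokedAs),
		filepath.Join(xdgConfigHome, "ben", "config.yaml"),
		"/etc/ben/config.yaml",
	}
}

func main() {
	xdg := os.Getenv("XDG_CONFIG_HOME")
	if xdg == "" {
		// XDG default when the variable is unset; whether ben applies the
		// same fallback is an assumption.
		home, _ := os.UserHomeDir()
		xdg = filepath.Join(home, ".config")
	}
	for _, p := range configChain(os.Getenv("KIT_INVOKED_AS"), xdg) {
		fmt.Println(p)
	}
}
```

For the authoritative chain on a given machine, ben config paths --format json remains the source of truth.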
Release pipeline mirrors the hop-top/.github reusable workflows:
- release-please.yml watches main, opens a standing release PR that bumps the version + assembles the changelog.
- Merging that PR cuts a ben/v<version> tag.
- The tag push fires publish.yml (Go module mirror to hop-top/ben) and goreleaser-on-tag.yml (cross-platform binaries + Homebrew tap + Scoop bucket entries) in parallel.

Prerelease channel is seeded at 0.2.0-alpha.0. See
.github/RELEASE-BOOTSTRAP.md for
the manual web-side steps (mirror-repo creation, GitHub App
installation, org secrets) required before the first cut.
compile: version "go1.26.1" does not match go tool version "go1.26.2"
Cause: a stale GOROOT exported from an earlier mise shell. Quick
workaround: run env -u GOROOT go test ./... to unset GOROOT for that one command. Long-term fix: run mise use go@<latest> and respawn the shell.
See docs/contributing.md for interfaces, how to add adapters/metrics/scorers/reporters, and the PR checklist.