A laptop-runnable, deterministically-graded, Harbor-compatible benchmark of terminal bioinformatics tasks for tracking open-model progress.
Status: V1.1 complete. The repository ships the three Docker images (bioterm-base:v1, bioterm-variant:v1, bioterm-rnaseq:v1), the validation / build / oracle-run tooling, CI, authoring docs, and the scripts/new_task.sh scaffolder. After the first full Harbor k=3 sweep (5 models × 30 tasks) we removed 13 tasks that every tested model always solved, then ran two rounds of k=1 × 4-model difficulty probes on 15 new candidate tasks and kept 8 of them. The pack now has 25 tasks spread across 6 categories. Harbor runs and rationale live in docs/status.md and docs/harbor-runs.md.
A curated set of bioinformatics tasks (alignment, variant calling, RNA-seq DE, sequence search, QC, metadata glue, and end-to-end synthesis) that a coding agent must complete in a terminal environment. Grading is deterministic — bcftools isec, set comparison, numeric tolerance, or exact file diff. No LLM judges.
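To make "deterministic grading" concrete, here is a minimal sketch of two of the grader styles named above: numeric tolerance and set comparison. It is an illustration only, not the repository's grader code. The function names, tolerance values, and metric keys (`mean_depth`, `pct_mapped`) are hypothetical.

```python
import math


def grade_numeric(expected, observed, rel_tol=1e-6, abs_tol=1e-9):
    """Return 1 if both dicts have the same keys and close values, else 0.

    The tolerances are illustrative assumptions, not the benchmark's settings.
    """
    if expected.keys() != observed.keys():
        return 0
    close = all(
        math.isclose(observed[key], expected[key], rel_tol=rel_tol, abs_tol=abs_tol)
        for key in expected
    )
    return 1 if close else 0


def grade_set(expected_records, observed_records):
    """Return 1 if both collections hold the same records, ignoring order."""
    return 1 if set(expected_records) == set(observed_records) else 0


# Hypothetical gold values and agent output.
gold = {"mean_depth": 31.25, "pct_mapped": 98.7}
answer = {"mean_depth": 31.2500001, "pct_mapped": 98.7}
print(grade_numeric(gold, answer))
print(grade_set(["chr1\t100\tA\tG", "chr1\t250\tC\tT"],
                ["chr1\t250\tC\tT", "chr1\t100\tA\tG"]))
```

Both functions return a binary reward and depend only on their inputs, so repeated grading of the same output always yields the same score.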
The 25 tasks currently shipping (see docs/status.md for the full roadmap):
| task | category | grader | image |
|---|---|---|---|
| flagstat-report | alignment | Pattern A | bioterm-base |
| bam-depth-histogram | alignment | Pattern C | bioterm-base |
| sam-to-cram-roundtrip | alignment | Pattern A | bioterm-base |
| bam-mark-duplicates | alignment | Pattern C | bioterm-base |
| insert-size-stats | alignment | Pattern C | bioterm-base |
| bam-clipping-stats | alignment | Pattern C | bioterm-base |
| gtf-to-bed | metadata | Pattern A | bioterm-base |
| vcf-to-tsv | metadata | Pattern A | bioterm-base |
| fasta-rename-headers | metadata | Pattern A | bioterm-base |
| fasta-longest-orf | metadata | Pattern A | bioterm-base |
| gtf-longest-transcript-cds | metadata | Pattern A | bioterm-base |
| fasta-6frame-codon-counts | metadata | Pattern A | bioterm-base |
| fastqc-quality-report | qc | Pattern C | bioterm-base |
| seqkit-fasta-stats | qc | Pattern A | bioterm-base |
| fastq-per-base-quality | qc | Pattern C | bioterm-base |
| vcf-split-by-type | variant | Pattern D | bioterm-base |
| vcf-allele-frequency | variant | Pattern C | bioterm-base |
| vcf-normalize | variant | Pattern D | bioterm-base |
| vcf-trio-mendelian | variant | Pattern C | bioterm-base |
| vcf-region-variant-density | variant | Pattern C | bioterm-base |
| blast-best-hit | search | Pattern A | bioterm-base |
| mafft-msa-identity | search | Pattern C | bioterm-base |
| kmer-jaccard-matrix | search | Pattern C | bioterm-base |
| tpm-normalize | rnaseq | Pattern C | bioterm-rnaseq |
| rnaseq-tpm-top-biotype | rnaseq | Pattern C | bioterm-rnaseq |
V1 holds three commitments that override everything else:
- Deterministic grading only. If a task's grader cannot be written in ~30 lines of bash + Python, it is not in V1.
- Laptop-runnable. Full evaluation of one model against all tasks completes in under 5 hours on a 36 GB Apple Silicon MacBook. No mammalian WGS, no GPUs, no 100 GB references.
- Harbor-compatible. BioTerm-Bench is a dataset, not a harness. Orchestration, logging, agent integration, and parallelism come from Harbor.
bioterm-bench/
├── docker/
│ ├── bioterm-base/ # HTS stack, QC/search, R/Python, references
│ ├── bioterm-rnaseq/ # STAR / salmon / featureCounts / HTSeq (stub)
│ └── bioterm-variant/ # freebayes / VEP / SnpEff / hap.py (stub)
├── tasks/ # one directory per task (see docs/task-authoring.md)
├── scripts/
│ ├── build_images.sh # buildx, multi-arch
│ ├── new_task.sh # scaffold a new task directory
│ ├── validate_task.py # task-structure linter
│ ├── run_oracle_all.sh # oracles must all hit reward=1
│ └── refresh_db_golds.sh # regenerate DB golds from live APIs
├── docs/
│ ├── task-authoring.md
│ ├── grading-patterns.md
│ └── live-network-pattern.md
├── dataset.json # Harbor dataset registration
└── .github/workflows/ci.yml
scripts/build_images.sh --load --only base
docker run --rm bioterm-base:v1 samtools --version | head -1

Multi-arch push to a registry:

GHCR_ORG=your-org scripts/build_images.sh --push

python scripts/validate_task.py

Exits 0 with "no tasks found" on a clean checkout.

bash scripts/run_oracle_all.sh

This is the gate before release: any task that doesn't hit reward=1 with its own oracle solution is not shippable.

NCBI_API_KEY=... scripts/refresh_db_golds.sh

See docs/live-network-pattern.md.
uv tool install harbor
# (or `pip install harbor`, whichever your environment prefers)
bash scripts/build_images.sh --only base # and --only variant / rnaseq
export OPENAI_API_KEY=sk-...
# single-task smoke test
harbor run --path tasks/gtf-to-bed \
--agent terminus-2 -m openai/gpt-5.4-mini --no-delete
# full current pack run
harbor run --path tasks/ \
--agent terminus-2 -m openai/gpt-5.4-mini \
--no-delete -n 4 -k 1

--no-delete is required when running with the prebuilt images in this repo: Harbor's default --delete flag runs docker compose down --rmi all after every trial, which would wipe the bioterm-*:v1 images from the local Docker engine.
For variance estimation, run each model three times (-k 3) and
average the per-task binary scores.
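The averaging step can be sketched as follows. This is a minimal illustration that assumes the trial results have already been collected as `(task, reward)` pairs with a reward of 0 or 1. The `summarize` helper and the sample rewards are hypothetical placeholders, not Harbor output and not measured results.

```python
from collections import defaultdict


def summarize(trials):
    """Average binary rewards per task, then average those task means.

    `trials` is an iterable of (task_name, reward) pairs, reward in {0, 1}.
    """
    rewards_by_task = defaultdict(list)
    for task, reward in trials:
        rewards_by_task[task].append(reward)
    per_task = {
        task: sum(rewards) / len(rewards)
        for task, rewards in rewards_by_task.items()
    }
    overall = sum(per_task.values()) / len(per_task)
    return per_task, overall


# Placeholder rewards for a k=3 run over two tasks (not real results).
trials = [
    ("gtf-to-bed", 1), ("gtf-to-bed", 1), ("gtf-to-bed", 0),
    ("vcf-normalize", 0), ("vcf-normalize", 0), ("vcf-normalize", 1),
]
per_task, overall = summarize(trials)
for task, score in sorted(per_task.items()):
    print(f"{task}: {score:.2f}")
print(f"overall: {overall:.2f}")
```

Averaging per task first gives every task equal weight in the overall score, even if some trials are missing for a task.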
Baseline receipts — cost, wall clock, and per-task pass/fail for the reference Harbor runs — live in docs/harbor-runs.md.
Read docs/task-authoring.md, then pick one of the five allowed grader shapes from docs/grading-patterns.md. If your task does not fit one of the five, redesign or cut it.
To jump-start the directory structure:
scripts/new_task.sh my-new-task \
--image bioterm-base:v1 \
--category alignment \
--difficulty easy

The scaffolded task is structurally valid but intentionally fails validate_task.py (TODO markers in solve.sh) until you fill in real logic.
Numbers land once V1 tasks are authored and models are run. Reporting format:
model,overall,alignment,variant,rnaseq,db,search,qc,metadata,synthesis,mean_wallclock_s,mean_tokens
anthropic/claude-opus-4-7,,,,,,,,,,,
openai/gpt-5,,,,,,,,,,,
deepseek-v3.2-think,,,,,,,,,,,
qwen3-coder-next,,,,,,,,,,,
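The header above has twelve columns, so every row needs eleven commas even when most cells are empty. A small sketch of emitting a well-formed row with Python's `csv` module follows. The `format_row` helper, the model name `example/model`, and the scores are hypothetical placeholders, not benchmark results.

```python
import csv
import io

# Column order copied from the reporting header.
COLUMNS = [
    "model", "overall", "alignment", "variant", "rnaseq", "db", "search",
    "qc", "metadata", "synthesis", "mean_wallclock_s", "mean_tokens",
]


def format_row(model, scores):
    """Render one CSV line; columns without a value are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="", lineterminator="\n"
    )
    writer.writerow({"model": model, **scores})
    return buffer.getvalue().rstrip("\n")


# Placeholder values for illustration only.
row = format_row("example/model", {"overall": 0.5, "qc": 1.0})
print(",".join(COLUMNS))
print(row)
```

Using `DictWriter` with a fixed `fieldnames` list keeps the column count constant and raises an error if a score is supplied for a column that is not in the header.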
Vojtech Bystry and Petr Simecek. Apache 2.0 licensed — see LICENSE.