feat: Terminal-Bench 2.0 harness, Pi agent wrapper, and vox bench subcommand #2

Draft · Copilot wants to merge 3 commits
Conversation
…x docstring categories Co-authored-by: aryateja2106 <124627951+aryateja2106@users.noreply.github.com>
Copilot (AI) changed the title from "[WIP] Add Terminal-Bench 2.0 for evaluating AI agents" to "feat: Terminal-Bench 2.0 harness, Pi agent wrapper, and vox bench subcommand" on Mar 13, 2026

aryateja2106 approved these changes on Mar 14, 2026
Adds a Terminal-Bench 2.0-inspired evaluation harness to Vox so the agent can be benchmarked against realistic terminal tasks, plus a Pi coding agent wrapper following the existing agent pattern.
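The translate → execute → verify loop just described can be sketched as follows. The class and method names (`Task`, `BenchResult`, `Harness`, `run_task`, `verify_cmd`, `setup_cmd`, `teardown_cmd`, `dry_run`) follow the PR description, but the concrete field layout and the `translate` callable are assumptions, not the PR's actual code:

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Task:
    name: str
    category: str
    description: str                  # natural-language prompt fed to Vox
    verify_cmd: str                   # exit code 0 = pass
    setup_cmd: Optional[str] = None
    teardown_cmd: Optional[str] = None


@dataclass
class BenchResult:
    task: Task
    command: str                      # shell command the translation produced
    passed: bool


@dataclass
class Harness:
    translate: Callable[[str], str]   # hook into Vox's translation engine (assumed)
    dry_run: bool = False
    results: list = field(default_factory=list)

    def _sh(self, cmd: str) -> int:
        # Run a shell command and return its exit code.
        return subprocess.run(cmd, shell=True).returncode

    def run_task(self, task: Task) -> BenchResult:
        if task.setup_cmd:
            self._sh(task.setup_cmd)
        command = self.translate(task.description)
        passed = False
        if not self.dry_run:
            self._sh(command)                        # execute translated command
            passed = self._sh(task.verify_cmd) == 0  # verify: exit 0 = pass
        if task.teardown_cmd:
            self._sh(task.teardown_cmd)
        result = BenchResult(task, command, passed)
        self.results.append(result)
        return result
```

With a stub translator in `dry_run` mode, `run_task` records the translated command without executing anything, which matches the dry-run behavior described in the PR.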
Benchmark harness (`src/vox/bench/`)

- `harness.py` — `Task`, `BenchResult`, and `Harness` dataclasses. `Harness.run_task()` pipes each task description through Vox's translation engine, executes the resulting shell command, then verifies correctness via a `verify_cmd` (exit 0 = pass). Optional `setup_cmd`/`teardown_cmd` per task. `dry_run` mode translates without executing.
- `tasks.py` — 27 built-in tasks across 8 categories (file, text, process, system, archive, network, git, shell) covering common real-world terminal workflows.
- `vox bench` subcommand (`src/vox/cli.py`)

Pi agent (`src/vox/agents/pi.py`)

- Minimal wrapper for `@mariozechner/pi-coding-agent` following the existing `BaseAgent` pattern. Registered in `ALL_AGENTS` with a routing hint for lightweight terminal tasks.

Original prompt
https://huggingface.co/papers/2601.11868

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Published on Jan 16 · Submitted by taesiri on Jan 23
Authors: Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, +63 authors
Abstract
Terminal-Bench 2.0 presents a challenging benchmark with 89 terminal-based tasks to evaluate AI agents' capabilities in real-world scenarios.
AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/ .
Community

librarian-bot (Jan 23): The following papers were recommended by the Semantic Scholar API as similar to this paper:

- AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts (2026)
- Real-Time Procedural Learning From Experience for AI Agents (2025)
- Benchmarking LLM Agents for Wealth-Management Workflows (2025)
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development (2026)
- Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent (2025)
- SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios (2025)
- The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments (2026)
Datasets citing this paper: zai-org/terminal-bench-2-verified (updated 14 days ago)

I want you to go through this link if you can; otherwise I can provide you with some context on the benchmark. I want us to build robust terminal knowledge and terminology that can address and automate all kinds of terminal tasks, from long-running tasks to short-running ones. I think this benchmark is ideal in our particular case, and we have also built a project where we trained our own model. I want you to test our entire harness using any Copilot models, or help me really understand and expand on these benchmarks. I want to start creating my own Pi agent:
https://github.com/badlogic/pi-mono/tree/main/packages/coding-agent
Pi is a minimal terminal coding harness. Adapt pi to your workflows, not the other way around, without having to fork and modify pi internals. Extend it with TypeScript Extensions, Skills, Prompt Templates, and Themes. Put your extensions, skills, prompt templates, and themes in Pi Packages and share them with others via npm or git.
Pi ships with powerful defaults but skips features like sub-agents and plan mode. Instead, you can ask pi to build what you want, or install a third-party pi package that matches your workflow.
Pi runs in four modes: interactive, print or JSON, RPC for process integration, and an SDK for embedding in your own apps. See openclaw/openclaw for a real-world SDK integration.
Table of Contents
Quick Start
Providers & Models
Interactive Mode
Editor
Commands
Keyboard Shortcuts
Message Queue
Sessions
Branching
Compaction
Settings
Context Files
Customization
Prompt Templates
Skills
Extensions
Themes
Pi Packages
Programmatic Usage
Philosophy
CLI Reference
Quick Start
```
npm install -g @mariozechner/pi-coding-agent
```

Authenticate with an API key:

```
export ANTHROPIC_API_KEY=sk-ant-...
pi
```

Or use your existing subscription:

```
pi
/login   # Then select provider
```
Then just talk to pi. By default, pi gives the model four tools: read, write, edit, and bash. The model uses these to fulfill your requests. Add capabilities via skills, prompt templates, extensions, or pi packages.
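The four default tools named above (read, write, edit, bash) can be illustrated with a minimal sketch. This shows the general pattern of such a tool set, not pi's actual TypeScript implementation; all helper names here are invented for illustration:

```python
import subprocess
from pathlib import Path

# Illustrative stand-ins for the four default tools a coding agent like pi
# exposes to the model. Not pi's source; names are hypothetical.

def tool_read(path: str) -> str:
    # Return the full text of a file.
    return Path(path).read_text()

def tool_write(path: str, content: str) -> str:
    # Create or overwrite a file with the given content.
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def tool_edit(path: str, old: str, new: str) -> str:
    # Replace the first occurrence of `old` with `new` in a file.
    p = Path(path)
    text = p.read_text()
    if old not in text:
        return "edit failed: old text not found"
    p.write_text(text.replace(old, new, 1))
    return "edit applied"

def tool_bash(cmd: str) -> str:
    # Run a shell command and return its combined output.
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return out.stdout + out.stderr

TOOLS = {"read": tool_read, "write": tool_write, "edit": tool_edit, "bash": tool_bash}
```

An agent loop would then dispatch each model-requested tool call through a table like `TOOLS` and feed the string result back into the conversation.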
Platform notes: Windows | Termux (Android) | tmux | Terminal setup | Shell aliase...