
feat: Terminal-Bench 2.0 harness, Pi agent wrapper, and vox bench subcommand #2

Draft · Copilot wants to merge 3 commits into main from copilot/add-terminal-bench-2-0-benchmark


Conversation


Copilot AI commented Mar 13, 2026

Adds a Terminal-Bench 2.0-inspired evaluation harness to Vox so the agent can be benchmarked against realistic terminal tasks, plus a Pi coding agent wrapper following the existing agent pattern.

Benchmark harness (src/vox/bench/)

  • harness.py — Task, BenchResult, and Harness dataclasses. Harness.run_task() pipes each task description through Vox's translation engine, executes the resulting shell command, then verifies correctness via a verify_cmd (exit 0 = pass). Optional setup_cmd/teardown_cmd per task; dry_run mode translates without executing.
  • tasks.py — 27 built-in tasks across eight categories (file, text, process, system, archive, network, git, shell) covering common real-world terminal workflows.
from vox.bench.harness import Harness
from vox.bench.tasks import BUILTIN_TASKS
from vox.config import VoxConfig

harness = Harness(cfg=VoxConfig())
results = harness.run_all(BUILTIN_TASKS)
harness.print_summary(results)
# Terminal-Bench Results table with per-task pass/fail, timing, and overall score %
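
For reference, custom tasks can be defined alongside the built-ins. A minimal sketch, assuming Task takes the fields named above as keyword arguments (the exact constructor signature and the BenchResult field names are assumptions, not confirmed API):

from vox.bench.harness import Harness, Task
from vox.config import VoxConfig

# Hypothetical custom task; field names follow the PR description above.
task = Task(
    id="file-empty-create",                # unique task ID (hypothetical)
    category="file",
    description="create an empty file named report.txt",
    setup_cmd="rm -f report.txt",          # optional: reset state before the run
    verify_cmd="test -f report.txt",       # exit 0 = pass
    teardown_cmd="rm -f report.txt",       # optional: clean up afterwards
)

result = Harness(cfg=VoxConfig()).run_task(task)
print(result.passed, result.duration)      # BenchResult fields assumed, not confirmed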

vox bench subcommand (src/vox/cli.py)

vox bench                        # run all 27 tasks
vox bench --list                 # table of available tasks
vox bench --category file        # filter by category
vox bench --task file-touch-create  # run one task by ID
vox bench --dry-run              # translate only, skip exec/verify
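
For orientation, the subcommand is essentially a thin dispatch onto the harness. Below is a hypothetical sketch of that wiring, assuming an argparse-based CLI; the real src/vox/cli.py may be structured differently, and the flag names simply mirror the examples above:

import argparse

from vox.bench.harness import Harness
from vox.bench.tasks import BUILTIN_TASKS
from vox.config import VoxConfig

def cmd_bench(args: argparse.Namespace) -> None:
    tasks = BUILTIN_TASKS
    if args.category:                      # vox bench --category file
        tasks = [t for t in tasks if t.category == args.category]
    if args.task:                          # vox bench --task file-touch-create
        tasks = [t for t in tasks if t.id == args.task]
    if args.list:                          # vox bench --list
        for t in tasks:
            print(f"{t.id:<28} {t.category}")
        return
    # Where dry_run lives (Harness ctor vs. a run flag) is an assumption.
    harness = Harness(cfg=VoxConfig(), dry_run=args.dry_run)
    results = harness.run_all(tasks)
    harness.print_summary(results)

parser = argparse.ArgumentParser(prog="vox")
sub = parser.add_subparsers()
bench = sub.add_parser("bench")
bench.add_argument("--list", action="store_true")
bench.add_argument("--category")
bench.add_argument("--task")
bench.add_argument("--dry-run", dest="dry_run", action="store_true")
bench.set_defaults(func=cmd_bench)

args = parser.parse_args()
args.func(args)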

Pi agent (src/vox/agents/pi.py)

Minimal wrapper for @mariozechner/pi-coding-agent following the existing BaseAgent pattern. Registered in ALL_AGENTS with a routing hint for lightweight terminal tasks.

# Invokes: pi --print <task>
PiAgent.run("refactor the auth module")
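
For context, the wrapper reduces to a subprocess call around pi's print mode. A minimal sketch, assuming BaseAgent lives at vox.agents.base and exposes a run(task) method (both assumptions; only the pi --print invocation is confirmed above):

import subprocess

from vox.agents.base import BaseAgent  # module path assumed from the PR layout

class PiAgent(BaseAgent):
    name = "pi"
    hint = "lightweight terminal tasks"  # routing hint mentioned in the PR

    def run(self, task: str) -> str:
        # Invokes: pi --print <task>  (print mode, per the Pi README)
        proc = subprocess.run(
            ["pi", "--print", task],
            capture_output=True,
            text=True,
            check=True,
        )
        return proc.stdout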
Original prompt

https://huggingface.co/papers/2601.11868

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Published on Jan 16

Submitted by taesiri on Jan 23

Authors: Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, +63 authors
Abstract

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/ .
I want you to go through this link if you can; otherwise I can provide you with some context on the benchmark. I want us to build robust terminal knowledge and terminology that can address and automate all the terminal tasks, from long-running tasks to short ones. I think this benchmark is ideal in our particular case, and we have also built a project where we trained our own model. I want you to test our entire harness using any Copilot models, or help me really understand and expand on these benchmarks. I also want to start creating my own Pi agent:

https://github.com/badlogic/pi-mono/tree/main/packages/coding-agent

Pi is a minimal terminal coding harness. Adapt pi to your workflows, not the other way around, without having to fork and modify pi internals. Extend it with TypeScript Extensions, Skills, Prompt Templates, and Themes. Put your extensions, skills, prompt templates, and themes in Pi Packages and share them with others via npm or git.

Pi ships with powerful defaults but skips features like sub agents and plan mode. Instead, you can ask pi to build what you want or install a third party pi package that matches your workflow.

Pi runs in four modes: interactive, print or JSON, RPC for process integration, and an SDK for embedding in your own apps. See openclaw/openclaw for a real-world SDK integration.

Table of Contents

Quick Start
Providers & Models
Interactive Mode
Editor
Commands
Keyboard Shortcuts
Message Queue
Sessions
Branching
Compaction
Settings
Context Files
Customization
Prompt Templates
Skills
Extensions
Themes
Pi Packages
Programmatic Usage
Philosophy
CLI Reference
Quick Start

npm install -g @mariozechner/pi-coding-agent
Authenticate with an API key:

export ANTHROPIC_API_KEY=sk-ant-...
pi
Or use your existing subscription:

pi
/login # Then select provider
Then just talk to pi. By default, pi gives the model four tools: read, write, edit, and bash. The model uses these to fulfill your requests. Add capabilities via skills, prompt templates, extensions, or pi packages.

Platform notes: Windows | Termux (Android) | tmux | Terminal setup | Shell aliase...



Copilot AI and others added 2 commits on March 13, 2026 at 20:46:

Co-authored-by: aryateja2106 <124627951+aryateja2106@users.noreply.github.com>

…x docstring categories
Co-authored-by: aryateja2106 <124627951+aryateja2106@users.noreply.github.com>

Copilot AI changed the title from "[WIP] Add Terminal-Bench 2.0 for evaluating AI agents" to "feat: Terminal-Bench 2.0 harness, Pi agent wrapper, and vox bench subcommand" on March 13, 2026.
Copilot AI requested a review from aryateja2106 on March 13, 2026 at 20:48.
