feat: Terminal-Bench 2.0 harness, Pi agent wrapper, and vox bench subcommand #2

Draft · Copilot wants to merge 3 commits
Conversation
…x docstring categories Co-authored-by: aryateja2106 <124627951+aryateja2106@users.noreply.github.com>
Copilot (AI) changed the title from "[WIP] Add Terminal-Bench 2.0 for evaluating AI agents" to "feat: Terminal-Bench 2.0 harness, Pi agent wrapper, and vox bench subcommand" on Mar 13, 2026

aryateja2106 approved these changes on Mar 14, 2026
Adds a Terminal-Bench 2.0-inspired evaluation harness to Vox so the agent can be benchmarked against realistic terminal tasks, plus a Pi coding agent wrapper following the existing agent pattern.
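The translate → execute → verify loop just described can be sketched as follows. The class and method names (`Task`, `BenchResult`, `Harness`, `run_task`, `verify_cmd`, `setup_cmd`, `teardown_cmd`, `dry_run`) follow the PR description, but the concrete field layout and the `translate` callable are assumptions, not the PR's actual code:

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Task:
    name: str
    category: str
    description: str                  # natural-language prompt fed to Vox
    verify_cmd: str                   # exit code 0 = pass
    setup_cmd: Optional[str] = None
    teardown_cmd: Optional[str] = None


@dataclass
class BenchResult:
    task: Task
    command: str                      # shell command the translation produced
    passed: bool


@dataclass
class Harness:
    translate: Callable[[str], str]   # hook into Vox's translation engine (assumed)
    dry_run: bool = False
    results: list = field(default_factory=list)

    def _sh(self, cmd: str) -> int:
        # Run a shell command and return its exit code.
        return subprocess.run(cmd, shell=True).returncode

    def run_task(self, task: Task) -> BenchResult:
        if task.setup_cmd:
            self._sh(task.setup_cmd)
        command = self.translate(task.description)
        passed = False
        if not self.dry_run:
            self._sh(command)                        # execute translated command
            passed = self._sh(task.verify_cmd) == 0  # verify: exit 0 = pass
        if task.teardown_cmd:
            self._sh(task.teardown_cmd)
        result = BenchResult(task, command, passed)
        self.results.append(result)
        return result
```

With a stub translator in `dry_run` mode, `run_task` records the translated command without executing anything, which matches the dry-run behavior described in the PR.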
Benchmark harness (`src/vox/bench/`)

- `harness.py` — `Task`, `BenchResult`, and `Harness` dataclasses. `Harness.run_task()` pipes each task description through Vox's translation engine, executes the resulting shell command, then verifies correctness via a `verify_cmd` (exit 0 = pass). Optional `setup_cmd`/`teardown_cmd` per task. `dry_run` mode translates without executing.
- `tasks.py` — 27 built-in tasks across 8 categories (file, text, process, system, archive, network, git, shell) covering common real-world terminal workflows.
- `vox bench` subcommand (`src/vox/cli.py`)

Pi agent (`src/vox/agents/pi.py`)

- Minimal wrapper for `@mariozechner/pi-coding-agent` following the existing `BaseAgent` pattern. Registered in `ALL_AGENTS` with a routing hint for lightweight terminal tasks.

Original prompt
https://huggingface.co/papers/2601.11868

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Published on Jan 16 · Submitted by taesiri on Jan 23
Authors: Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, +63 authors
Abstract
Terminal-Bench 2.0 presents a challenging benchmark with 89 terminal-based tasks to evaluate AI agents' capabilities in real-world scenarios.
AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/ .
Community

librarian-bot (Jan 23): The following papers were recommended by the Semantic Scholar API as similar to this paper:

- AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts (2026)
- Real-Time Procedural Learning From Experience for AI Agents (2025)
- Benchmarking LLM Agents for Wealth-Management Workflows (2025)
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development (2026)
- Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent (2025)
- SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios (2025)
- The Hierarchy of Agentic Capabilities: Evaluating Frontier Models on Realistic RL Environments (2026)
Datasets citing this paper: zai-org/terminal-bench-2-verified (updated 14 days ago)

I want you to go through this link if you can; otherwise I can provide you with some context on the benchmark. I want us to build robust terminal knowledge and terminology that can address and automate all kinds of terminal tasks, from long-running tasks to short-running ones. I think this benchmark is ideal in our particular case, and we have also built a project where we trained our own model. I want you to test our entire harness using any Copilot models, or help me really understand and expand on these benchmarks. I want to start creating my own Pi agent:
https://github.com/badlogic/pi-mono/tree/main/packages/coding-agent
Pi is a minimal terminal coding harness. Adapt pi to your workflows, not the other way around, without having to fork and modify pi internals. Extend it with TypeScript Extensions, Skills, Prompt Templates, and Themes. Put your extensions, skills, prompt templates, and themes in Pi Packages and share them with others via npm or git.
Pi ships with powerful defaults but skips features like sub-agents and plan mode. Instead, you can ask pi to build what you want, or install a third-party pi package that matches your workflow.
Pi runs in four modes: interactive, print or JSON, RPC for process integration, and an SDK for embedding in your own apps. See openclaw/openclaw for a real-world SDK integration.
Table of Contents
Quick Start
Providers & Models
Interactive Mode
Editor
Commands
Keyboard Shortcuts
Message Queue
Sessions
Branching
Compaction
Settings
Context Files
Customization
Prompt Templates
Skills
Extensions
Themes
Pi Packages
Programmatic Usage
Philosophy
CLI Reference
Quick Start
```
npm install -g @mariozechner/pi-coding-agent
```

Authenticate with an API key:

```
export ANTHROPIC_API_KEY=sk-ant-...
pi
```

Or use your existing subscription:

```
pi
/login   # Then select provider
```
Then just talk to pi. By default, pi gives the model four tools: read, write, edit, and bash. The model uses these to fulfill your requests. Add capabilities via skills, prompt templates, extensions, or pi packages.
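The four default tools named above (read, write, edit, bash) can be illustrated with a minimal sketch. This shows the general pattern of such a tool set, not pi's actual TypeScript implementation; all helper names here are invented for illustration:

```python
import subprocess
from pathlib import Path

# Illustrative stand-ins for the four default tools a coding agent like pi
# exposes to the model. Not pi's source; names are hypothetical.

def tool_read(path: str) -> str:
    # Return the full text of a file.
    return Path(path).read_text()

def tool_write(path: str, content: str) -> str:
    # Create or overwrite a file with the given content.
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def tool_edit(path: str, old: str, new: str) -> str:
    # Replace the first occurrence of `old` with `new` in a file.
    p = Path(path)
    text = p.read_text()
    if old not in text:
        return "edit failed: old text not found"
    p.write_text(text.replace(old, new, 1))
    return "edit applied"

def tool_bash(cmd: str) -> str:
    # Run a shell command and return its combined output.
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return out.stdout + out.stderr

TOOLS = {"read": tool_read, "write": tool_write, "edit": tool_edit, "bash": tool_bash}
```

An agent loop would then dispatch each model-requested tool call through a table like `TOOLS` and feed the string result back into the conversation.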
Platform notes: Windows | Termux (Android) | tmux | Terminal setup | Shell aliase...