terminal-bench

HansBug / oc-repl

Codex-style REPL for terminal-agent models trained with camel-ai TerminalToolkit / terminal-bench terminus-2 protocols. Built to drive HansBug/OpenClaw-RL checkpoints.

repl terminus openai-compatible camel-ai qwen3 terminal-agent terminal-bench openclaw-rl

Updated May 12, 2026
Python

sam-siavoshian / Symposium

Multi-agent reasoning MCP server for Claude Code. Spawns parallel research agents to find knowledge LLMs don't have. +23.1% on Terminal Bench 2.0 SWE tasks.

research mcp multi-agent developer-tools ai-agents claude training-data fine-tuning dpo nia llm model-context-protocol mcp-server claude-code terminal-bench research-engine

Updated May 23, 2026
TypeScript

mrazakhan / ContinuumAI

Solving the amnesiac problem for LLM agents. Research series on agents that compound knowledge across sessions. First measurement: +4.6 pp accuracy lift on Terminal-Bench 2.1 with an open-weight executor and a single failure-derived skill file.

reproducible-research ai-agents cost-optimization skill-learning openrouter agent-benchmark terminal-bench open-weight-llm

Updated Jun 11, 2026

tianyi-zhang-02 / monitorstress

Multi-layer audit framework for agent benchmark integrity

llm-agents llm-as-judge agent-evaluation swe-bench reward-hacking terminal-bench benchmark-auditing

Updated Jun 8, 2026
Python

ayush0824 / parse-log-stats

Reproducible Terminal-Bench task that evaluates a Bash script for parsing log files.

python llm agentic-ai terminal-bench

Updated Apr 6, 2026
Dockerfile

ruslandavidenko / terminal-bench-log-summary-audit

Deterministic Linux log analysis benchmark task for AI-agent evaluation using Docker and Terminal-Bench

python linux docker benchmarking automation pytest ai-agents terminal-bench

Updated May 26, 2026
Python

Ajeenckya5 / LLM_Agent_Distillation

Full self-improving long-horizon LLM agent with strategy memory, failure analysis, Grok teacher labels, and QLoRA student distillation.

llama agents knowledge-distillation llm chromadb qlora terminal-bench long-horizon-tasks self-improving-agents agentbench

Updated May 26, 2026
Python

lemegetonV / terminal-bench-task-workflow

Company-neutral workflow kit for creating, reviewing, calibrating, and packaging Harbor / Terminal-Bench tasks with LLM agents.

harbor llm-agents terminal-bench task-authoring benchmark-tasks

Updated May 18, 2026
Python

sady4850 / hookele-agent

Autonomous coding agent for Terminal-Bench 2.0 — 61.6% ± 1.9 on the official leaderboard with GPT-5.1 Codex Mini.

benchmark openai autonomous-agent ai-agent llm-agent coding-agent terminal-bench

Updated May 20, 2026
Python

mtepenner / snorkel-tasks

Central repository for Project Terminus task submissions. These environments and Oracle solutions are designed to evaluate state-of-the-art AI agents (like GPT-5.2 and Claude Opus 4.6) on complex, multi-step engineering challenges within a sandboxed terminal.

ai-agents llm-evaluation cli-automation terminal-bench project-terminus snorkel-ai

Updated May 22, 2026
C++

0xtigerclaw / cua-legacy-replay

Terminal-Bench 3 task where agents must reverse-engineer a sealed legacy CUA replay engine using only a query-limited black-box simulator. Reward 0.0 across Codex gpt-5.5 and Claude Opus 4.8 /run and /cheat trials.

black-box reverse-engineering cua agent-evaluation terminal-bench

Updated Jun 26, 2026
Python

piyushhhxyz / vorflux-swe-benchmarks

Transparent benchmark results for Vorflux — SWE-bench Verified (91%) & Terminal Bench 2 (86%)

evaluation benchmarks ai-agent swe-bench terminal-bench vorflux

Updated May 20, 2026
JavaScript

basilisk-labs / agentplane-harbor-adapter

⚓ Harbor benchmark adapter for running AgentPlane as a reproducible coding-agent harness with evidence artifacts.

python benchmark evaluation developer-tools harbor ai-agents coding-agents terminal-bench agentplane

Updated May 3, 2026
Python

Here are 20 public repositories matching this topic...

harbor-framework / harbor

itayinbarr / little-coder

LiberCoders / CLI-Gym

strands-labs / benchmark-harnesses

plaume8 / spoox

scitix / Agent-Sandbox

li-boxuan / Terminal-bench-OpenHands-trajectories

HansBug / oc-repl

sam-siavoshian / Symposium

mrazakhan / ContinuumAI

tianyi-zhang-02 / monitorstress

ayush0824 / parse-log-stats

ruslandavidenko / terminal-bench-log-summary-audit

Ajeenckya5 / LLM_Agent_Distillation

lemegetonV / terminal-bench-task-workflow

sady4850 / hookele-agent

mtepenner / snorkel-tasks

0xtigerclaw / cua-legacy-replay

piyushhhxyz / vorflux-swe-benchmarks

basilisk-labs / agentplane-harbor-adapter
