Framework for evaluating and improving agents
Updated Jun 27, 2026 - Python
A harness optimized for smaller LLMs
Official Implementation of "CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion"
Strands-based agents and harnesses for agentic benchmarks.
Spoox CLI - Terminal Agent - SPlit lOOp eXand agent
Fast, Multi-Cloud Sandbox Engine for AI Agents
Trajectories for running OpenHands on Terminal Bench
Codex-style REPL for terminal-agent models trained with camel-ai TerminalToolkit / terminal-bench terminus-2 protocols. Built to drive HansBug/OpenClaw-RL checkpoints.
Multi-agent reasoning MCP server for Claude Code. Spawns parallel research agents to find knowledge LLMs don't have. +23.1% on Terminal Bench 2.0 SWE tasks.
Solving the amnesiac problem for LLM agents. Research series on agents that compound knowledge across sessions — first measurement: +4.6 pp accuracy lift on Terminal-Bench 2.1 with an open-weight executor and a single failure-derived skill file.
Multi-layer audit framework for agent benchmark integrity
Reproducible Terminal-Bench task that evaluates a Bash script for parsing log files.
Deterministic Linux log analysis benchmark task for AI-agent evaluation using Docker and Terminal-Bench
Full self-improving long-horizon LLM agent with strategy memory, failure analysis, Grok teacher labels, and QLoRA student distillation.
Company-neutral workflow kit for creating, reviewing, calibrating, and packaging Harbor / Terminal-Bench tasks with LLM agents.
Autonomous coding agent for Terminal-Bench 2.0 — 61.6% ± 1.9 on the official leaderboard with GPT-5.1 Codex Mini.
Central repository for Project Terminus task submissions. These environments and Oracle solutions are designed to evaluate state-of-the-art AI agents (like GPT-5.2 and Claude Opus 4.6) on complex, multi-step engineering challenges within a sandboxed terminal.
Terminal-Bench 3 task where agents must reverse-engineer a sealed legacy CUA replay engine using only a query-limited black-box simulator. Reward 0.0 across Codex gpt-5.5 and Claude Opus 4.8 /run and /cheat trials.
Transparent benchmark results for Vorflux — SWE-bench Verified (91%) & Terminal Bench 2 (86%)
⚓ Harbor benchmark adapter for running AgentPlane as a reproducible coding-agent harness with evidence artifacts.