Terminal-Bench

What it is

Terminal-Bench is a benchmark for evaluating AI agents' ability to use a terminal. It focuses on tasks that require interacting with a real terminal environment, such as installing software, debugging system issues, and managing files.

What problem it solves

Measures whether AI agents can effectively operate in a terminal environment, a critical capability for autonomous system administration and DevOps tasks.

Where it fits in the stack

Benchmarking. Used to evaluate AI agents on terminal-based tasks that go beyond code generation.

Typical use cases

Evaluating AI agents on terminal interaction tasks (installation, debugging, file management)
Comparing agent frameworks on their ability to operate in real system environments
Assessing readiness of AI agents for autonomous system administration

Strengths

Tests practical, real-world terminal skills rather than abstract coding problems
Covers a range of system administration tasks
Complements code-generation benchmarks like SWE-bench

Limitations

Requires a real terminal environment for evaluation, adding setup complexity
Results may vary depending on the operating system and environment configuration
A relatively new benchmark with a smaller community than established alternatives

When to use it

When evaluating AI agents that need to operate autonomously in terminal environments
When assessing system administration or DevOps capabilities of AI agents

When not to use it

When evaluating pure code generation capabilities (use HumanEval or MBPP)
When you need a well-established benchmark with extensive published results

Evaluation Methodology

Terminal-Bench (TB-2) uses the Harbor framework to provide a consistent, containerized execution environment. Each task is defined by:

1. Scenario: A Docker image containing the initial system state.
2. Instruction: A natural language prompt describing the goal.
3. Validator: A script that checks for successful completion (e.g., verifying a service is running or a file exists with specific content).

Example Task Configuration

A task definition can be sketched as a YAML object along these lines (the field names are illustrative; consult the Harbor documentation for the exact schema):

task_id: "nginx-load-balancer-config"
category: "networking"
difficulty: "hard"
scenario_image: "harbor/ubuntu-22.04-dev"
instruction: "Configure Nginx as a load balancer for two backend servers running on ports 8081 and 8082."
verification:
  type: "bash_script"
  script: "curl -s localhost | grep 'Backend'"
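
The validator step comes down to running a check and reading its exit status. The Python sketch below illustrates that idea under a few assumptions: the check is a shell one-liner like the one in the example, and it runs on the host, whereas a real harness would presumably run it inside the task container. `TaskSpec` and `run_validator` are hypothetical names used for illustration and are not part of Harbor or Terminal-Bench.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Hypothetical container for the three parts of a task."""
    scenario_image: str  # Docker image holding the initial system state
    instruction: str     # natural language goal handed to the agent
    validator: str       # shell command; exit status 0 means success


def run_validator(script: str, timeout: float = 30.0) -> bool:
    """Run the check through the system shell and report whether it passed."""
    try:
        completed = subprocess.run(
            script, shell=True, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False  # a check that hangs is treated as a failure
    return completed.returncode == 0


# Values copied from the example task above.
task = TaskSpec(
    scenario_image="harbor/ubuntu-22.04-dev",
    instruction=(
        "Configure Nginx as a load balancer for two backend servers "
        "running on ports 8081 and 8082."
    ),
    validator="curl -s localhost | grep 'Backend'",
)
```

Calling `run_validator(task.validator)` returns `True` only when `grep` finds a match, because the exit status of a shell pipeline is that of its last command.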

Getting started

Terminal-Bench (TB-2) is typically run using the tb CLI tool and requires a Docker environment for sandboxed execution.

1. Installation

pip install terminal-bench
# or using uv
uv tool install terminal-bench
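
Since a run depends on both the `tb` CLI and Docker, a quick pre-flight check can save a failed first run. The sketch below only confirms that both executables are on `PATH`; it does not verify that the Docker daemon is running. The helper name `missing_tools` is hypothetical.

```python
import shutil


def missing_tools(required=("docker", "tb")):
    """Return the names from `required` that are not found on PATH."""
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("docker and tb are both available")
```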

2. Running a Task

To run a specific task and evaluate an agent:

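# List available tasks
# (subcommands can differ between releases; `tb --help` shows what is available)
tb list

# Run evaluation for a specific model on a task
# (confirm the exact flag spelling with `tb run --help`)
tb run --task-id "setup-nginx-server" --model "anthropic/claude-3-5-sonnet"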
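
To evaluate several tasks in sequence, the same command can be driven from a short script. This is a minimal sketch, assuming `tb run` accepts a task identifier flag and `--model` as in the example above. The flag spelling is passed as a parameter because it should be confirmed with `tb run --help`, and the helper names are hypothetical. The script defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def build_run_command(task_id, model, task_flag="--task-id"):
    """Assemble the argv for one `tb run` invocation (flag names assumed)."""
    return ["tb", "run", task_flag, task_id, "--model", model]


def run_tasks(task_ids, model, dry_run=True):
    """Run each task in turn; returns {task_id: exit code, or None if dry run}."""
    exit_codes = {}
    for task_id in task_ids:
        cmd = build_run_command(task_id, model)
        print(shlex.join(cmd))  # show exactly what would be executed
        if dry_run:
            exit_codes[task_id] = None
            continue
        exit_codes[task_id] = subprocess.run(cmd).returncode
    return exit_codes


if __name__ == "__main__":
    run_tasks(["setup-nginx-server"], "anthropic/claude-3-5-sonnet")
```

The exit code recorded here only says whether the CLI process exited cleanly. Whether it also reflects task success is not assumed, so the run's own output remains the source of truth for pass or fail.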

3. Harbor Framework Integration

For large-scale evaluations, Terminal-Bench 2.0 uses the Harbor framework:

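# Illustrative sketch: the class and method names below are not verified
# against the Harbor package; check its documentation before relying on them.
from harbor import HarborSandbox, TerminalBenchTask

with HarborSandbox() as sandbox:
    task = TerminalBenchTask("debug-c-memory-leak")
    result = sandbox.execute_agent(task, agent_config="config.yaml")
    print(f"Task Completed: {result.success}")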

Sources / references

Contribution Metadata

Last reviewed: 2026-06-01
Confidence: high