SWE-bench¶
What it is¶
SWE-bench is a benchmark for evaluating LLMs on real-world software engineering tasks. Each task pairs an actual GitHub issue with the repository state before the fix, and the model must generate a patch that makes the tests associated with the fix pass without breaking tests that already passed.
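The pass criterion can be sketched as a small set check. Each task in the dataset lists tests that the fix should turn from failing to passing (`FAIL_TO_PASS`) and tests that must keep passing (`PASS_TO_PASS`). The function `is_resolved` and the test names below are illustrative assumptions, not the harness's own code.

```python
def is_resolved(passed_tests, fail_to_pass, pass_to_pass):
    """Return True when every required test passed after applying the patch.

    passed_tests: names of tests that passed in the patched repository.
    fail_to_pass: tests the fix is expected to turn from failing to passing.
    pass_to_pass: tests that passed before the fix and must not regress.
    """
    passed = set(passed_tests)
    return set(fail_to_pass) <= passed and set(pass_to_pass) <= passed


# Hypothetical test names, for illustration only.
fail_to_pass = ["test_parse_empty_header"]
pass_to_pass = ["test_parse_basic", "test_parse_unicode"]

# The patch fixes the target test and keeps the others green.
fixed = is_resolved(
    ["test_parse_empty_header", "test_parse_basic", "test_parse_unicode"],
    fail_to_pass,
    pass_to_pass,
)

# The patch fixes the target test but breaks a previously passing one.
regressed = is_resolved(
    ["test_parse_empty_header", "test_parse_basic"],
    fail_to_pass,
    pass_to_pass,
)

print(fixed, regressed)
```

A patch that fixes the issue but introduces a regression counts as unresolved under this check, which is why pass rates reflect more than whether the target bug disappeared.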
What problem it solves¶
Measures whether LLMs can perform practical software engineering work -- understanding codebases, diagnosing issues, and producing working fixes -- rather than just solving isolated coding puzzles.
Where it fits in the stack¶
Benchmarking. Used as a reference benchmark for evaluating real-world software engineering capabilities of LLMs and AI agents.
Typical use cases¶
- Evaluating AI coding agents on their ability to resolve real GitHub issues
- Comparing models on practical software engineering tasks
- Tracking progress of AI agents toward autonomous software development
- Deciding whether an agent is ready for repository-maintenance work that requires reading tests, editing files, and producing a valid patch
Strengths¶
- Based on real-world GitHub issues, providing authentic evaluation
- Requires end-to-end engineering skills (reading code, understanding issues, writing patches)
- Validated by existing test suites from the source repositories
Limitations¶
- Computationally expensive to run (requires setting up real repositories and test suites)
- The original dataset and its Lite and Verified subsets draw only from Python repositories
- Pass rates can be influenced by the specific subset of issues selected
- Public leaderboard results do not automatically prove performance on private repositories, unusual stacks, or documentation-heavy maintenance work
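To see how much the selected subset affects a pass rate, it can help to break results down by repository. The sketch below assumes a hypothetical list of per-instance outcomes with `instance_id` and `resolved` keys; the outcomes are made up, and the repository name is taken from the part of the instance ID before the final hyphen.

```python
from collections import defaultdict


def resolve_rate(results):
    """Fraction of instances marked resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["resolved"]) / len(results)


def resolve_rate_by_repo(results):
    """Group outcomes by repository and compute a resolve rate for each group."""
    grouped = defaultdict(list)
    for r in results:
        # "django__django-12345" -> "django__django"
        repo = r["instance_id"].rsplit("-", 1)[0]
        grouped[repo].append(r)
    return {repo: resolve_rate(rows) for repo, rows in grouped.items()}


# Made-up outcomes, for illustration only.
results = [
    {"instance_id": "django__django-11111", "resolved": True},
    {"instance_id": "django__django-22222", "resolved": True},
    {"instance_id": "django__django-33333", "resolved": False},
    {"instance_id": "sympy__sympy-44444", "resolved": False},
]

overall = resolve_rate(results)
by_repo = resolve_rate_by_repo(results)
print(overall, by_repo)
```

In this toy data the overall rate is one half, but a subset containing only the first repository would report two thirds, so a headline number should always name the subset behind it.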
When to use it¶
- When evaluating AI agents or LLMs on real-world software engineering capability
- When comparing coding agents that claim to autonomously resolve issues
When not to use it¶
- When evaluating basic code generation from specifications (use HumanEval instead)
- When you need quick, lightweight benchmarking
Getting started¶
SWE-bench requires a Docker environment to safely execute untrusted code and run test suites.
1. Installation¶
pip install swebench
2. Running an Evaluation (Inference)¶
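Generate model predictions for a subset of issues. The inference script expects a dataset variant that already contains the prompt text (for example, one of the oracle-retrieval datasets), and its flags are named --dataset_name_or_path and --model_name_or_path:
python -m swebench.inference.run_api \
  --dataset_name_or_path princeton-nlp/SWE-bench_Lite_oracle \
  --model_name_or_path gpt-4-0613 \
  --output_dir ./predictions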
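Predictions are stored as JSON Lines, one record per task, with the keys `instance_id`, `model_name_or_path`, and `model_patch`. Before spending time on evaluation it can be worth checking the file. The sketch below is a hypothetical helper, not part of the swebench package; the second instance ID and the `my-agent` label are made up.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"instance_id", "model_name_or_path", "model_patch"}


def load_predictions(path):
    """Read a JSONL predictions file and check each record for the required keys."""
    predictions = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - set(record)
            if missing:
                raise ValueError(f"line {line_number}: missing {sorted(missing)}")
            predictions.append(record)
    return predictions


def empty_patch_ids(predictions):
    """Instance IDs whose patch is empty, which cannot resolve a task."""
    return [p["instance_id"] for p in predictions if not (p["model_patch"] or "").strip()]


# Illustrative records written to a temporary file.
records = [
    {"instance_id": "django__django-12345", "model_name_or_path": "my-agent",
     "model_patch": "diff --git a/..."},
    {"instance_id": "django__django-67890", "model_name_or_path": "my-agent",
     "model_patch": ""},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "predictions.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    predictions = load_predictions(path)

print(len(predictions), empty_patch_ids(predictions))
```

Counting empty patches up front separates "the agent produced nothing" from "the agent produced a wrong patch", which are different failure modes.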
3. Evaluating Predictions (Docker)¶
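Use the evaluation harness that ships with the swebench package. It builds Docker images for each task and runs the generated patches against the original repositories. Adjust the predictions path to the file the inference step actually wrote, and pick any run ID:
python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --predictions_path ./predictions/gpt-4-0613.jsonl \
  --max_workers 4 \
  --run_id my-first-run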
Technical examples¶
1. Using SWE-bench Verified¶
The "Verified" subset consists of 500 tasks that have been human-verified to be solvable and have high-quality unit tests.
from datasets import load_dataset
# Load the verified subset
dataset = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
# Access a specific task
task = dataset[0]
print(f"Task ID: {task['instance_id']}")
print(f"Problem Statement: {task['problem_statement']}")
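Beyond the ID and problem statement, each task carries the repository, the base commit, and the test lists used for grading. The sketch below summarizes one row; it assumes `FAIL_TO_PASS` and `PASS_TO_PASS` may arrive as JSON-encoded strings and accepts plain lists as well. The `summarize_task` helper and the stand-in row, including its commit and test names, are hypothetical.

```python
import json


def summarize_task(task):
    """Return a compact summary of one SWE-bench task row."""

    def as_list(value):
        # Assumption: test lists may be stored as JSON strings; accept lists too.
        return json.loads(value) if isinstance(value, str) else list(value)

    return {
        "instance_id": task["instance_id"],
        "repo": task["repo"],
        "base_commit": task["base_commit"],
        "num_fail_to_pass": len(as_list(task["FAIL_TO_PASS"])),
        "num_pass_to_pass": len(as_list(task["PASS_TO_PASS"])),
        "issue_chars": len(task["problem_statement"]),
    }


# Hand-written stand-in for a dataset row; all values are illustrative.
task = {
    "instance_id": "django__django-12345",
    "repo": "django/django",
    "base_commit": "0000000",
    "problem_statement": "Example issue text.",
    "FAIL_TO_PASS": '["test_a"]',
    "PASS_TO_PASS": '["test_b", "test_c"]',
}

summary = summarize_task(task)
print(summary)
```

Running the same helper over `dataset` from the snippet above gives a quick view of how many tests guard each task.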
2. Custom Evaluation Loop¶
For agentic workflows, you can integrate SWE-bench as a final validation step in your local environment.
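import json
import subprocess
import sys

from datasets import load_dataset

# Load a real task so every field the harness needs is present
# (a hand-written instance dict would lack fields such as the test lists).
dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
instance = dataset[0]

# One prediction per task: the task ID, a label for the agent, and its unified diff
prediction = {
    "instance_id": instance["instance_id"],
    "model_name_or_path": "my-agent",  # hypothetical label
    "model_patch": "diff --git a/...",  # replace with the patch your agent produced
}
with open("predictions.jsonl", "w") as f:
    f.write(json.dumps(prediction) + "\n")

# Run the harness in Docker through its command-line entry point
subprocess.run(
    [sys.executable, "-m", "swebench.harness.run_evaluation",
     "--dataset_name", "princeton-nlp/SWE-bench_Lite",
     "--predictions_path", "predictions.jsonl",
     "--instance_ids", instance["instance_id"],
     "--max_workers", "1",
     "--run_id", "custom-loop"],
    check=True,
)
# The harness writes per-instance logs and a summary report; read the resolved status from those.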
Practical evaluation notes¶
Use SWE-bench as a high-signal engineering benchmark, but interpret it as one part of an agent-readiness picture:
- Patch correctness: The benchmark rewards changes that satisfy existing test suites, which is useful for bug-fix agents but less direct for docs, taxonomy, and knowledge-base tasks.
- Repository navigation: Strong results imply the model or harness can locate relevant files, reason over issue text, and make coherent edits in a real repo.
- Harness quality: Tooling around the model matters. Search, edit, test execution, retries, and patch application can change outcomes as much as the base model.
- Local validation: For private repo adoption, run a small internal task set alongside SWE-bench-style metrics so results reflect local languages, CI shape, and review expectations.
Agent comparison checklist¶
When using SWE-bench results to compare coding agents, record:
- The exact benchmark split and date.
- The model, agent harness, tool access, and retry budget.
- Whether the run used public issue text only or any extra retrieval.
- Pass rate plus failure classes: setup failure, wrong file, incomplete patch, flaky test, or unsafe behavior.
- Cost per resolved issue, not only raw pass rate.
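The checklist above can be captured in a small record so that every run is logged the same way. This is a minimal sketch: the `RunRecord` class, its field names, the failure-class labels, and the numbers are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    split: str              # exact benchmark split
    run_date: str           # date of the run
    model: str
    harness: str
    retry_budget: int
    extra_retrieval: bool   # True if anything beyond the public issue text was used
    total_cost_usd: float
    # instance_id -> "resolved" or a failure class such as "wrong_file"
    outcomes: dict = field(default_factory=dict)

    def resolved_count(self):
        return sum(1 for o in self.outcomes.values() if o == "resolved")

    def pass_rate(self):
        return self.resolved_count() / len(self.outcomes) if self.outcomes else 0.0

    def failure_classes(self):
        """Tally of failure classes, excluding resolved instances."""
        return Counter(o for o in self.outcomes.values() if o != "resolved")

    def cost_per_resolved(self):
        """Total cost divided by resolved count; None when nothing was resolved."""
        n = self.resolved_count()
        return self.total_cost_usd / n if n else None


# Made-up run, for illustration only.
run = RunRecord(
    split="SWE-bench_Lite/test",
    run_date="2026-05-16",
    model="example-model",
    harness="example-harness",
    retry_budget=3,
    extra_retrieval=False,
    total_cost_usd=6.0,
    outcomes={
        "task-1": "resolved",
        "task-2": "resolved",
        "task-3": "wrong_file",
        "task-4": "flaky_test",
    },
)

print(run.pass_rate(), run.cost_per_resolved(), dict(run.failure_classes()))
```

Two agents with the same pass rate can differ sharply on cost per resolved issue and on which failure classes dominate, so comparing the full record is more informative than comparing a single number.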
Related tools / concepts¶
- HumanEval
- Terminal-Bench
- DREAM: Deep Research Evaluation with Agentic Metrics
- LM Evaluation Harness
- LongCLI-Bench
- Aider
- OpenHands
- Plandex
- Claude Code
Sources / references¶
Contribution Metadata¶
- Last reviewed: 2026-05-16
- Confidence: high