SWE-bench

What it is

SWE-bench is a benchmark for evaluating LLMs on real-world software engineering tasks. Each task pairs a real GitHub issue with a snapshot of the repository from before the issue was fixed; the model must generate a patch that resolves the issue, as validated by running the repository's test suite (including the tests added by the original human fix).
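
As a concrete sketch, one way to inspect a task instance is via the Hugging Face datasets library. The dataset and field names below follow the published princeton-nlp/SWE-bench release; verify the exact schema against the current version:

    # Minimal sketch: load SWE-bench and inspect one task instance.
    # Assumes the Hugging Face `datasets` package; field names follow the
    # published princeton-nlp/SWE-bench release.
    from datasets import load_dataset

    swebench = load_dataset("princeton-nlp/SWE-bench", split="test")

    task = swebench[0]
    print(task["instance_id"])        # task id, formatted "<owner>__<repo>-<PR number>"
    print(task["repo"])               # source repository, e.g. "astropy/astropy"
    print(task["base_commit"])        # commit to check out before applying a patch
    print(task["problem_statement"])  # the GitHub issue text given to the model
    print(task["FAIL_TO_PASS"])       # tests that must flip from failing to passing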

What problem it solves

Measures whether LLMs can perform practical software engineering work -- understanding codebases, diagnosing issues, and producing working fixes -- rather than just solving isolated coding puzzles.

Where it fits in the stack

Benchmarking. Used as a reference benchmark for evaluating real-world software engineering capabilities of LLMs and AI agents.

Typical use cases

  • Evaluating AI coding agents on their ability to resolve real GitHub issues (see the prediction-format sketch after this list)
  • Comparing models on practical software engineering tasks
  • Tracking progress of AI agents toward autonomous software development
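
For the first use case, agent outputs are typically collected into a predictions file and scored by the SWE-bench evaluation harness. A minimal sketch, assuming the JSONL prediction format and harness flags described in the harness documentation; the patch body and agent name are illustrative placeholders:

    # Sketch: write agent patches in the predictions format the SWE-bench
    # harness consumes. Key names follow the harness docs; the patch body
    # and model name are placeholders, not a real fix.
    import json

    predictions = [
        {
            "instance_id": "django__django-11099",  # must match a dataset instance id
            "model_name_or_path": "my-agent-v1",    # hypothetical agent identifier
            "model_patch": "diff --git a/ ...",     # unified diff produced by the agent
        },
    ]

    with open("predictions.jsonl", "w") as f:
        for pred in predictions:
            f.write(json.dumps(pred) + "\n")

    # The harness then rebuilds each repository environment (Docker) and runs
    # the tests; the invocation is along these lines (check your installed version):
    #   python -m swebench.harness.run_evaluation \
    #       --dataset_name princeton-nlp/SWE-bench_Lite \
    #       --predictions_path predictions.jsonl \
    #       --max_workers 4 --run_id demo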

Strengths

  • Based on real-world GitHub issues, providing authentic evaluation
  • Requires end-to-end engineering skills (reading code, understanding issues, writing patches)
  • Validated by repository test suites: a patch counts as a fix only if the issue's failing tests now pass and previously passing tests still pass (see the sketch after this list)
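
That resolution criterion can be stated compactly. A sketch, assuming per-test outcomes have already been collected by the harness; the data layout here is illustrative, not the harness's internal format:

    # Sketch of the resolution criterion. FAIL_TO_PASS tests are those the
    # original fix made pass; PASS_TO_PASS tests guard against regressions.
    def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
        """True only if every targeted test passes and nothing regressed."""
        return all(fail_to_pass.values()) and all(pass_to_pass.values())

    # Example: the regression test passes, but an unrelated test broke,
    # so the instance is not counted as resolved.
    print(is_resolved(
        {"test_issue_regression": True},
        {"test_existing_behavior": True, "test_other_path": False},
    ))  # -> False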

Limitations

  • Computationally expensive to run (each instance needs its own repository environment built and a full test-suite run)
  • Limited to Python repositories in the current dataset
  • Reported pass rates depend on which subset of issues is evaluated (e.g., the full test split vs. curated subsets such as SWE-bench Lite or SWE-bench Verified), so scores are not always directly comparable

When to use it

  • When evaluating AI agents or LLMs on real-world software engineering capability
  • When comparing coding agents that claim to autonomously resolve issues

When not to use it

  • When evaluating basic code generation from specifications (use HumanEval instead)
  • When you need quick, lightweight benchmarking

Sources / references

  • Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", ICLR 2024. arXiv:2310.06770
  • https://www.swebench.com
  • https://github.com/princeton-nlp/SWE-bench

Contribution Metadata

  • Last reviewed: 2026-02-26
  • Confidence: medium