MBPP (Mostly Basic Python Problems)

What it is

MBPP is a benchmark for evaluating the code generation performance of LLMs on basic Python tasks. It consists of roughly 1,000 crowd-sourced Python programming problems intended to be solvable by entry-level programmers. Each problem includes a natural-language task description, a reference solution, and three automated test cases. The key metrics are Pass@1 (the fraction of problems solved on the first sampled attempt) and Pass@k (the probability that at least one of k sampled solutions passes the tests).
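
For orientation, MBPP is commonly distributed in machine-readable form, for example on the Hugging Face Hub under the "mbpp" dataset identifier. The sketch below is a minimal example, not an official recipe: it loads the test split and prints one problem. The field names (text, code, test_list) follow the commonly published schema and should be checked against the release you actually use.

    # Minimal sketch: load MBPP and inspect one problem.
    # Assumes the Hugging Face "mbpp" dataset; field names may differ per release.
    from datasets import load_dataset

    mbpp = load_dataset("mbpp", split="test")
    problem = mbpp[0]

    print(problem["text"])          # natural-language task description
    print(problem["code"])          # crowd-sourced reference solution
    for assertion in problem["test_list"]:
        print(assertion)            # assert-style test cases (three per task)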

What problem it solves

Provides a large-scale, standardized evaluation of LLM code generation on entry-level Python problems, complementing the smaller HumanEval benchmark with a broader problem set.

Where it fits in the stack

Benchmarking. Used as a reference benchmark for evaluating basic code generation capabilities of LLMs.
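
Scoring MBPP for benchmarking typically follows the Pass@k methodology mentioned above. The sketch below implements the standard unbiased Pass@k estimator popularized by the HumanEval/Codex evaluation work: for a task with n generated samples of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. The function name is illustrative, not part of any MBPP release.

    # Minimal sketch of the unbiased pass@k estimator (n samples, c correct).
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Estimate P(at least one of k sampled solutions passes all tests)."""
        if n - c < k:
            return 1.0  # fewer failing samples than k draws: success guaranteed
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 10 samples per task, 3 of which pass the tests.
    print(pass_at_k(n=10, c=3, k=1))  # 0.30
    print(pass_at_k(n=10, c=3, k=5))  # ~0.92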

Typical use cases

  • Evaluating LLM performance on straightforward Python coding tasks
  • Comparing code generation accuracy across models at the entry level
  • Complementing HumanEval results with a larger problem set

Strengths

  • Large dataset (around 1,000 problems) for statistically robust evaluation
  • Problems are straightforward and well-defined
  • Includes automated test cases for objective scoring (see the sketch after this list)
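
As a hedged illustration of that objective scoring, the sketch below executes a candidate solution against a task's assert-style tests in a fresh namespace. Production harnesses run this in a sandboxed subprocess with a time limit; the candidate_code and test_list values here are illustrative placeholders, not actual MBPP content.

    # Minimal sketch: check a candidate solution against MBPP-style assert tests.
    # Real harnesses sandbox execution and enforce timeouts; omitted here.
    def passes_all_tests(candidate_code: str, test_list: list[str]) -> bool:
        namespace: dict = {}
        try:
            exec(candidate_code, namespace)   # define the candidate function
            for test in test_list:            # each test is an assert statement
                exec(test, namespace)
            return True
        except Exception:
            return False

    candidate_code = "def add(a, b):\n    return a + b"
    test_list = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]
    print(passes_all_tests(candidate_code, test_list))  # True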

Limitations

  • Limited to basic Python problems; does not test advanced programming
  • Crowd-sourced problems may have inconsistent quality
  • Does not evaluate real-world software engineering tasks

When to use it

  • When evaluating basic code generation capabilities alongside HumanEval
  • When you need a larger problem set than HumanEval for more robust statistics

When not to use it

  • When evaluating advanced programming or real-world engineering tasks
  • When testing non-Python code generation

Contribution Metadata

  • Last reviewed: 2026-02-26
  • Confidence: medium