
HumanEval

What it is

HumanEval is a benchmark released by OpenAI to evaluate the code generation capabilities of large language models (LLMs). It consists of 164 handwritten Python programming problems, each with a function signature, a docstring, a reference solution body, and several unit tests. The problems are self-contained and assess a model's ability to solve basic algorithmic tasks. The key metric is Pass@k: the probability that at least one of k code samples generated for a problem passes all of that problem's unit tests.
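To make the problem format concrete, here is a minimal sketch of a task and a functional-correctness check. The task itself is hypothetical (it is not one of the 164 problems), and the dictionary keys and the helper name passes_tests are assumptions chosen for illustration, not a description of the official harness.

```python
# Hypothetical task in the shape described above: a signature plus docstring
# (the prompt), unit tests, and the name of the function under test.
task = {
    "prompt": (
        "def add(a: int, b: int) -> int:\n"
        '    """Return the sum of a and b."""\n'
    ),
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}


def passes_tests(task: dict, completion: str) -> bool:
    """Return True if prompt + completion passes every unit test."""
    program = task["prompt"] + completion + "\n" + task["test"]
    namespace: dict = {}
    try:
        # Model output is untrusted code: run it in a sandbox in real use.
        exec(program, namespace)
        namespace["check"](namespace[task["entry_point"]])
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures.
        return False


print(passes_tests(task, "    return a + b\n"))  # correct body
print(passes_tests(task, "    return a - b\n"))  # wrong body
```

A sample counts as correct only if every assertion passes, which is why the score reflects functional correctness rather than textual similarity to the reference solution.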

What problem it solves

Provides a standardized measure of whether LLMs can generate functionally correct code from natural language descriptions.

Where it fits in the stack

Benchmarking. Used as a primary reference benchmark for code generation capabilities of LLMs.

Typical use cases

  • Evaluating LLM code generation accuracy on self-contained programming tasks
  • Comparing models on their ability to produce correct Python code
  • Measuring improvements in code generation across model versions

Strengths

  • Well-established and widely cited benchmark
  • Problems are self-contained with clear unit test validation
  • Pass@k metric accounts for sampling variability
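As an illustration of how Pass@k handles sampling variability, one commonly used approach draws n >= k samples per problem, counts the c samples that pass, and computes 1 - C(n-c, k) / C(n, k), then averages over problems. The sketch below assumes that estimator; the function names and the (n, c) input layout are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average Pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Example: 10 samples with 3 passing gives Pass@1 of about 0.3.
print(pass_at_k(n=10, c=3, k=1))
print(mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```

Drawing more samples than k and averaging over subsets gives a lower-variance estimate than generating exactly k samples and checking whether any of them passes.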

Limitations

  • Only 164 problems, which may not cover the full range of programming tasks
  • Covers only Python and basic algorithmic tasks
  • Does not test real-world software engineering skills like debugging or working with existing codebases

When to use it

  • When comparing LLMs on their ability to generate correct code from specifications
  • When evaluating a model for coding assistant use cases

When not to use it

  • When you need to evaluate real-world software engineering capability (use SWE-bench instead)
  • When you need to evaluate code generation in programming languages other than Python

Sources / references

Contribution Metadata

  • Last reviewed: 2026-02-26
  • Confidence: medium