Humanity's Last Exam (HLE)¶
What it is¶
HLE is a benchmark designed to test LLMs at the frontier of expert human knowledge. The finalized dataset consists of 2,500 highly complex, multi-disciplinary questions (an earlier release listed 3,000) across over a hundred subjects (Mathematics, Physics, Biology, Humanities, etc.). Created by the Center for AI Safety (CAIS) and Scale AI, it is a "frontier" benchmark on which state-of-the-art models at the time of release showed low accuracy and poor calibration.
What problem it solves¶
Addresses the "saturation" of existing benchmarks like MMLU and GPQA. As frontier models reach or exceed human-level performance on older tests, those tests lose their utility as measurement tools. HLE provides a new ceiling, ensuring that progress toward expert-level reasoning remains measurable.
Where it fits in the stack¶
Benchmarking. Serves as a high-difficulty knowledge and reasoning benchmark for evaluating the upper limits of LLM and multi-modal model capabilities.
Typical use cases¶
- Frontier Model Evaluation: Comparing the reasoning capabilities of state-of-the-art models (GPT-4o, Claude 3.5 Sonnet, Llama 3.1 405B).
- Multi-modal Assessment: Testing models on questions that require both textual reasoning and image understanding (about 14% of the questions include an image).
- Calibration Testing: Measuring whether models accurately estimate their own confidence in their answers.
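The calibration use case above can be illustrated with a binned RMS calibration error. This is a minimal sketch, not the official HLE scoring code: the function name `rms_calibration_error`, the equal-width binning, and the default of 10 bins are all assumptions made for illustration.

```python
import math


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS calibration error (illustrative, not the official HLE metric code).

    confidences: stated confidences as fractions in [0, 1]
    correct: booleans, True where the answer was judged correct
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    # Group answers into equal-width confidence bins (an assumed binning scheme).
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    weighted_squared_gap = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(a for _, a in items) / len(items)
        # Weight each bin's squared confidence/accuracy gap by its share of answers.
        weighted_squared_gap += (len(items) / total) * (mean_conf - accuracy) ** 2
    return math.sqrt(weighted_squared_gap)


# A model that always claims 90% confidence but is right one time in four.
overconfident = rms_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
# A model that claims 50% confidence and is right half the time.
calibrated = rms_calibration_error([0.5, 0.5], [True, False])
```

In this toy input the overconfident model has a gap of 0.9 - 0.25 = 0.65 between stated confidence and accuracy, while the second model has a gap of zero. A lower value means stated confidence tracks actual accuracy more closely.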
Strengths¶
- Extreme Difficulty: Intended as the "last academic exam" of its kind, with enough headroom to stay challenging as models improve.
- Closed-ended & Verifiable: Answers are precise, allowing for automated, low-cost evaluation.
- Subject Diversity: Covers over 100 subjects with questions sourced from world-class experts.
- Private Set: Includes a held-out private set to combat data contamination and benchmark hacking.
Limitations¶
- Not for Everyday Tasks: Does not measure "helpful assistant" capabilities or basic instruction following.
- Low Signal for Small Models: Smaller or older models often score near zero, making it difficult to distinguish between them.
- Requires LLM Judge: While answers are closed-ended, equivalent answers can be written in different formats (e.g., decimals vs. fractions), so automated scoring often relies on an LLM judge rather than exact string matching.
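The answer-format limitation above is easy to reproduce. The sketch below uses two hypothetical helpers, `exact_match` and `numeric_match`, which are not part of HLE or any evaluation framework. It shows how a plain string comparison rejects an equivalent answer, and how a numeric check covers only the simplest cases.

```python
from fractions import Fraction


def exact_match(pred, gold):
    """Naive scoring: case-insensitive string equality."""
    return pred.strip().lower() == gold.strip().lower()


def numeric_match(pred, gold):
    """Compare answers as exact rationals.

    Returns True or False when both answers parse as numbers,
    and None when either one is not a plain number.
    """
    try:
        return Fraction(pred.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return None


string_result = exact_match("0.5", "1/2")               # rejected as strings
numeric_result = numeric_match("0.5", "1/2")            # equal as numbers
symbolic_result = numeric_match("sqrt(2)/2", "0.7071")  # not parseable here
```

The numeric check handles decimals and fractions, but returns `None` for symbolic expressions, units, or free-text answers. Those cases are where a judge model (or a more capable equivalence checker) is still needed.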
When to use it¶
- When evaluating frontier models on the hardest available reasoning tasks.
- When existing benchmarks like MMLU or GPQA show signs of saturation (models scoring >90%).
- When testing a model's ability to handle world-class scientific or mathematical problems.
When not to use it¶
- When evaluating models for general-purpose chat or basic RAG tasks.
- When you need a lightweight, fast-running benchmark for early-stage development.
- When you are optimizing for speed or low-cost inference rather than peak intelligence.
Getting started (CLI Example)¶
HLE can be run with Inspect, the open-source evaluation framework from the UK AI Security Institute, using the task provided by the inspect_evals package.
# Install inspect and the evals package
pip install inspect-ai inspect_evals
# Run the HLE benchmark against an OpenAI model
inspect eval inspect_evals/hle --model openai/gpt-4o
Evaluation Format¶
Models are typically prompted to provide their response in a structured format to facilitate judging:
Explanation: {detailed reasoning}
Answer: {final exact answer}
Confidence: {0-100%}
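A response in this format can be split into its fields before judging. The following is a minimal sketch, assuming the three labels appear once each and in the order shown; `parse_response` is a hypothetical helper, not part of Inspect or the HLE tooling.

```python
import re

# Labels must appear in the order Explanation, Answer, Confidence (an assumption).
RESPONSE_RE = re.compile(
    r"Explanation:\s*(?P<explanation>.*?)\s*"
    r"Answer:\s*(?P<answer>.*?)\s*"
    r"Confidence:\s*(?P<confidence>\d+(?:\.\d+)?)\s*%?",
    re.DOTALL,  # let the explanation span several lines
)


def parse_response(text):
    """Return the three fields as a dict, or None if the format is not followed."""
    match = RESPONSE_RE.search(text)
    if match is None:
        return None
    # Clamp to 0-100 and convert the percentage to a fraction.
    percent = min(max(float(match.group("confidence")), 0.0), 100.0)
    return {
        "explanation": match.group("explanation"),
        "answer": match.group("answer"),
        "confidence": percent / 100.0,
    }


sample = "Explanation: The series telescopes.\nAnswer: 1/2\nConfidence: 85%"
parsed = parse_response(sample)
malformed = parse_response("Answer: 42")  # missing fields
```

Returning `None` for malformed output keeps format failures separate from wrong answers, which matters when the parsed confidence feeds a calibration metric.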
Related tools / concepts¶
- GPQA - Graduate-level Google-proof Q&A.
- MMLU - Massive Multitask Language Understanding.
- ARC (AI2 Reasoning Challenge) - Challenging questions for reasoning.
- GSM8K - Grade school math word problems.
- Chatbot Arena - Crowdsourced Elo ratings for LLMs.
- DREAM: Deep Research Evaluation with Agentic Metrics - Agentic evaluation framework.
- SWE-bench - Software engineering benchmark.
- LM Evaluation Harness - Unified framework for running multiple benchmarks.
Sources / references¶
Contribution Metadata¶
- Last reviewed: 2026-06-01
- Confidence: high