GPQA (Graduate-Level Google-Proof Q&A)

What it is

GPQA is a challenging benchmark for evaluating expert-level scientific knowledge and reasoning in LLMs. Its main set consists of 448 four-option multiple-choice questions written by domain experts (people who hold or are pursuing a PhD) in biology, physics, and chemistry. The questions are designed to be "Google-proof": skilled non-experts score only about 34% even with unrestricted access to the internet, while random guessing yields 25%. The primary metric is accuracy (percentage of correct answers); some evaluations also report self-consistency, typically meaning accuracy after majority voting over several sampled answers.
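
The two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code of any particular harness: it assumes answers have already been reduced to single letters, it takes self-consistency to mean majority voting over sampled answers, and the data below is hypothetical.

```python
from collections import Counter


def accuracy(predictions, gold):
    """Fraction of questions whose predicted letter matches the gold letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def majority_vote(samples):
    """Most frequent answer among several sampled answers to one question."""
    return Counter(samples).most_common(1)[0][0]


# Hypothetical data: three questions, five sampled answers per question.
gold = ["C", "A", "D"]
sampled = [
    ["C", "C", "B", "C", "A"],
    ["B", "B", "A", "B", "D"],
    ["D", "D", "D", "A", "D"],
]

voted = [majority_vote(s) for s in sampled]
print(voted, accuracy(voted, gold))
```

Voting helps only when the model's most common answer is the right one; on the second hypothetical question the majority answer is wrong, so voting locks in the error.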

What problem it solves

GPQA measures whether LLMs possess deep, expert-level scientific knowledge and reasoning that cannot be trivially looked up. It offers a more rigorous assessment than general-knowledge benchmarks such as MMLU, whose questions increasingly appear in training data.

Where it fits in the stack

Benchmarking. Used as a reference benchmark for evaluating advanced reasoning and scientific competence in state-of-the-art LLMs.

Typical use cases

Evaluating LLM performance on graduate-level scientific reasoning
Comparing models on tasks that require genuine understanding rather than surface-level retrieval
Assessing progress toward expert-level AI capabilities in STEM fields

Strengths

Questions are expert-written and verified to be genuinely difficult
Covers multiple scientific disciplines (Biology, Physics, Chemistry)
Resistant to simple retrieval-based strategies and internet search
Scores are commonly read as a proxy for scientific reasoning ability, since domain experts substantially outperform skilled non-experts who have web access

Limitations

Limited scale (448 questions in the main set, 198 in the Diamond subset), so sub-domain coverage is partial and small score differences can fall within sampling noise
Focuses on "hard" sciences only; does not cover humanities or social sciences
Multiple-choice format may not fully capture open-ended reasoning ability
High barrier to entry for human verification (requires PhD-level experts)
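
To make the limited-scale point concrete, the sketch below computes a 95% Wilson score interval for an accuracy measured on 448 questions. The result of 224 correct answers is hypothetical, and the Wilson interval is one reasonable choice of interval among several, not something the benchmark prescribes.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical result: 224 of the 448 main-set questions answered correctly.
low, high = wilson_interval(224, 448)
print(f"accuracy 50.0%, 95% interval {low:.1%} to {high:.1%}")
```

At this sample size the interval spans a little under five percentage points either side of the measured accuracy, so gaps of one or two points between models should be read with caution.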

When to use it

When comparing frontier LLMs on their ability to handle difficult, expert-level scientific questions
When you need a benchmark that is resistant to memorization and search-engine shortcuts

When not to use it

When evaluating code generation or practical task completion (use HumanEval or SWE-bench)
When you need broad general-knowledge evaluation for a non-expert audience (use MMLU instead)
For testing basic conversational capabilities

Getting started

GPQA is typically run using evaluation frameworks like the LM Evaluation Harness.

Clone the LM Evaluation Harness repository.
Install dependencies.
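Run the GPQA task against your model. Recent harness versions expose an lm_eval command in place of main.py and name the zero-shot main-set task gpqa_main_zeroshot; check the harness's task list for the exact names in your installed version:

lm_eval \
    --model hf \
    --model_args pretrained=your-model-name \
    --tasks gpqa_main_zeroshot \
    --device cuda:0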

Technical examples

Example Question Format

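Questions are structured to test reasoning over multiple steps of scientific logic. The record below is an illustrative, simplified layout rather than the released schema, which stores one correct answer and three incorrect answers per question, so evaluation code typically shuffles the options before prompting: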

{
  "question": "In a certain species of flowering plant, the locus for flower color has two alleles...",
  "explanation": "To solve this, we must first calculate the allele frequencies using the Hardy-Weinberg equilibrium...",
  "choice_a": "1/4",
  "choice_b": "1/2",
  "choice_c": "3/4",
  "choice_d": "1",
  "answer": "choice_c"
}
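
A record in the illustrative layout above could be turned into a prompt and scored roughly as follows. This is a sketch under stated assumptions: the field names come from the example record rather than the released dataset, the helper names (format_prompt, gold_letter, is_correct) are hypothetical, and answer extraction is reduced to reading the first character of the model output.

```python
LETTERS = ["A", "B", "C", "D"]

# Record copied from the illustrative example above (elided text left as-is).
record = {
    "question": "In a certain species of flowering plant, the locus for flower color has two alleles...",
    "choice_a": "1/4",
    "choice_b": "1/2",
    "choice_c": "3/4",
    "choice_d": "1",
    "answer": "choice_c",
}


def format_prompt(rec):
    """Render the question followed by lettered options and an instruction."""
    lines = [rec["question"], ""]
    lines += [f"{letter}. {rec['choice_' + letter.lower()]}" for letter in LETTERS]
    lines += ["", "Answer with a single letter (A, B, C, or D)."]
    return "\n".join(lines)


def gold_letter(rec):
    """Map an answer key such as 'choice_c' to the letter 'C'."""
    return rec["answer"].rsplit("_", 1)[-1].upper()


def is_correct(rec, model_output):
    """Compare the first character of the model output with the gold letter."""
    return model_output.strip().upper()[:1] == gold_letter(rec)


print(format_prompt(record))
print(is_correct(record, " c"))
```

Real evaluation setups usually need more robust answer extraction, for example when the model explains its reasoning before stating a letter.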

Performance Comparison (Representative)

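The table below lists reported accuracy on GPQA Diamond, the 198-question subset on which expert validators agreed and most non-expert validators answered incorrectly. Reported figures vary with the prompting setup, so treat them as approximate. The non-expert human figure is the GPQA paper's headline number for skilled non-experts with web access and was not measured on Diamond alone: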

Model	GPQA Diamond (Acc)
Claude 3.5 Sonnet	~59.4%
GPT-4o	~53.6%
Claude 3 Opus	~50.4%
Human (Non-expert)	~34%
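
Because Diamond has only 198 questions, it is worth checking whether a gap between two rows exceeds sampling noise. The sketch below computes a z-score for the difference between the two highest accuracies in the table. It treats the two runs as independent samples, which is a simplification because both models answer the same questions; a paired test on per-question results would be more appropriate when those are available.

```python
import math


def diff_z_score(acc_a, acc_b, n):
    """z-score for the gap between two accuracies, each measured on n questions.

    Assumes the two runs are independent samples (a simplification).
    """
    se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    return (acc_a - acc_b) / se


# Accuracies taken from the table above; 198 is the size of the Diamond subset.
z = diff_z_score(0.594, 0.536, 198)
print(f"z = {z:.2f}")
```

Under this approximation the z-score is below 1.96, the usual threshold for significance at the 95% level, so the table alone does not establish that one of these two models is reliably ahead.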

Sources / references

Contribution Metadata

Last reviewed: 2026-05-14
Confidence: high