GSM8K (Grade School Math 8K)¶
What it is¶
GSM8K is a benchmark for evaluating the multi-step mathematical reasoning capabilities of LLMs. It contains roughly 8.5K high-quality grade school math word problems (7,473 for training and 1,319 for testing), each requiring 2 to 8 steps of basic arithmetic to solve. The key metric is Exact Match (EM): the fraction of problems for which the model's final numerical answer equals the reference answer.
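A minimal sketch of how an Exact Match score on final numerical answers could be computed. It assumes the answer is the last number in the model output, and that formatting differences such as "$1,250.00" versus "1250" should not count as errors. The helper names (`normalize_number`, `exact_match`, `exact_match_accuracy`) are hypothetical, and evaluation harnesses may apply stricter or different extraction rules.

```python
import re
from typing import Iterable, Optional

# Integers and decimals, with optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize_number(text: str) -> Optional[str]:
    """Return the last number found in `text` in canonical form, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    # Render whole numbers without ".0" so that "35" and "35.00" compare equal.
    return str(int(value)) if value.is_integer() else str(value)


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings contain a number and the final numbers agree."""
    pred = normalize_number(prediction)
    gold = normalize_number(reference)
    return pred is not None and pred == gold


def exact_match_accuracy(predictions: Iterable[str], references: Iterable[str]) -> float:
    """Fraction of prediction/reference pairs whose final numbers agree."""
    pairs = list(zip(predictions, references))
    if not pairs:
        raise ValueError("need at least one prediction/reference pair")
    return sum(exact_match(p, r) for p, r in pairs) / len(pairs)
```

Note that this scoring is all-or-nothing: a response with sound reasoning and one slip in the last step scores the same as a response with no reasoning at all.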
What problem it solves¶
Provides a standardized way to measure whether LLMs can perform multi-step arithmetic reasoning, a fundamental capability for practical mathematical tasks and logical planning. It moves beyond simple "calculator" tasks to test the model's ability to decompose a problem into logical steps.
Where it fits in the stack¶
Benchmarking. Serves as a widely used reference for evaluating mathematical reasoning and Chain-of-Thought (CoT) efficacy in LLMs.
Typical use cases¶
- Evaluating LLM arithmetic and multi-step reasoning capabilities
- Measuring the impact of prompting strategies (e.g., Chain-of-Thought, Few-Shot) on math performance
- Comparing models on fundamental mathematical problem-solving
- Regression testing for fine-tuned models to ensure math capabilities haven't degraded
Strengths¶
- Large dataset (8.5K problems, of which 1,319 form the test split) gives reasonably tight accuracy estimates
- Problems are well-defined with unambiguous numerical answers
- Widely adopted, enabling easy cross-model comparisons
- Scores tend to track a model's general reasoning and instruction-following ability
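To make "statistically meaningful" concrete, the sketch below estimates a 95% confidence interval for an accuracy score using the normal (Wald) approximation. It assumes scores are reported on the 1,319-problem test split and that each problem is an independent trial; the function name is hypothetical, and other interval methods (for example Wilson) may be preferable near 0% or 100%.

```python
import math
from typing import Tuple


def accuracy_confidence_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for an accuracy of correct/total.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to the valid range for a proportion.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Example: 660 of 1,319 test problems answered correctly (about 50%).
low, high = accuracy_confidence_interval(660, 1319)
print(f"accuracy 95% CI: [{low:.3f}, {high:.3f}]")
```

At around 50% accuracy the interval is roughly plus or minus 2.7 percentage points, so differences between models smaller than that are hard to distinguish from sampling noise.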
Limitations¶
- Limited to grade-school-level math; does not test advanced mathematics (calculus, linear algebra)
- Problems are relatively formulaic compared to real-world math applications
- Exact Match scoring does not give partial credit for correct reasoning with minor arithmetic errors
- Increasing evidence of benchmark contamination in newer models
When to use it¶
- When comparing LLMs on basic mathematical reasoning and logical consistency
- When evaluating the effect of different prompting techniques (like "Let's think step by step") on math performance
- For base-level capability checks during model development
When not to use it¶
- When you need to evaluate advanced mathematical reasoning (use MATH Benchmark instead)
- When testing non-mathematical or creative writing capabilities
- For evaluating complex symbolic logic or theorem proving
Getting started¶
GSM8K can be evaluated using standard libraries like lm-eval.
- Install the LM Evaluation Harness:
    pip install lm-eval
- Run the evaluation:
    lm_eval --model hf \
        --model_args pretrained=EleutherAI/pythia-70m \
        --tasks gsm8k \
        --device cuda:0 \
        --batch_size 8
Technical examples¶
Example Problem¶
A typical GSM8K problem involves multiple steps:
Question: Janet has 30 apples. She gives 10 to her neighbor and then buys 15 more. How many apples does she have now?
Reasoning:
1. Janet starts with 30.
2. She gives away 10: 30 - 10 = 20.
3. She buys 15 more: 20 + 15 = 35.
Answer: 35
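In the published dataset, reference solutions mark intermediate calculations with annotations of the form `<<30-10=20>>` and put the final answer on a last line starting with `####`. The sketch below rewrites the example above in that style (an illustrative rewrite, not an actual dataset entry), checks the arithmetic in each annotation, and extracts the final answer. The helper names are hypothetical.

```python
import ast
import math
import operator
import re
from typing import List, Tuple

# The example above, rewritten in GSM8K's reference-solution style (illustrative only).
SOLUTION = (
    "Janet gives away 10 apples: 30 - 10 = <<30-10=20>>20.\n"
    "She buys 15 more: 20 + 15 = <<20+15=35>>35.\n"
    "#### 35"
)

# Captures the expression and the claimed result inside <<expr=result>>.
ANNOTATION = re.compile(r"<<([^=<>]+)=([^<>]+)>>")

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node: ast.AST) -> float:
    """Evaluate a parsed arithmetic expression limited to + - * / and numbers."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def check_annotations(solution: str) -> List[Tuple[str, bool]]:
    """Return (expression, is_correct) for every calculator annotation."""
    results = []
    for expr, claimed in ANNOTATION.findall(solution):
        value = _evaluate(ast.parse(expr.strip(), mode="eval"))
        results.append((expr, math.isclose(value, float(claimed))))
    return results


def final_answer(solution: str) -> str:
    """Return the text after the last '####' marker, without thousands separators."""
    if "####" not in solution:
        raise ValueError("no '####' final-answer marker found")
    return solution.rsplit("####", 1)[1].strip().replace(",", "")


print(check_annotations(SOLUTION))
print(final_answer(SOLUTION))
```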
Evaluation with Chain-of-Thought¶
Prompting models with "Let's think step by step" often significantly improves GSM8K scores:
Q: [GSM8K Question]
A: Let's think step by step.
[Model generates reasoning steps]
Therefore, the answer is [Number].
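The template above could be wired up roughly as follows. This sketch only builds the prompt and parses a completion; the call to an actual model is left out, and the function names and the assumption that the completion ends with "the answer is <number>" are illustrative rather than part of any particular library.

```python
import re
from typing import Optional

COT_TRIGGER = "Let's think step by step."

# Looks for "the answer is <number>", allowing an optional currency sign.
ANSWER_PATTERN = re.compile(
    r"the answer is\s*\$?\s*(-?\d[\d,]*(?:\.\d+)?)",
    re.IGNORECASE,
)


def build_cot_prompt(question: str) -> str:
    """Format a question using the zero-shot Chain-of-Thought template."""
    return f"Q: {question.strip()}\nA: {COT_TRIGGER}"


def extract_answer(completion: str) -> Optional[str]:
    """Return the number from the last 'the answer is ...' statement, or None."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return None
    # Drop thousands separators (and any trailing comma the pattern picked up).
    return matches[-1].replace(",", "")


prompt = build_cot_prompt(
    "Janet has 30 apples. She gives 10 to her neighbor and then buys 15 more. "
    "How many apples does she have now?"
)

# A hand-written stand-in for a model completion, used only to show the parsing.
completion = (
    "Janet starts with 30. 30 - 10 = 20. 20 + 15 = 35.\n"
    "Therefore, the answer is 35."
)
print(prompt)
print(extract_answer(completion))
```

Returning `None` when no final statement is found keeps unparseable completions distinct from wrong answers, which helps when diagnosing whether a low score comes from reasoning errors or from output formatting.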
Related tools / concepts¶
- MATH Benchmark
- ASDiv
- DREAM: Deep Research Evaluation with Agentic Metrics
- SWE-bench
- LM Evaluation Harness
- GPQA
- MMLU
Sources / references¶
Contribution Metadata¶
- Last reviewed: 2026-05-14
- Confidence: high