ARC (AI2 Reasoning Challenge)

What it is

The AI2 Reasoning Challenge (ARC) is a question-answering dataset of 7,787 multiple-choice science questions, primarily sourced from grade-school standardized assessments. It is divided into an ARC-Easy set (5,197 questions) and a harder ARC-Challenge set (2,590 questions).

What problem it solves

Traditional QA benchmarks often include questions that can be solved via simple information retrieval or statistical pattern matching. ARC's Challenge Set keeps only questions that both a retrieval-based baseline and a word co-occurrence baseline answered incorrectly, so solving them tends to require multi-hop reasoning and commonsense background knowledge. It is often used as a litmus test for "System 2" thinking in LLMs.
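The partition rule behind the Challenge Set can be sketched in a few lines: a question goes to Challenge only when a retrieval-style solver and a co-occurrence-style solver both get it wrong. The two solvers below are deliberately toy stand-ins with hypothetical names and a three-sentence corpus, not the baselines used to build ARC, and counting a tied top score as a wrong answer is a simplifying assumption.

```python
import re

# Toy stand-in corpus; the real baselines searched a large science text corpus.
CORPUS = [
    "plants make food through photosynthesis using sunlight",
    "the moon orbits the earth",
    "ice melts when it is heated",
]

def tokens(text):
    """Lowercase alphabetic tokens as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def pick_unique_best(choices, score):
    """Return the top-scoring choice, or None when the top score is tied."""
    ranked = sorted(choices, key=score, reverse=True)
    if len(ranked) > 1 and score(ranked[0]) == score(ranked[1]):
        return None  # assumption: a tie counts as "no answer", hence wrong
    return ranked[0]

def retrieval_solver(question, choices):
    """Score a choice by its best token overlap (question + choice) with one sentence."""
    def score(choice):
        query = tokens(question) | tokens(choice)
        return max(len(query & tokens(sentence)) for sentence in CORPUS)
    return pick_unique_best(choices, score)

def cooccurrence_solver(question, choices):
    """Score a choice by how many question words share a sentence with it."""
    q = tokens(question)
    def score(choice):
        c = tokens(choice)
        return sum(len(q & tokens(s)) for s in CORPUS if c & tokens(s))
    return pick_unique_best(choices, score)

def split_easy_challenge(items):
    """Challenge = both solvers wrong; everything else is Easy."""
    easy, challenge = [], []
    for item in items:
        both_wrong = all(
            solver(item["question"], item["choices"]) != item["answer"]
            for solver in (retrieval_solver, cooccurrence_solver)
        )
        (challenge if both_wrong else easy).append(item)
    return easy, challenge

items = [
    {"question": "What process do plants use to make food?",
     "choices": ["photosynthesis", "erosion", "orbiting"],
     "answer": "photosynthesis"},
    {"question": "Which property of a mineral can be determined just by looking at it?",
     "choices": ["luster", "mass", "weight", "hardness"],
     "answer": "luster"},
]
easy, challenge = split_easy_challenge(items)
print([i["answer"] for i in easy], [i["answer"] for i in challenge])
```

With this toy corpus, the photosynthesis question is answerable by surface overlap and lands in Easy, while the mineral question has no supporting sentence and lands in Challenge.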

Where it fits in the stack

ARC is part of the Benchmarking layer, used to evaluate the reasoning and natural language understanding (NLU) capabilities of large language models. ARC-Challenge was one of the core tasks on the original Hugging Face Open LLM Leaderboard, where it was run in a 25-shot setting.

Typical use cases

  • Evaluating the zero-shot or few-shot reasoning performance of LLMs.
  • Comparing the "deep inference" capabilities of different model architectures.
  • Researching hybrid reasoning systems that combine neural and symbolic approaches.
  • Validating the impact of reasoning-oriented fine-tuning or prompting (e.g., Chain-of-Thought).

Strengths

  • Reasoning Focus: The Challenge Set is explicitly designed to resist simple retrieval-based solutions.
  • Naturally Authored: Questions are taken from real exams, not generated by other AI.
  • Diverse Reasoning Types: Includes cause-and-effect, analogy, and categorical reasoning.
  • Standardized: Widely adopted by the research community, allowing for broad comparisons.

Limitations

  • Domain Specific: Limited primarily to elementary and middle-school science.
  • Multiple Choice: Does not evaluate generative capabilities or open-ended explanation.
  • No Diagrams: The dataset excludes questions that require visual reasoning (multimodality).
  • Potential Data Contamination: Being a classic benchmark, it may be over-represented in training sets.
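The contamination concern in the last bullet is often probed with a rough n-gram overlap check between benchmark questions and a sample of training text. The sketch below is a minimal version of that idea; the function names, the 8-word window, and the sample strings are illustrative assumptions, and a high ratio only flags a question for manual review rather than proving leakage.

```python
def ngrams(text, n):
    """Set of word n-grams (as tuples) from whitespace-split, lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(question, corpus_text, n=8):
    """Fraction of the question's n-grams that also appear in corpus_text."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n words: nothing to compare
    return len(question_grams & ngrams(corpus_text, n)) / len(question_grams)

question = "Which property of a mineral can be determined just by looking at it?"

# Hypothetical training-text samples, made up for illustration.
leaked = ("quiz dump: which property of a mineral can be determined "
          "just by looking at it? (a) luster")
clean = "minerals are classified by hardness, luster, streak, and cleavage"

print(overlap_ratio(question, leaked))  # every 8-gram of the question is present
print(overlap_ratio(question, clean))   # no shared 8-grams
```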

When to use it

Use ARC when you want a rigorous evaluation of an LLM's general reasoning abilities beyond simple factoid retrieval. It is especially useful for testing "small" models (SLMs) to see if they possess emergent reasoning capabilities.

When not to use it

Do not use it as a sole metric for specialized domains (like law or medicine) or for testing code generation/mathematical proof capabilities. For vision-based reasoning, use benchmarks like MMMU.

Technical examples

Running Evaluation (LM Eval Harness)

The most common way to run ARC is via the lm-evaluation-harness.

# Run ARC-Challenge in 0-shot mode (set explicitly rather than relying on the default)
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B \
    --tasks arc_challenge \
    --num_fewshot 0 \
    --device cuda:0 \
    --batch_size 8
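For multiple-choice tasks such as ARC, the harness scores each answer option by the log-likelihood the model assigns to it and reports plain accuracy (acc) alongside a length-normalized variant (acc_norm). The sketch below shows why the two can disagree. It runs no model: the log-likelihoods are made-up numbers, the helper name is hypothetical, and dividing by the character length of each option is an assumption that should be checked against the harness version in use.

```python
def pick(choices, loglikelihoods, normalize=False):
    """Return the index of the highest-scoring choice.

    With normalize=True, each log-likelihood is divided by the option's
    character length, so longer options are not penalized for length alone.
    """
    scores = [
        ll / len(text) if normalize else ll
        for text, ll in zip(choices, loglikelihoods)
    ]
    return max(range(len(choices)), key=scores.__getitem__)

choices = ["luster", "mass", "weight", "hardness"]
loglikelihoods = [-6.0, -5.0, -7.5, -9.0]  # made-up values, not model output

raw_choice = choices[pick(choices, loglikelihoods)]
norm_choice = choices[pick(choices, loglikelihoods, normalize=True)]
print(raw_choice, norm_choice)
```

Here the raw score favors the shortest option ("mass"), while the per-character score favors "luster", which is why reported ARC numbers should always state which of the two metrics they use.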

Example Question (ARC-Challenge)

The following is representative of the multi-step reasoning required:

"Which property of a mineral can be determined just by looking at it?" (A) luster (B) mass (C) weight (D) hardness

Reasoning: Mass and weight require measurement tools. Hardness requires a scratch test. Luster is the only visual property.
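To make the item format concrete, the question above can be written as a record and turned into a zero-shot prompt. The field names (question, choices with parallel text and label lists, answerKey) follow the schema of the Hugging Face allenai/ai2_arc release as understood here; the prompt template and helper names are illustrative assumptions, not a canonical format.

```python
record = {
    "question": "Which property of a mineral can be determined just by looking at it?",
    "choices": {
        "text": ["luster", "mass", "weight", "hardness"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "A",
}

def format_prompt(rec):
    """Render a record as a plain zero-shot multiple-choice prompt."""
    lines = [f"Question: {rec['question']}"]
    for label, text in zip(rec["choices"]["label"], rec["choices"]["text"]):
        lines.append(f"({label}) {text}")
    lines.append("Answer:")
    return "\n".join(lines)

def is_correct(rec, model_output):
    """Compare the first non-space character of the output with the gold label."""
    predicted = model_output.strip()[:1].upper()
    return predicted == rec["answerKey"]

prompt = format_prompt(record)
print(prompt)
print(is_correct(record, " A"))
```

Some items in the dataset reportedly use numeric labels rather than letters, so comparing against the labels as stored in each record is safer than hard-coding A through D.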

Licensing and cost

  • Open Data: Yes (CC BY-SA 4.0).
  • Cost: Free to download and use.

Contribution Metadata

  • Last reviewed: 2026-05-19
  • Confidence: high