Benchmarking

Standardized tests, leaderboards, and evaluation frameworks used to measure the performance, reasoning, and safety of AI models.

For a conceptual overview of model comparison platforms and evaluation metrics, see Model Comparison and Evaluation.

Contents

Each entry names a benchmark and summarizes what it measures.

  • ARC (AI2 Reasoning Challenge): Grade-school science questions requiring reasoning
  • ASDiv: Diverse math word problems
  • AlpacaEval: Simulated user preference for instruction-following
  • BigCodeBench: Practical coding capabilities across various tasks
  • Chatbot Arena: Crowdsourced, Elo-based leaderboard of LLM chat models (Elo-update sketch after this list)
  • DREAM Benchmark: Dialogue-based reading comprehension and reasoning
  • EvalPlus: Rigorous testing of code generation via test-case augmentation
  • GPQA: Graduate-level science questions that are hard for non-experts
  • GSM8K: Grade-school math word problems
  • HELM: Holistic Evaluation of Language Models (Stanford CRFM)
  • HumanEval: Python coding tasks (OpenAI), typically scored with pass@k (sketch after this list)
  • Humanity's Last Exam: Extremely difficult multidisciplinary reasoning test
  • InterCode: Interactive coding benchmarks (SQL, Bash, etc.)
  • JudgeGPT: Using LLMs as judges for model performance evaluation
  • LangSmith: Platform for testing and monitoring agentic applications
  • LLMPerf: Benchmarking LLM inference performance (latency, throughput)
  • LM Evaluation Harness: Unified framework for evaluating models on 200+ benchmarks (usage sketch after this list)
  • LongCLI-Bench: Performance and accuracy in long-context terminal tasks
  • MATH Benchmark: High-school competition-level math problems
  • MBPP: Mostly Basic Python Problems (Google)
  • MMLU: Massive Multitask Language Understanding (57 subjects)
  • MT-Bench: Multi-turn conversation capabilities
  • Ollama Benchmark CLI: Performance testing for local models running in Ollama
  • OpenCompass: Comprehensive evaluation platform for large models
  • PA Bench: Privacy and alignment benchmarking
  • Promptfoo: CLI-driven evaluation and red-teaming for prompts
  • SharpAI Security Benchmark: Measuring safety and vulnerability in AI systems
  • Supermetal Benchmark: Evaluation of hardware-specific AI performance
  • SWE-bench: Solving real-world GitHub issues (software engineering)
  • Terminal-Bench: Model performance in terminal and CLI environments
  • VAKRA Benchmark: Evaluating reasoning in complex low-resource scenarios
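
Chatbot Arena-style leaderboards rank models from pairwise human preference votes. As a minimal sketch of how an Elo-style rating update works (the K-factor of 32 and the example ratings are illustrative assumptions, and the live leaderboard's exact aggregation may differ), the core update is:

    def expected_score(rating_a: float, rating_b: float) -> float:
        """Probability that model A beats model B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
        """Return updated (rating_a, rating_b) after one head-to-head vote.
        score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
        expected_a = expected_score(rating_a, rating_b)
        new_a = rating_a + k * (score_a - expected_a)
        new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return new_a, new_b

    # Example: a 1000-rated model beats a 1100-rated model in a single vote.
    print(elo_update(1000.0, 1100.0, score_a=1.0))  # A gains about 20 points, B loses about 20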
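
Code benchmarks such as HumanEval, MBPP, and EvalPlus are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A short sketch of the standard unbiased estimator (n samples per problem, c of which pass) follows; the sample counts in the example are made up for illustration.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
        if n - c < k:
            return 1.0  # every size-k sample set contains at least one passing completion
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 completions sampled per problem, 37 of them pass the tests.
    print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
    print(round(pass_at_k(n=200, c=37, k=10), 3))  # substantially higher than pass@1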
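
For running several of the listed benchmarks in one place, the LM Evaluation Harness exposes a CLI and a Python entry point. The sketch below assumes the v0.4+ Python API (lm_eval.simple_evaluate), a Hugging Face backend, and an arbitrarily chosen small model; argument names and registered task names can vary between versions, so treat it as illustrative rather than definitive.

    # Illustrative only: assumes lm-evaluation-harness >= 0.4 with the "hf" backend installed.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                                      # Hugging Face transformers backend
        model_args="pretrained=EleutherAI/pythia-160m",  # small model chosen only for a quick smoke test
        tasks=["gsm8k", "arc_challenge"],                # task names as registered in the harness
        num_fewshot=5,
        batch_size=8,
    )

    # results["results"] maps each task to its metrics (accuracy, exact match, etc.).
    for task, metrics in results["results"].items():
        print(task, metrics)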

Contribution Metadata

  • Last reviewed: 2026-05-12
  • Confidence: high