Benchmarking¶
Standardized tests, leaderboards, and evaluation frameworks used to measure the performance, reasoning, and safety of AI models.
For a conceptual overview of model comparison platforms and evaluation metrics, see Model Comparison and Evaluation.
Contents¶
| Benchmark | What it measures |
|---|---|
| ARC (AI2 Reasoning Challenge) | Grade-school science questions requiring reasoning |
| ASDiv | Diverse math word problems |
| AlpacaEval | Simulated user preference for instruction-following |
| BigCodeBench | Practical coding capabilities across various tasks |
| Chatbot Arena | Crowdsourced Elo-based leaderboard of LLM chat performance |
| DREAM Benchmark | Dialogue-based reading comprehension and reasoning |
| EvalPlus | Rigorous testing of code generation via test-case augmentation |
| GPQA | Graduate-level science questions that are hard for non-experts |
| GSM8K | Grade-school math word problems |
| HELM | Holistic Evaluation of Language Models (Stanford CRFM) |
| HumanEval | Python coding tasks (OpenAI) |
| Humanity's Last Exam | Extremely difficult multidisciplinary reasoning test |
| InterCode | Interactive coding benchmarks (SQL, Bash, etc.) |
| JudgeGPT | Using LLMs as judges for model performance evaluation |
| LangSmith | Platform for testing and monitoring agentic applications |
| LLMPerf | Benchmarking LLM inference performance (latency, throughput) |
| LM Evaluation Harness | Unified framework for evaluating models on 200+ benchmarks |
| LongCLI-Bench | Performance and accuracy in long-context terminal tasks |
| MATH Benchmark | High-school competition-level math problems |
| MBPP | Mostly Basic Python Problems (Google) |
| MMLU | Massive Multitask Language Understanding (57 subjects) |
| MT-Bench | Multi-turn conversation capabilities |
| Ollama Benchmark CLI | Performance testing for local models running in Ollama |
| OpenCompass | Comprehensive evaluation platform for large models |
| Pa Bench | Privacy and alignment benchmarking |
| Promptfoo | CLI-driven evaluation and red-teaming for prompts |
| SharpAI Security Benchmark | Measuring safety and vulnerability in AI systems |
| Supermetal Benchmark | Evaluation of hardware-specific AI performance |
| SWE-bench | Resolving real-world GitHub issues (software engineering) |
| Terminal-Bench | Model performance in terminal and CLI environments |
| VAKRA Benchmark | Evaluating reasoning in complex low-resource scenarios |
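Most of the static benchmarks above (e.g. GSM8K, HumanEval, MMLU) reduce to scoring model outputs against reference answers with a task-specific metric, while Chatbot Arena instead aggregates crowdsourced pairwise votes into Elo ratings. The sketch below illustrates both ideas in minimal form; it is not the implementation of any harness listed here, and the helper names (`exact_match_accuracy`, `elo_update`) are hypothetical.

```python
# Illustrative sketch only: hypothetical helpers, not the API of any
# specific benchmark or harness listed above.

def exact_match_accuracy(predictions, references):
    """GSM8K-style scoring: fraction of predictions that exactly match
    the reference answer after light normalization."""
    normalize = lambda s: s.strip().lower()
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)

def elo_update(rating_a, rating_b, a_wins, k=32):
    """One Elo update for a single pairwise vote between models A and B,
    in the spirit of crowdsourced chat leaderboards."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

if __name__ == "__main__":
    print(exact_match_accuracy(["42", "7"], ["42", "8"]))   # 0.5
    print(elo_update(1000.0, 1000.0, a_wins=True))          # (1016.0, 984.0)
```

In practice, production leaderboards typically fit ratings over the full set of votes at once (e.g. with a Bradley-Terry-style model) rather than applying one online update per vote, but the pairwise-preference intuition is the same.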
Related tools / concepts¶
- Model Comparison & Evaluation
- Process & Understanding (Observability)
- Agentic Workflows Patterns
- AI Signal Sources
Contribution Metadata¶
- Last reviewed: 2026-05-12
- Confidence: high