Benchmarking

Standardized tests, leaderboards, and evaluation frameworks used to measure the performance, reasoning, and safety of AI models.

For a conceptual overview of model comparison platforms and evaluation metrics, see Model Comparison and Evaluation.

Contents

Each entry names a benchmark and summarizes what it measures.

  • ARC (AI2 Reasoning Challenge): Grade-school science questions requiring reasoning
  • ASDiv: Diverse math word problems
  • AlpacaEval: Simulated user preference for instruction-following
  • BigCodeBench: Practical coding capabilities across various tasks
  • Chatbot Arena: Crowdsourced, Elo-based leaderboard of LLM chat models (Elo-update sketch after this list)
  • DREAM Benchmark: Dialogue-based reading comprehension and reasoning
  • EvalPlus: Rigorous testing of code generation via test-case augmentation
  • GPQA: Graduate-level science questions that are hard for non-experts
  • GSM8K: Grade-school math word problems
  • HELM: Holistic Evaluation of Language Models (Stanford CRFM)
  • HumanEval: Python coding tasks (OpenAI), typically scored with pass@k (sketch after this list)
  • Humanity's Last Exam: Extremely difficult multidisciplinary reasoning test
  • InterCode: Interactive coding benchmarks (SQL, Bash, etc.)
  • JudgeGPT: Using LLMs as judges for model performance evaluation
  • LangSmith: Platform for testing and monitoring agentic applications
  • LLMPerf: Benchmarking LLM inference performance (latency, throughput)
  • LM Evaluation Harness: Unified framework for evaluating models on 200+ benchmarks (usage sketch after this list)
  • LongCLI-Bench: Performance and accuracy in long-context terminal tasks
  • MATH Benchmark: High-school competition-level math problems
  • MBPP: Mostly Basic Python Problems (Google)
  • MMLU: Massive Multitask Language Understanding (57 subjects)
  • MT-Bench: Multi-turn conversation capabilities
  • Ollama Benchmark CLI: Performance testing for local models running in Ollama
  • OpenCompass: Comprehensive evaluation platform for large models
  • PA Bench: Privacy and alignment benchmarking
  • Promptfoo: CLI-driven evaluation and red-teaming for prompts
  • SharpAI Security Benchmark: Measuring safety and vulnerability in AI systems
  • Supermetal Benchmark: Evaluation of hardware-specific AI performance
  • SWE-bench: Solving real-world GitHub issues (software engineering)
  • Terminal-Bench: Model performance in terminal and CLI environments
  • VAKRA Benchmark: Evaluating reasoning in complex low-resource scenarios
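
Chatbot Arena-style leaderboards rank models from pairwise human preference votes. As a minimal sketch of how an Elo-style rating update works (the K-factor of 32 and the example ratings are illustrative assumptions, and the live leaderboard's exact aggregation may differ), the core update is:

    def expected_score(rating_a: float, rating_b: float) -> float:
        """Probability that model A beats model B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
        """Return updated (rating_a, rating_b) after one head-to-head vote.
        score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
        expected_a = expected_score(rating_a, rating_b)
        new_a = rating_a + k * (score_a - expected_a)
        new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return new_a, new_b

    # Example: a 1000-rated model beats a 1100-rated model in a single vote.
    print(elo_update(1000.0, 1100.0, score_a=1.0))  # A gains about 20 points, B loses about 20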
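
Code benchmarks such as HumanEval, MBPP, and EvalPlus are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. A short sketch of the standard unbiased estimator (n samples per problem, c of which pass) follows; the sample counts in the example are made up for illustration.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
        if n - c < k:
            return 1.0  # every size-k sample set contains at least one passing completion
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 200 completions sampled per problem, 37 of them pass the tests.
    print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
    print(round(pass_at_k(n=200, c=37, k=10), 3))  # substantially higher than pass@1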
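
For running several of the listed benchmarks in one place, the LM Evaluation Harness exposes a CLI and a Python entry point. The sketch below assumes the v0.4+ Python API (lm_eval.simple_evaluate), a Hugging Face backend, and an arbitrarily chosen small model; argument names and registered task names can vary between versions, so treat it as illustrative rather than definitive.

    # Illustrative only: assumes lm-evaluation-harness >= 0.4 with the "hf" backend installed.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                                      # Hugging Face transformers backend
        model_args="pretrained=EleutherAI/pythia-160m",  # small model chosen only for a quick smoke test
        tasks=["gsm8k", "arc_challenge"],                # task names as registered in the harness
        num_fewshot=5,
        batch_size=8,
    )

    # results["results"] maps each task to its metrics (accuracy, exact match, etc.).
    for task, metrics in results["results"].items():
        print(task, metrics)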

Contribution Metadata

  • Last reviewed: 2026-05-12
  • Confidence: high