Benchmarking

For a conceptual overview of model comparison platforms and evaluation metrics, see Model Comparison and Evaluation.

- Chatbot Arena
- DREAM Benchmark
- GPQA
- GSM8K
- HumanEval
- Humanity's Last Exam
- LangSmith
- LLMPerf
- LM Evaluation Harness
- LongCLI-Bench
- MBPP
- Ollama Benchmark CLI
- Pa Bench
- SWE-bench
- Terminal-Bench