| OpenCompass | https://opencompass.org.cn/ | tool | integrated | OpenCompass | Discovered in docs/tools/benchmarking/lm-evaluation-harness.md |
| HELM | https://crfm.stanford.edu/helm/lite/ | tool | integrated | HELM | Discovered in docs/tools/benchmarking/lm-evaluation-harness.md |
| MMLU | https://github.com/hendrycks/test | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/humanitys-last-exam.md |
| EvalPlus | https://github.com/evalplus/evalplus | tool | integrated | EvalPlus | Discovered in docs/tools/benchmarking/mbpp.md |
| AlpacaEval | https://github.com/tatsu-lab/alpaca_eval | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/chatbot-arena.md |
| MT-Bench | https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/chatbot-arena.md |
| InterCode | https://github.com/princeton-nlp/intercode | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/terminal-bench.md |
| Simple time command with curl | https://github.com/ollama/ollama/blob/main/docs/api.md | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/ollama-benchmark-cli.md |
| MATH Benchmark | https://github.com/hendrycks/math | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/gsm8k.md |
| ASDiv | https://github.com/chiahsuan/ASDiv | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/gsm8k.md |
| ARC (AI2 Reasoning Challenge) | https://github.com/allenai/ARC-benchmark | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/gpqa.md |
| BigCodeBench | https://github.com/bigcode-project/bigcodebench | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/human-eval.md |