| MMLU | https://github.com/hendrycks/test?update=2026-04-07 | tool | integrated | mmlu | Discovered in humanitys-last-exam.md |
| AlpacaEval | https://github.com/tatsu-lab/alpaca_eval?update=2026-04-07 | tool | integrated | alpaca-eval | Discovered in chatbot-arena.md |
| MT-Bench | https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge?update=2026-04-07 | tool | integrated | mt-bench | Discovered in chatbot-arena.md |
| InterCode | https://github.com/princeton-nlp/intercode?update=2026-04-07 | tool | integrated | intercode | Discovered in terminal-bench.md |
| Simple time command with curl | https://github.com/ollama/ollama/blob/main/docs/api.md?update=2026-04-07 | tool | integrated | ollama-benchmark-cli | Discovered in ollama-benchmark-cli.md |
| MATH Benchmark | https://github.com/hendrycks/math?update=2026-04-07 | tool | integrated | math-benchmark | Discovered in gsm8k.md |
| ASDiv | https://github.com/chiahsuan/ASDiv?update=2026-04-07 | tool | integrated | asdiv | Discovered in gsm8k.md |
| ARC (AI2 Reasoning Challenge) | https://github.com/allenai/ARC-benchmark?update=2026-04-07 | tool | integrated | arc | Discovered in gpqa.md |
| BigCodeBench | https://github.com/bigcode-project/bigcodebench?update=2026-04-07 | tool | integrated | bigcodebench | Discovered in human-eval.md |
| EvalPlus | https://github.com/evalplus/evalplus?update=2026-04-07 | repository | integrated | evalplus | Staged from previous logs. |
| HELM | https://crfm.stanford.edu/helm/lite/?update=2026-04-07 | tool | integrated | helm | Staged from previous logs. |
| OpenCompass | https://opencompass.org.cn/?update=2026-04-07 | tool | integrated | opencompass | Staged from previous logs. |