New Sources Log — 2026-03-20

Benchmarking

| Title | URL | Tags | Status | Canonical Page | Notes |
| --- | --- | --- | --- | --- | --- |
| OpenCompass | https://opencompass.org.cn/ | tool | integrated | OpenCompass | Discovered in docs/tools/benchmarking/lm-evaluation-harness.md |
| HELM | https://crfm.stanford.edu/helm/lite/ | tool | integrated | HELM | Discovered in docs/tools/benchmarking/lm-evaluation-harness.md |
| MMLU | https://github.com/hendrycks/test | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/humanitys-last-exam.md |
| EvalPlus | https://github.com/evalplus/evalplus | tool | integrated | EvalPlus | Discovered in docs/tools/benchmarking/mbpp.md |
| AlpacaEval | https://github.com/tatsu-lab/alpaca_eval | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/chatbot-arena.md |
| MT-Bench | https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/chatbot-arena.md |
| InterCode | https://github.com/princeton-nlp/intercode | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/terminal-bench.md |
| Simple time command with curl | https://github.com/ollama/ollama/blob/main/docs/api.md | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/ollama-benchmark-cli.md |
| MATH Benchmark | https://github.com/hendrycks/math | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/gsm8k.md |
| ASDiv | https://github.com/chiahsuan/ASDiv | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/gsm8k.md |
| ARC (AI2 Reasoning Challenge) | https://github.com/allenai/ARC-benchmark | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/gpqa.md |
| BigCodeBench | https://github.com/bigcode-project/bigcodebench | tool | integrated | 2026-04-07 | Discovered in docs/tools/benchmarking/human-eval.md |