New Sources Log — 2026-03-20

Benchmarking

Title	URL	Tags	Status	Canonical Page	Notes
OpenCompass	https://opencompass.org.cn/	tool	integrated	OpenCompass	Discovered in docs/tools/benchmarking/lm-evaluation-harness.md
HELM	https://crfm.stanford.edu/helm/lite/	tool	integrated	HELM	Discovered in docs/tools/benchmarking/lm-evaluation-harness.md
MMLU	https://github.com/hendrycks/test	tool	integrated	2026-04-07	Discovered in docs/tools/benchmarking/humanitys-last-exam.md
EvalPlus	https://github.com/evalplus/evalplus	tool	integrated	EvalPlus	Discovered in docs/tools/benchmarking/mbpp.md
AlpacaEval	https://github.com/tatsu-lab/alpaca_eval	tool	integrated	2026-04-07	Discovered in docs/tools/benchmarking/chatbot-arena.md
MT-Bench	https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge	tool	integrated	2026-04-07	Discovered in docs/tools/benchmarking/chatbot-arena.md
InterCode	https://github.com/princeton-nlp/intercode	tool	integrated	2026-04-07	Discovered in docs/tools/benchmarking/terminal-bench.md
Simple `time` command with `curl`	https://github.com/ollama/ollama/blob/main/docs/api.md	tool	integrated	2026-04-07	Discovered in docs/tools/benchmarking/ollama-benchmark-cli.md
MATH Benchmark	https://github.com/hendrycks/math	tool	integrated	2026-04-07	Discovered in docs/tools/benchmarking/gsm8k.md
ASDiv	https://github.com/chiahsuan/ASDiv	tool	integrated	2026-04-07	Discovered in docs/tools/benchmarking/gsm8k.md
ARC (AI2 Reasoning Challenge)	https://github.com/allenai/ARC-benchmark	tool	integrated	2026-04-07	Discovered in docs/tools/benchmarking/gpqa.md
BigCodeBench	https://github.com/bigcode-project/bigcodebench	tool	integrated	2026-04-07	Discovered in docs/tools/benchmarking/human-eval.md