EvalPlus
What it is
EvalPlus is a rigorous evaluation framework for Large Language Models (LLMs) focused on code generation (LLM4Code). It significantly expands existing benchmarks like HumanEval and MBPP with more comprehensive test cases to improve evaluation accuracy.
What problem it solves
Original coding benchmarks like HumanEval often have very few test cases, allowing fragile or incorrect code to pass. EvalPlus addresses this "under-testing" problem by adding 80x more tests to HumanEval and 35x more tests to MBPP, revealing model weaknesses that simpler benchmarks miss.
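The under-testing failure mode is easy to reproduce. The sketch below is purely illustrative (it is not taken from EvalPlus): a buggy `median` that is only correct for odd-length lists passes a sparse, HumanEval-style test suite, while an expanded suite with one extra edge case exposes it.

```python
def median(xs):
    """Buggy candidate solution: correct for odd-length lists only."""
    return sorted(xs)[len(xs) // 2]

# Sparse "base" tests, all odd-length -- the bug is invisible here.
base_tests = [([1, 2, 3], 2), ([5, 1, 9], 5)]

# Expanded "plus" tests add an even-length input, exposing the bug.
plus_tests = base_tests + [([1, 2, 3, 4], 2.5), ([7], 7)]

def passes(tests):
    return all(median(xs) == want for xs, want in tests)

print(passes(base_tests))  # True  -> looks correct under sparse testing
print(passes(plus_tests))  # False -> expanded tests reveal the defect
```

This is the same dynamic EvalPlus targets at scale: HumanEval+ and MBPP+ multiply the number of inputs so that solutions which merely memorize the visible tests no longer pass.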
Where it fits in the stack
Benchmarking. It is a specialized tool for deeply evaluating the code generation capabilities and efficiency of LLMs.
Typical use cases
- Rigorous Coding Evaluation: Testing a model's true coding ability beyond simple benchmarks.
- Fragility Detection: Identifying whether a model's generated code stays correct across many different inputs, or merely passes a handful of visible tests.
- Code Efficiency Benchmarking: Using the EvalPerf extension to measure the execution efficiency of LLM-generated code.
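A typical evaluation run is driven from the command line. The commands below follow the project's README at the time of writing; model name and flags are illustrative, so check `evalplus.evaluate --help` for the options available in your installed version.

```shell
# Install the framework (a hedged sketch; pin a version in practice)
pip install evalplus

# Generate and score samples on HumanEval+ in one step.
# --model is any backend-supported identifier (HF hub name shown here
# as an example); --greedy uses temperature-0 decoding.
evalplus.evaluate --model "meta-llama/Llama-3.1-8B-Instruct" \
                  --dataset humaneval \
                  --backend vllm \
                  --greedy
```

Swapping `--dataset humaneval` for `--dataset mbpp` targets MBPP+, and `--backend` selects among the supported inference providers listed under Strengths below.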
Strengths
- High Rigor: Expanded test suites (HumanEval+, MBPP+) significantly reduce false positives.
- Multi-backend Support: Supports evaluation via vLLM, Hugging Face, OpenAI, Anthropic, Gemini, and Ollama.
- Security: Supports safe code execution within Docker containers to protect the host system.
- Performance Evaluation: Includes EvalPerf for measuring code efficiency.
Limitations
- Focus: Primarily limited to Python and coding-specific tasks.
- Execution Cost: Running 80x more tests naturally takes more time and compute than the original benchmarks.
When to use it
- When you are developing or fine-tuning an LLM for code generation and need high-confidence correctness metrics.
- When you want to rank models by coding robustness and efficiency.
- When comparing against major industry models, many of which (e.g., Llama 3.1 and Qwen 2.5) report EvalPlus scores.
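The headline metric behind these rankings is pass@k, computed over the expanded test suites. The standard unbiased estimator (introduced with HumanEval in the Codex paper) can be sketched as follows; the function name and example numbers are illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated samples of which
    c pass all tests, estimate the probability that at least one of
    k randomly drawn samples is correct.

        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k must include
        # at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=5, k=1))  # 0.25 -- with 5/20 correct, pass@1 = c/n
```

Because EvalPlus runs the same estimator against far larger test suites, pass@k on HumanEval+ is typically lower than on the original HumanEval for the same model, which is exactly the false-positive reduction it is designed to expose.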
When not to use it
- For general knowledge or reasoning tasks (use MMLU or GPQA instead).
- For quick, non-rigorous evaluations of simple code snippets.
Licensing and cost
- Open Source: Yes (Apache 2.0)
- Cost: Free
- Self-hostable: Yes
Related tools / concepts
Sources / References
Contribution Metadata
- Last reviewed: 2026-03-21
- Confidence: high