OpenCompass¶
What it is¶
OpenCompass is an open-source platform for evaluating the capabilities of large language models (LLMs) and vision-language models (VLMs). It covers the full evaluation pipeline: dataset preparation, evaluation scripts, and leaderboards.
What problem it solves¶
Evaluating modern large models is complex, requiring diverse datasets and multiple evaluation paradigms (e.g., zero-shot, few-shot, CoT). OpenCompass standardizes this process, providing a reproducible and extensible framework that supports over 100 datasets and various model backends. It addresses the fragmentation of evaluation criteria by providing a unified interface for cross-domain and large-scale model evaluation.
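To make the evaluation paradigms mentioned above concrete, here is a minimal sketch of how the same question might be turned into a zero-shot, few-shot, or CoT prompt. The `build_prompt` helper and the `Q:`/`A:` layout are illustrative assumptions, not OpenCompass's own prompt templates.

```python
from typing import Sequence, Tuple


def build_prompt(
    question: str,
    examples: Sequence[Tuple[str, str]] = (),
    chain_of_thought: bool = False,
) -> str:
    """Assemble a prompt for zero-shot, few-shot, or CoT evaluation (hypothetical helper)."""
    parts = []
    # Few-shot: prepend solved examples. Zero-shot simply passes no examples.
    for ex_question, ex_answer in examples:
        parts.append(f"Q: {ex_question}\nA: {ex_answer}")
    # CoT: end the prompt with a cue that asks the model to reason first.
    cue = "Let's think step by step." if chain_of_thought else ""
    parts.append(f"Q: {question}\nA: {cue}".rstrip())
    return "\n\n".join(parts)


zero_shot = build_prompt("What is 2 + 3?")
few_shot = build_prompt("What is 2 + 3?", examples=[("What is 1 + 1?", "2")])
cot = build_prompt("What is 2 + 3?", chain_of_thought=True)
```

The three prompts differ only in what surrounds the question, which is why a framework can treat the paradigm as a configuration choice rather than a separate pipeline.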
Where it fits in the stack¶
Category: Benchmarking. It serves as an evaluation toolkit and platform for comparing model performance across language, knowledge, reasoning, coding, and multimodal tasks.
Typical use cases¶
- Model Development: Benchmarking in-house models against industry standards (e.g., Qwen 3.5, InternVL-U) during training.
- Model Selection: Comparing open-source or API-based models (e.g., GPT-5.2, Claude 4.6) to find the best fit for a specific application.
- VLM & Image Evaluation: Using the GenEditEvalKit (released 2026) to evaluate image generation and editing models across multiple benchmarks.
- Academic Research: Reproducing evaluation results for papers and contributing new datasets to the community via the OpenCompass Academic Leaderboard.
Strengths¶
- Comprehensive Coverage: Supports 100+ datasets, including IFEval, MMLU-Pro, and GPQA.
- Flexible Architecture: Supports multiple evaluation paradigms, including Zero-shot, Few-shot, CoT, and LLM-as-a-judge (CompassJudger).
- High Concurrency: Integrates with inference acceleration backends such as vLLM and LMDeploy for distributed, high-speed evaluation, and can pull datasets from ModelScope.
- Unified Multimodal Support: Enhanced support for Unified Multimodal Models (UMMs) and vision-language tasks.
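The LLM-as-a-judge paradigm listed above can be sketched as a pairwise comparison loop. This is a simplified illustration, not how CompassJudger is invoked: `win_rate` and `length_judge` are hypothetical names, and the judge here is a deterministic stand-in so the example runs without a model.

```python
from typing import Callable, Sequence


def win_rate(
    questions: Sequence[str],
    answers_a: Sequence[str],
    answers_b: Sequence[str],
    judge: Callable[[str, str, str], str],
) -> float:
    """Return the fraction of questions where the judge prefers answer A."""
    if not (len(questions) == len(answers_a) == len(answers_b)):
        raise ValueError("all inputs must have the same length")
    if not questions:
        return 0.0
    wins = 0
    for question, a, b in zip(questions, answers_a, answers_b):
        # The judge is expected to reply with "A" or "B".
        verdict = judge(question, a, b).strip().upper()
        wins += verdict == "A"
    return wins / len(questions)


def length_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical stand-in for a judge model: prefers the longer answer."""
    return "A" if len(answer_a) >= len(answer_b) else "B"


rate = win_rate(
    ["What is the capital of France?", "Name a prime number."],
    ["The capital of France is Paris.", "7"],
    ["Paris", "Seven is a prime number."],
    length_judge,
)
```

In practice the judge would be a call to a strong model, and each pair is often scored twice with the answer order swapped to reduce position bias.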
Limitations¶
- Complexity: The extensive configuration system (based on MMEngine) has a steep learning curve.
- Resource Intensive: Running full-scale evaluations on frontier models requires significant local compute or API credits.
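A rough budget estimate can help before launching a large run. The sketch below computes an upper bound on tokens processed and a corresponding API cost. Every number in it (sample count, prompt length, price per million tokens) is a made-up placeholder, and the function names are hypothetical.

```python
def estimate_token_budget(num_samples: int, avg_prompt_tokens: int, max_out_len: int) -> int:
    """Upper bound on tokens: assumes every sample uses its full output budget."""
    return num_samples * (avg_prompt_tokens + max_out_len)


def estimate_api_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into a cost at a flat (assumed) per-token price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder inputs for illustration only.
tokens = estimate_token_budget(num_samples=10_000, avg_prompt_tokens=500, max_out_len=1024)
cost = estimate_api_cost(tokens, usd_per_million_tokens=2.0)
print(f"{tokens:,} tokens, about ${cost:.2f}")
```

Real providers usually price input and output tokens differently, so treat this as an order-of-magnitude check only.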
When to use it¶
- When you need a standardized, reproducible way to evaluate models across dozens of dimensions.
- When you are evaluating vision-language models (VLMs) or image generation models.
- When contributing to or comparing against public leaderboards (CompassRank).
When not to use it¶
- For very simple, single-task evaluations where a lightweight script might suffice.
- If you only need to evaluate basic RAG performance (consider RAGAS or DeepEval).
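For the simple single-task case mentioned above, a lightweight script might be as small as the following exact-match scorer. This is a sketch under the assumption that predictions have already been collected; `normalize` and `exact_match_accuracy` are illustrative names, not part of any library.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


predictions = ["Paris", " 42 ", "blue whale"]
references = ["paris", "42", "Elephant"]
accuracy = exact_match_accuracy(predictions, references)
```

Once an evaluation needs multiple datasets, prompt templates, or several models, a framework starts to pay for its configuration overhead.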
Getting started¶
Installation¶
It is recommended to use a Conda environment for dependency management.
conda create --name opencompass python=3.10 -y
conda activate opencompass
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -e .
Dataset Preparation¶
Datasets are managed centrally in the data/ directory.
# Download core datasets (2026 baseline)
python tools/download_dataset.py --dataset core
Hello-world Evaluation¶
Evaluate a small model (e.g., OPT-125M) on standard benchmarks:
# Evaluate OPT-125M on MMLU and GSM8K
python run.py --models hf_opt_125m --datasets mmlu_gen gsm8k_gen
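When scripting several runs, it can be convenient to assemble the command above programmatically. The sketch below only builds the argument list from the `--models` and `--datasets` flags shown in the command; `build_run_command` is a hypothetical helper, and nothing is executed.

```python
import shlex
import sys
from typing import List, Sequence


def build_run_command(
    models: Sequence[str], datasets: Sequence[str], script: str = "run.py"
) -> List[str]:
    """Build the argv list for an evaluation run (hypothetical helper)."""
    if not models or not datasets:
        raise ValueError("at least one model and one dataset are required")
    # Use the current interpreter so the active environment is reused.
    return [sys.executable, script, "--models", *models, "--datasets", *datasets]


command = build_run_command(["hf_opt_125m"], ["mmlu_gen", "gsm8k_gen"])
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root to launch the evaluation.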
Configuration example (Python-style)¶
OpenCompass uses a modular configuration system. Below is an example for evaluating a HuggingFace model:
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    # Inherit dataset configurations
    from .datasets.mmlu.mmlu_gen import mmlu_datasets
    from .datasets.gsm8k.gsm8k_gen import gsm8k_datasets

datasets = [*mmlu_datasets, *gsm8k_datasets]

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-4-8b-hf',
        path='meta-llama/Llama-4-8B-Instruct',
        tokenizer_path='meta-llama/Llama-4-8B-Instruct',
        model_kwargs=dict(device_map='auto'),
        max_seq_len=4096,
        max_out_len=1024,
        batch_size=16,
        run_cfg=dict(num_gpus=1),
    )
]
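Because each model entry is a plain Python dict, several entries can be generated from a shared template. The sketch below mirrors the fields used in the configuration above but runs standalone: the `type` value is a placeholder string (a real config passes the class), and `make_model_cfg` and the `example-org/...` paths are hypothetical.

```python
from copy import deepcopy

# Shared defaults, copied from the fields of the example model entry.
BASE_MODEL_CFG = dict(
    type="HuggingFaceCausalLM",  # placeholder; a real config passes the imported class
    model_kwargs=dict(device_map="auto"),
    max_seq_len=4096,
    max_out_len=1024,
    batch_size=16,
    run_cfg=dict(num_gpus=1),
)


def make_model_cfg(abbr: str, path: str, **overrides) -> dict:
    """Return an independent model entry based on the shared defaults."""
    cfg = deepcopy(BASE_MODEL_CFG)  # deep copy so nested dicts are not shared
    cfg.update(abbr=abbr, path=path, tokenizer_path=path)
    cfg.update(overrides)
    return cfg


models = [
    make_model_cfg("model-a-hf", "example-org/model-a"),
    make_model_cfg("model-b-hf", "example-org/model-b", batch_size=8),
]
```

The deep copy matters here: without it, every entry would share the same nested `model_kwargs` and `run_cfg` dicts, and a change to one would silently affect the others.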
Advanced: Image Generation Evaluation (GenEditEvalKit)¶
GenEditEvalKit, released in 2026, evaluates image generation and editing models:
# Evaluate an image generation model on GenEdit benchmarks
python GenEditEvalKit/run.py --models stable-diffusion-3 --benchmarks GEdit
Licensing and cost¶
- Open Source: Yes (Apache 2.0)
- Cost: Free software (compute/API costs apply)
Contribution Metadata¶
- Last reviewed: 2026-05-28
- Confidence: high