MT-Bench

What it is

MT-Bench is a benchmark designed to evaluate the multi-turn conversational capabilities of Large Language Models (LLMs). It consists of 80 high-quality questions, each comprising an opening prompt and one follow-up turn, spanning eight categories: writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science).
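The two-turn structure can be illustrated with a small sketch. The field names (question_id, category, turns) mirror the JSONL layout used in the FastChat repository, but the record below is a made-up example, not an actual benchmark item:

```python
# Illustrative sketch of an MT-Bench-style question record.
# Field names follow the FastChat question.jsonl layout; the content
# here is a hypothetical example for demonstration only.
question = {
    "question_id": 101,  # hypothetical ID
    "category": "writing",
    "turns": [
        "Compose an engaging travel blog post about a recent trip.",
        "Rewrite your previous response so every sentence starts with the letter A.",
    ],
}

# Every MT-Bench question has exactly two turns; the second
# deliberately depends on the model's answer to the first.
assert len(question["turns"]) == 2
```

The follow-up turn is what distinguishes the format: a model cannot answer it well without retaining and reusing its own first response.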

What problem it solves

Many traditional benchmarks only evaluate single-turn responses, failing to capture a model's ability to maintain context, follow instructions across multiple exchanges, and handle the dynamic nature of real-world conversations. MT-Bench fills this gap by requiring each model to answer a follow-up that depends on its own first response.

Where it fits in the stack

Benchmarking. It is a core component of the LMSYS FastChat evaluation framework, providing a more rigorous test of conversational flow than single-turn evaluations.
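The overall evaluation loop is simple: for each question, generate an answer to each turn (feeding the conversation history back in), score each turn with a judge, and average. The sketch below is a minimal illustration of that flow; `generate_reply` and `judge_score` are hypothetical stand-ins, not FastChat APIs:

```python
# Minimal sketch of the MT-Bench evaluation loop.
# generate_reply and judge_score are hypothetical placeholders:
# in practice the former is a model call and the latter an LLM judge.

def generate_reply(history, user_turn):
    """Stand-in for a model call; echoes the turn for demonstration."""
    return f"answer to: {user_turn}"

def judge_score(question_turn, answer):
    """Stand-in for an LLM judge; returns a fixed 1-10 score here."""
    return 8

def evaluate(questions):
    scores = []
    for q in questions:
        history = []
        for turn in q["turns"]:
            reply = generate_reply(history, turn)
            history.extend([turn, reply])  # carry context into the next turn
            scores.append(judge_score(turn, reply))
    return sum(scores) / len(scores)

sample = [{"category": "math", "turns": ["What is 2+2?", "Now multiply that by 3."]}]
print(evaluate(sample))
```

The key design point is that the second turn is answered with the first exchange in context, which is exactly what single-turn benchmarks fail to exercise.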

Typical use cases

  • Conversational AI Evaluation: Assessing how well a chatbot handles follow-up questions and maintains context.
  • Model Comparison: Ranking chat-tuned models based on their ability to handle complex, multi-step instructions.
  • LLM-as-a-Judge Validation: MT-Bench is often used with GPT-4 as a judge to provide automated, scalable scoring that correlates with human judgment.
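In single-answer grading mode, the judge prompt in FastChat's llm_judge code asks the judge to emit its score wrapped in double brackets, e.g. "Rating: [[8]]", so scores can be extracted mechanically. A minimal sketch of that extraction step:

```python
import re

def parse_rating(verdict: str):
    """Extract a 1-10 rating from a judge verdict such as 'Rating: [[8]]'.

    The double-bracket convention matches the single-answer grading
    prompt used in FastChat's llm_judge code. Returns None when no
    bracketed rating is present (e.g. a malformed judge response).
    """
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else None

print(parse_rating("The response is clear and helpful. Rating: [[8]]"))  # 8.0
print(parse_rating("No usable rating here."))  # None
```

Handling the no-match case matters in practice, since judges occasionally ignore the required output format.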

Strengths

  • Multi-turn Focus: Specifically designed to test conversation depth.
  • Diverse Categories: Covers a wide range of tasks from coding to roleplay.
  • Strong Human Correlation: GPT-4 based scoring on MT-Bench shows over 80% agreement with human experts.
  • Open Dataset: The questions and human judgments are publicly available for research.

Limitations

  • Judge Bias: If using an LLM as a judge, it may inherit the biases of that judge (e.g., preference for certain styles or lengths).
  • Scale: With 80 questions, it is smaller than some "massive" benchmarks, though the multi-turn nature adds complexity.
  • Static Nature: Like all fixed benchmarks, it risks data contamination over time.

When to use it

  • When evaluating chat-tuned models where multi-turn interaction is a primary use case.
  • When you need an automated conversational benchmark that aligns closely with human preference.

When not to use it

  • For evaluating base (non-chat-tuned) models that are not designed for dialogue.
  • When you only need to measure narrow technical capabilities like raw code execution or mathematical proof (use specialized benchmarks instead).

Contribution Metadata

  • Last reviewed: 2026-04-08
  • Confidence: high