MT-Bench

What it is

MT-Bench is a benchmark designed to evaluate the multi-turn conversational capabilities of Large Language Models (LLMs). It consists of 80 high-quality questions, each comprising an opening prompt and one follow-up turn, spanning eight categories: writing, roleplay, extraction, reasoning, math, coding, knowledge I (STEM), and knowledge II (humanities/social science).
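The two-turn structure can be illustrated with a small sketch. The field names (question_id, category, turns) mirror the JSONL layout used in the FastChat repository, but the record below is a made-up example, not an actual benchmark item:

```python
# Illustrative sketch of an MT-Bench-style question record.
# Field names follow the FastChat question.jsonl layout; the content
# here is a hypothetical example for demonstration only.
question = {
    "question_id": 101,  # hypothetical ID
    "category": "writing",
    "turns": [
        "Compose an engaging travel blog post about a recent trip.",
        "Rewrite your previous response so every sentence starts with the letter A.",
    ],
}

# Every MT-Bench question has exactly two turns; the second
# deliberately depends on the model's answer to the first.
assert len(question["turns"]) == 2
```

The follow-up turn is what distinguishes the format: a model cannot answer it well without retaining and reusing its own first response.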

What problem it solves

Many traditional benchmarks only evaluate single-turn responses, failing to capture a model's ability to maintain context, follow instructions across multiple exchanges, and handle the dynamic nature of real-world conversations. MT-Bench fills this gap by requiring each model to answer a follow-up that depends on its own first response.

Where it fits in the stack

Benchmarking. It is a core component of the LMSYS FastChat evaluation framework, providing a more rigorous test of conversational flow than single-turn evaluations.
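The overall evaluation loop is simple: for each question, generate an answer to each turn (feeding the conversation history back in), score each turn with a judge, and average. The sketch below is a minimal illustration of that flow; `generate_reply` and `judge_score` are hypothetical stand-ins, not FastChat APIs:

```python
# Minimal sketch of the MT-Bench evaluation loop.
# generate_reply and judge_score are hypothetical placeholders:
# in practice the former is a model call and the latter an LLM judge.

def generate_reply(history, user_turn):
    """Stand-in for a model call; echoes the turn for demonstration."""
    return f"answer to: {user_turn}"

def judge_score(question_turn, answer):
    """Stand-in for an LLM judge; returns a fixed 1-10 score here."""
    return 8

def evaluate(questions):
    scores = []
    for q in questions:
        history = []
        for turn in q["turns"]:
            reply = generate_reply(history, turn)
            history.extend([turn, reply])  # carry context into the next turn
            scores.append(judge_score(turn, reply))
    return sum(scores) / len(scores)

sample = [{"category": "math", "turns": ["What is 2+2?", "Now multiply that by 3."]}]
print(evaluate(sample))
```

The key design point is that the second turn is answered with the first exchange in context, which is exactly what single-turn benchmarks fail to exercise.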

Typical use cases

  • Conversational AI Evaluation: Assessing how well a chatbot handles follow-up questions and maintains context.
  • Model Comparison: Ranking chat-tuned models based on their ability to handle complex, multi-step instructions.
  • LLM-as-a-Judge Validation: MT-Bench is often used with GPT-4 as a judge to provide automated, scalable scoring that correlates with human judgment.
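In single-answer grading mode, the judge prompt in FastChat's llm_judge code asks the judge to emit its score wrapped in double brackets, e.g. "Rating: [[8]]", so scores can be extracted mechanically. A minimal sketch of that extraction step:

```python
import re

def parse_rating(verdict: str):
    """Extract a 1-10 rating from a judge verdict such as 'Rating: [[8]]'.

    The double-bracket convention matches the single-answer grading
    prompt used in FastChat's llm_judge code. Returns None when no
    bracketed rating is present (e.g. a malformed judge response).
    """
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    return float(match.group(1)) if match else None

print(parse_rating("The response is clear and helpful. Rating: [[8]]"))  # 8.0
print(parse_rating("No usable rating here."))  # None
```

Handling the no-match case matters in practice, since judges occasionally ignore the required output format.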

Strengths

  • Multi-turn Focus: Specifically designed to test conversation depth.
  • Diverse Categories: Covers a wide range of tasks from coding to roleplay.
  • Strong Human Correlation: GPT-4 based scoring on MT-Bench shows over 80% agreement with human experts.
  • Open Dataset: The questions and human judgments are publicly available for research.

Limitations

  • Judge Bias: If using an LLM as a judge, it may inherit the biases of that judge (e.g., preference for certain styles or lengths).
  • Scale: With 80 questions, it is smaller than some "massive" benchmarks, though the multi-turn nature adds complexity.
  • Static Nature: Like all fixed benchmarks, it risks data contamination over time.

When to use it

  • When evaluating chat-tuned models where multi-turn interaction is a primary use case.
  • When you need an automated conversational benchmark that aligns closely with human preference.

When not to use it

  • For evaluating base (non-chat-tuned) models that are not designed for dialogue.
  • When you only need to measure narrow technical capabilities like raw code execution or mathematical proof (use specialized benchmarks instead).

Contribution Metadata

  • Last reviewed: 2026-04-08
  • Confidence: high