
Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation

From Leeroopedia


Page Type: Principle
Title: MT Bench Answer Generation
Repository: lm-sys/FastChat
Knowledge Sources: Source code analysis of fastchat/llm_judge/gen_model_answer.py, fastchat/llm_judge/common.py
Domains: LLM Evaluation, Benchmarking, Multi-Turn Conversation
Last Updated: 2026-02-07 14:00 GMT
Implemented By: Implementation:Lm_sys_FastChat_Gen_Model_Answer

Overview

MT-Bench Answer Generation is the principle governing how candidate language models produce responses to a standardized set of multi-turn benchmark questions. MT-Bench (Multi-Turn Benchmark) is a curated evaluation suite comprising 80 questions spanning 8 distinct categories, designed to assess a model's ability to engage in coherent, accurate, and contextually appropriate multi-turn dialogue. The answer generation phase is the first step in the MT-Bench evaluation pipeline, where each model under evaluation independently produces answers that are later scored by an LLM judge.

Description

Question Categories

MT-Bench organizes its 80 questions into 8 categories, each targeting a different aspect of language model capability:

Category Description Temperature
writing Creative and structured writing tasks 0.7
roleplay Character and scenario-based interactions 0.7
extraction Information extraction from provided text 0.0
math Mathematical problem solving 0.0
coding Code generation and debugging 0.0
reasoning Logical and analytical reasoning 0.0
stem Science, technology, engineering, and mathematics questions 0.1
humanities History, philosophy, and social science topics 0.1

Two-Turn Conversation Structure

Each MT-Bench question is structured as a two-turn conversation. The first turn presents the initial question or task, and the second turn provides a follow-up that builds upon the first. This design tests the model's ability to:

  • Maintain context across multiple exchanges
  • Handle follow-up instructions that modify, extend, or refine the original task
  • Produce coherent and consistent responses within a dialogue flow

The model must process both turns sequentially, using the conversation template appropriate to its architecture (e.g., ChatML, Llama format, Vicuna format).
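The sequential two-turn loop can be sketched as follows. This is a minimal illustration, not FastChat's actual implementation: `generate_response` is a hypothetical stand-in for the model call, and the real script renders the accumulated history through the model's conversation template before each generation step.

```python
# Minimal sketch of the two-turn generation loop (not the actual FastChat code).
# `generate_response` is a hypothetical callable that maps a conversation
# history to a model reply; the real script builds the prompt via fastchat's
# conversation templates (ChatML, Llama, Vicuna, ...) before generating.

def answer_two_turns(question, generate_response, num_choices=1):
    """Generate `num_choices` two-turn answers for one MT-Bench question."""
    choices = []
    for i in range(num_choices):
        # The real script seeds per choice (torch.manual_seed(i)) so that
        # results are reproducible across runs.
        history = []    # accumulated (role, text) exchanges
        turns_out = []
        for user_turn in question["turns"]:
            history.append(("user", user_turn))
            # Turn 2 sees both the original question and the model's own
            # turn-1 answer, which is what makes the benchmark multi-turn.
            reply = generate_response(history)
            history.append(("assistant", reply))
            turns_out.append(reply)
        choices.append({"index": i, "turns": turns_out})
    return choices
```

Because the turn-1 reply is fed back into the context, a weak turn-1 answer compounds into turn 2, which is exactly the conversational coherence the benchmark aims to measure.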

Temperature Configuration Per Category

Temperature settings are not uniform across all categories. Categories requiring deterministic, factual, or precise outputs (extraction, math, coding, reasoning) use a temperature of 0.0 (greedy decoding). Categories that benefit from creative variation (writing, roleplay) use a temperature of 0.7. STEM and humanities use a moderate temperature of 0.1. This per-category temperature configuration ensures that evaluation conditions match the nature of the task.
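The per-category mapping can be expressed as a simple dictionary, mirroring the table above; FastChat keeps a mapping of this shape in fastchat/llm_judge/common.py, though the fallback helper below is illustrative.

```python
# Per-category sampling temperature, mirroring the table above.
temperature_config = {
    "writing": 0.7,
    "roleplay": 0.7,
    "extraction": 0.0,
    "math": 0.0,
    "coding": 0.0,
    "reasoning": 0.0,
    "stem": 0.1,
    "humanities": 0.1,
}

def temperature_for(category, default=0.7):
    # Illustrative helper: fall back to a default for categories outside
    # the standard eight (e.g. custom question sets).
    return temperature_config.get(category, default)
```

A temperature of 0.0 is treated as greedy decoding (sampling disabled), so the deterministic categories yield the same answer on every run for a given model.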

Parallel Model Evaluation with Ray

When multiple GPUs are available, the evaluation workload is distributed across GPU workers using Ray, a distributed computing framework. The system calculates the number of parallel workers as num_gpus_total // num_gpus_per_model. Questions are randomly shuffled to balance load across workers, then split into equal-sized chunks. Each worker independently loads the model and generates answers for its assigned subset of questions. If only a single worker is needed, Ray is not imported or initialized, keeping the dependency optional.
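The worker calculation and chunking step can be sketched as below. This is a simplified stand-in, assuming a seeded shuffle and ceiling-division chunking; the actual Ray submission (one remote task per chunk) is omitted.

```python
import random

def split_into_chunks(questions, num_gpus_total, num_gpus_per_model, seed=0):
    """Compute the worker count and split questions into per-worker chunks.

    Sketch of the load-balancing step described above; the real script
    then submits each chunk to a Ray remote task when num_workers > 1.
    """
    num_workers = num_gpus_total // num_gpus_per_model
    # Shuffle so that slow categories (e.g. long writing prompts) spread
    # evenly across workers instead of clustering in one chunk.
    questions = questions[:]
    random.Random(seed).shuffle(questions)
    chunk_size = (len(questions) + num_workers - 1) // num_workers  # ceil
    return [questions[i:i + chunk_size]
            for i in range(0, len(questions), chunk_size)]
```

With `num_gpus_total=4` and `num_gpus_per_model=2`, this yields two workers, each handling half of the 80 questions; with one worker the chunk list has a single entry and Ray never needs to be imported.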

Answer Deduplication by question_id

After all answers are generated (potentially from multiple parallel workers), the answer file undergoes a deduplication and sorting step. The reorg_answer_file function reads all answer records, retains only the last answer per question_id (effectively deduplicating), and writes the results back sorted by question_id. This ensures that:

  • Each question has exactly one answer in the final output
  • Re-running the evaluation for specific questions replaces rather than duplicates entries
  • The output file is deterministically ordered for reproducibility
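A sketch consistent with the behavior described, using last-write-wins dictionary insertion for deduplication and a sorted rewrite:

```python
import json

def reorg_answer_file(answer_file):
    """Deduplicate answers by question_id (keeping the last record)
    and rewrite the file sorted by question_id."""
    answers = {}
    with open(answer_file, "r") as fin:
        for line in fin:
            record = json.loads(line)
            # Later records overwrite earlier ones for the same question_id,
            # so re-runs replace rather than duplicate entries.
            answers[record["question_id"]] = record
    with open(answer_file, "w") as fout:
        for qid in sorted(answers):
            fout.write(json.dumps(answers[qid]) + "\n")
```

Because parallel workers append to the same file in nondeterministic order, this post-pass is what restores a canonical, reproducible output.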

Usage

MT-Bench Answer Generation is used as the first phase of the MT-Bench evaluation workflow:

  1. Prepare questions: Ensure the question file (data/mt_bench/question.jsonl) is available. Each line is a JSON object with question_id, category, and turns (a list of two strings).
  2. Generate answers: Run the answer generation script for each model to be evaluated. The script loads the model, iterates through all questions, applies category-specific temperature, generates two-turn responses, and writes JSONL output.
  3. Pass to judge: The generated answer files serve as input to the LLM judge evaluation phase (Principle:Lm_sys_FastChat_LLM_Judge_Evaluation).

Multiple models can be evaluated independently (and in parallel across different machines), since each model writes to its own answer file identified by model_id.
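The question record follows the fields named in step 1; the answer record below is illustrative, with fields beyond question_id, model_id, and choices being assumptions about the output shape rather than a guaranteed schema.

```python
import json

# Illustrative MT-Bench JSONL records. The question fields (question_id,
# category, turns) follow the description above; the answer-side fields
# other than question_id and model_id are assumptions for illustration.
question_line = json.dumps({
    "question_id": 81,
    "category": "writing",
    "turns": [
        "Compose an engaging travel blog post about a recent trip to Hawaii.",
        "Rewrite your previous response. Start every sentence with the letter A.",
    ],
})

answer_line = json.dumps({
    "question_id": 81,
    "model_id": "vicuna-7b-v1.5",  # hypothetical model identifier
    "choices": [
        {"index": 0, "turns": ["<turn-1 answer>", "<turn-2 answer>"]},
    ],
})
```

The shared question_id is what lets the judge phase pair each answer with its question (and, in pairwise mode, with a competing model's answer to the same question).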

Theoretical Basis

The design of MT-Bench Answer Generation draws on several evaluation principles:

  • Multi-turn evaluation: Single-turn benchmarks fail to capture conversational coherence. By requiring two turns, MT-Bench tests whether models can maintain context, follow instructions that build on prior responses, and avoid contradicting themselves.
  • Category-stratified assessment: Different capabilities require different evaluation conditions. Deterministic tasks (math, coding) need greedy decoding to assess peak capability, while creative tasks (writing, roleplay) benefit from sampling to reveal the model's generative range.
  • Reproducibility through seeding: Each choice index uses torch.manual_seed(i) to ensure that, given the same model and temperature, results are reproducible across runs.
  • Scalable evaluation: The Ray-based parallelization pattern allows the same evaluation script to run on a single GPU or scale to a multi-GPU cluster without code changes, following the principle of transparent scalability.
