Workflow:Lm sys FastChat MT Bench Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Evaluation, Benchmarking |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
End-to-end process for evaluating language models on MT-bench using LLM-as-a-judge, encompassing answer generation, automated GPT-4 grading, and score reporting.
Description
This workflow implements the MT-bench evaluation pipeline, a standardized benchmark for assessing chat assistant quality. MT-bench consists of 80 challenging multi-turn questions across 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities). The pipeline first generates model answers to these questions, then uses GPT-4 as an automated judge. Three grading modes are supported: single-answer scoring (default), in which each response receives a score from 1 to 10; pairwise comparison against a baseline model; and pairwise comparison across all model pairs. Answer generation can run in parallel across multiple GPUs, and judging can issue concurrent API calls.
Usage
Execute this workflow when you want to benchmark a new or fine-tuned language model against established baselines. This is the recommended evaluation method for models trained with FastChat. It is particularly useful for comparing fine-tuning strategies, measuring the impact of training data changes, or validating that a model meets quality thresholds before deployment.
Execution Steps
Step 1: Environment Setup
Install FastChat with the llm_judge extra, which includes the MT-bench question set, judge prompt templates, and evaluation scripts. Set the OpenAI API key for GPT-4 judge access.
Key considerations:
- Install: pip install -e ".[model_worker,llm_judge]"
- An OpenAI API key is required for the judging step (GPT-4 access)
- Pre-generated model answers and judgments can be downloaded for comparison
- The qa_browser.py tool allows interactive browsing of results
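Because the judging step fails late if the key is missing, it can help to check for it before generating any answers. The following is a minimal sketch, assuming the key is supplied through the standard `OPENAI_API_KEY` environment variable; the helper name `judge_key_available` is hypothetical and not part of FastChat.

```python
import os


def judge_key_available(env=None):
    """Return True if a non-empty OpenAI API key is present.

    `env` defaults to the process environment; passing a plain dict
    makes the check easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if not judge_key_available():
        print("OPENAI_API_KEY is not set; the GPT-4 judging step will fail.")
```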
Step 2: Generate Model Answers
Run the target model against all 80 MT-bench questions (160 turns in total, since each question has 2 turns). The script loads the model, applies the conversation template that matches the model type, and generates responses. Answers are written to one JSONL file per model, with each record identified by its question ID.
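To illustrate the answer file layout, here is a small sketch that loads JSONL text and indexes the records by question ID. It assumes each line is a JSON object carrying a `question_id` field; the other field in the sample (`model_id`) is illustrative only, and the function name is hypothetical.

```python
import json


def index_by_question_id(jsonl_text):
    """Parse JSONL text and return a dict keyed by each record's question ID."""
    records = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        records[record["question_id"]] = record
    return records


# Illustrative input, not real MT-bench output.
sample = (
    '{"question_id": 81, "model_id": "demo"}\n'
    "\n"
    '{"question_id": 82, "model_id": "demo"}\n'
)
answers = index_by_question_id(sample)
```

With an index like this, a missing question ID is easy to spot before judging begins.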
Key considerations:
- The model path can be a local directory or HuggingFace repo ID
- A unique model-id is assigned for tracking results
- Answers are saved to data/mt_bench/model_answer/[MODEL-ID].jsonl
- Multi-GPU parallelism is supported via --num-gpus-per-model and --num-gpus-total
- For faster generation, the model can be served through vLLM and queried with gen_api_answer.py instead
- Sampling temperature is set per question category (e.g., 0.0 for math and coding, 0.7 for creative categories such as writing and roleplay)
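The per-category temperature lookup and the output path convention above can be sketched as follows. The values for math, coding, and the creative categories come from the text; the remaining values, the fallback default, and both helper names are assumptions made for illustration and should be checked against the FastChat source.

```python
from pathlib import Path

# Per-category sampling temperatures. Only the math, coding, writing,
# and roleplay values are stated in this document; the rest are assumed.
TEMPERATURE_BY_CATEGORY = {
    "writing": 0.7,
    "roleplay": 0.7,
    "math": 0.0,
    "coding": 0.0,
    "reasoning": 0.0,   # assumed
    "extraction": 0.0,  # assumed
    "stem": 0.1,        # assumed
    "humanities": 0.1,  # assumed
}


def temperature_for(category, default=0.7):
    """Return the sampling temperature for a category (hypothetical helper)."""
    return TEMPERATURE_BY_CATEGORY.get(category, default)


def answer_file(model_id, bench_dir="data/mt_bench"):
    """Build the answer path data/mt_bench/model_answer/[MODEL-ID].jsonl."""
    return Path(bench_dir) / "model_answer" / f"{model_id}.jsonl"
```

Deterministic decoding (temperature 0.0) suits categories with a single correct answer, while a higher temperature leaves room for variety in open-ended writing.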
Step 3: Generate GPT-4 Judgments
Submit model answers to GPT-4 for automated evaluation. In the default single-answer mode, GPT-4 grades each response independently on a 1-10 scale with a detailed explanation. The judge uses category-specific prompt templates and can optionally use reference answers for factual categories (math, reasoning, coding).
Key considerations:
- Default mode: single-answer grading (recommended)
- Alternative modes: pairwise-baseline (compare against a baseline model, GPT-3.5-Turbo by default) and pairwise-all (compare every pair of models)
- Judgments are saved to data/mt_bench/model_judgment/gpt-4_single.jsonl
- The --parallel flag controls concurrent API calls for throughput
- GPT-4-generated reference answers are supplied to the judge for categories requiring factual accuracy (math, reasoning, coding)
- Each question is judged for both turn 1 and turn 2 independently
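In single-answer mode the numeric score has to be recovered from the judge's free-text explanation. The sketch below assumes the judge is prompted to end its verdict with the rating in double brackets, such as `Rating: [[7]]`; that output format, the pattern, and the function name are assumptions for illustration rather than a statement of how the FastChat scripts are written.

```python
import re

# Assumed verdict format: the rating appears in double square brackets.
RATING_PATTERN = re.compile(r"\[\[(\d+\.?\d*)\]\]")


def extract_rating(judgment_text):
    """Return the first bracketed rating as a float, or None if absent."""
    match = RATING_PATTERN.search(judgment_text)
    if match is None:
        return None
    return float(match.group(1))
```

Returning `None` instead of raising lets the caller decide whether to retry the judge call or exclude the record from the averages.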
Step 4: Display Results
Aggregate judgment scores and display per-model, per-category results. The score display computes average scores across all turns and categories, producing a summary table that enables model comparison. Results can be filtered to specific models.
Key considerations:
- Show all scores: python show_result.py
- Show specific models: python show_result.py --model-list model1 model2
- Scores are broken down by the 8 MT-bench categories
- First-turn and second-turn scores are reported separately
- Pairwise modes report win rates instead of absolute scores
- Radar plots can be generated for visual per-category comparison
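The aggregation behind the summary table can be sketched in plain Python. This is a minimal illustration, assuming each judgment record carries `model`, `turn`, and `score` fields and that unparsed scores are stored as `None`; it is not the show_result.py implementation, and the function name is hypothetical.

```python
from collections import defaultdict


def average_scores(judgments):
    """Return mean scores keyed by (model, turn) and by (model, "avg")."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for record in judgments:
        score = record.get("score")
        if score is None:
            continue  # skip judgments whose score could not be parsed
        per_turn_key = (record["model"], record["turn"])
        overall_key = (record["model"], "avg")
        for key in (per_turn_key, overall_key):
            sums[key] += score
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Illustrative records, not real MT-bench results.
demo = [
    {"model": "model1", "turn": 1, "score": 8},
    {"model": "model1", "turn": 2, "score": 6},
    {"model": "model1", "turn": 1, "score": None},
]
summary = average_scores(demo)
```

Here the overall average is taken over all scored turns, so it equals the mean of the per-turn averages only when both turns have the same number of valid judgments.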