Workflow: lm-sys FastChat MT-Bench Evaluation

Knowledge Sources	FastChat, Judging LLM-as-a-Judge, MT-Bench Leaderboard
Domains	LLMs, Evaluation, Benchmarking
Last Updated	2026-02-07 04:00 GMT

Overview

End-to-end process for evaluating language models on MT-bench using LLM-as-a-judge, encompassing answer generation, automated GPT-4 grading, and score reporting.

Description

This workflow implements the MT-bench evaluation pipeline, a standardized benchmark for assessing chat assistant quality. MT-bench consists of 80 challenging multi-turn questions across 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities). The evaluation process generates model answers to these questions, then uses GPT-4 as an automated judge. Three grading modes are supported: single-answer scoring (default), in which each response is scored on a scale of 1-10; pairwise comparison against a baseline; and pairwise comparison across all model pairs. The pipeline supports parallel answer generation across multiple GPUs and concurrent API calls for judging.
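To make the benchmark's shape concrete, the following minimal sketch builds placeholder question records and checks the counts quoted above. The record fields (question_id, category, turns) and the even split of 10 questions per category are assumptions made for illustration; the prompts are dummy strings, not real MT-bench content.

```python
from collections import Counter

# The eight MT-bench categories named in the description above.
CATEGORIES = ["writing", "roleplay", "reasoning", "math",
              "coding", "extraction", "stem", "humanities"]


def summarize(questions):
    """Return (number of questions, number of turns, per-category counts)."""
    per_category = Counter(q["category"] for q in questions)
    total_turns = sum(len(q["turns"]) for q in questions)
    return len(questions), total_turns, per_category


# Hypothetical placeholder records: 10 two-turn questions per category.
questions = [
    {"question_id": i, "category": cat,
     "turns": ["first prompt", "follow-up prompt"]}
    for i, cat in enumerate((c for c in CATEGORIES for _ in range(10)), start=1)
]

n_questions, n_turns, per_category = summarize(questions)
print(n_questions, n_turns)  # 80 160
```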

Usage

Execute this workflow when you want to benchmark a new or fine-tuned language model against established baselines. This is the recommended evaluation method for models trained with FastChat. It is particularly useful for comparing fine-tuning strategies, measuring the impact of training data changes, or validating that a model meets quality thresholds before deployment.

Execution Steps

Step 1: Environment Setup

Install FastChat from source with the model_worker and llm_judge extras. The repository's llm_judge directory provides the MT-bench question set, judge prompt templates, and evaluation scripts. Set the OpenAI API key for GPT-4 judge access.

Key considerations:

Install: pip install -e ".[model_worker,llm_judge]"
An OpenAI API key, exported as the OPENAI_API_KEY environment variable, is required for the judging step (GPT-4 access)
Pre-generated model answers and judgments can be downloaded for comparison
The qa_browser.py tool allows interactive browsing of results
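Because a missing key only surfaces once judging starts, a quick pre-flight check can save a wasted generation run. The sketch below is one way to do that; the helper name require_openai_key is hypothetical and not part of FastChat, and it assumes the key is supplied through the OPENAI_API_KEY environment variable.

```python
import os


def require_openai_key(env=None):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; the judging step needs it."
        )
    return key


# Example with an explicit mapping instead of the real environment.
print(require_openai_key({"OPENAI_API_KEY": "sk-placeholder"}))
```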

Step 2: Generate Model Answers

Run the target model against all 80 MT-bench questions (160 turns total, as each question has 2 turns). The script loads the model, applies the correct conversation template based on model type, and generates responses. Answers are saved as JSONL files indexed by question ID.

Key considerations:

The model path can be a local directory or HuggingFace repo ID
A unique model-id is assigned for tracking results
Answers are saved to data/mt_bench/model_answer/[MODEL-ID].jsonl
Multi-GPU parallelism is supported via --num-gpus-per-model and --num-gpus-total
For faster generation, the model can be served with vLLM and queried through gen_api_answer.py instead
Temperature configuration varies by question category (e.g., 0.0 for math/coding, 0.7 for writing/roleplay)
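The considerations above can be sketched as a small helper that assembles the answer-generation command and predicts where the answers will land. The function names are hypothetical, the model path and ID are illustrative values only, and the divisibility check is a sanity check added for this sketch.

```python
import shlex


def build_gen_answer_cmd(model_path, model_id, gpus_per_model=1, gpus_total=1):
    """Assemble the argv list for gen_model_answer.py."""
    if gpus_total % gpus_per_model != 0:
        raise ValueError("gpus_total should be a multiple of gpus_per_model")
    return [
        "python", "gen_model_answer.py",
        "--model-path", model_path,
        "--model-id", model_id,
        "--num-gpus-per-model", str(gpus_per_model),
        "--num-gpus-total", str(gpus_total),
    ]


def answer_file(model_id, bench_dir="data/mt_bench"):
    """Path of the JSONL file the answers are written to."""
    return f"{bench_dir}/model_answer/{model_id}.jsonl"


# Illustrative values; substitute your own model path and ID.
cmd = build_gen_answer_cmd("lmsys/vicuna-7b-v1.3", "vicuna-7b-v1.3", 1, 2)
print(shlex.join(cmd))
print(answer_file("vicuna-7b-v1.3"))
```

Passing the list to subprocess.run(cmd, check=True) from the llm_judge directory would launch the run; building the list first keeps the flags easy to inspect and log.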

Step 3: Generate GPT-4 Judgments

Submit model answers to GPT-4 for automated evaluation. In the default single-answer mode, GPT-4 grades each response independently on a 1-10 scale with a detailed explanation. The judge prompt template depends on the question category and turn, and reference answers are included for categories with verifiable answers (math, reasoning, coding).
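Since the judge returns free text that contains both an explanation and a rating, the numeric score has to be parsed out. The sketch below shows one way to do this, assuming the judge is prompted to wrap its rating in double brackets such as [[7]]; the fallback single-bracket pattern, the range check, and the -1 sentinel for unparseable output are choices made for this illustration.

```python
import re

# Assumed rating formats: "[[7]]" preferred, "[7]" as a looser fallback.
PRIMARY = re.compile(r"\[\[(\d+\.?\d*)\]\]")
FALLBACK = re.compile(r"\[(\d+\.?\d*)\]")


def extract_score(judgment):
    """Return the 1-10 rating found in a judgment, or -1.0 if none is usable."""
    match = PRIMARY.search(judgment) or FALLBACK.search(judgment)
    if match is None:
        return -1.0
    score = float(match.group(1))
    # Treat anything outside the 1-10 scale as a failed parse.
    return score if 1.0 <= score <= 10.0 else -1.0


print(extract_score("The answer is accurate and concise. Rating: [[8]]"))  # 8.0
print(extract_score("The judge forgot to give a rating."))                 # -1.0
```

Using a sentinel rather than raising lets a batch run finish and makes failed judgments easy to filter out during aggregation.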

Key considerations:

Default mode: single-answer grading (recommended)
Alternative modes: pairwise-baseline (compare each model against a baseline model, GPT-3.5-Turbo by default) and pairwise-all (compare all model pairs)
Judgments are saved to data/mt_bench/model_judgment/gpt-4_single.jsonl
The --parallel flag controls concurrent API calls for throughput
Reference answers from GPT-4 are used for categories requiring factual accuracy
Each question yields separate judgments for turn 1 and turn 2; the turn-2 judgment is made with the full conversation as context
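The grading mode largely determines judging cost, so it can help to estimate the workload before choosing a --parallel value. The sketch below counts judgments per mode under the assumption of 80 questions with 2 turns each; the function name is hypothetical, and the real number of API calls may be higher, for instance if each pair is judged in both answer orders.

```python
from math import comb


def count_judgments(mode, n_models, n_questions=80, n_turns=2):
    """Estimate how many judgments a mode produces.

    For "pairwise-baseline", n_models excludes the baseline itself.
    """
    per_comparison = n_questions * n_turns
    if mode == "single":
        return n_models * per_comparison
    if mode == "pairwise-baseline":
        return n_models * per_comparison
    if mode == "pairwise-all":
        return comb(n_models, 2) * per_comparison
    raise ValueError(f"unknown mode: {mode}")


for mode in ("single", "pairwise-baseline", "pairwise-all"):
    print(mode, count_judgments(mode, n_models=4))
# single 640
# pairwise-baseline 640
# pairwise-all 960
```

The pairwise-all count grows quadratically with the number of models, which is one reason single-answer grading scales better for large comparisons.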

Step 4: Display Results

Aggregate judgment scores and display per-model results. In single-answer mode, the score display averages scores per model for the first turn, the second turn, and both turns combined, producing summary tables that enable model comparison. Results can be filtered to specific models.

Key considerations:

Show all scores: python show_result.py
Show specific models: python show_result.py --model-list model1 model2
Per-category breakdowns across the 8 MT-bench categories require joining judgments with the question categories by question ID
First-turn, second-turn, and overall average scores are reported separately
Pairwise modes report win rates instead of absolute scores
Radar plots can be generated for visual per-category comparison
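The aggregation for single-answer mode can be approximated in a few lines of plain Python. This is a minimal sketch, not the show_result.py implementation: the record fields (model, turn, score), the -1 sentinel for failed judgments, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_scores(judgments, model_list=None):
    """Average scores per model for turn 1, turn 2, and overall."""
    by_model = defaultdict(lambda: defaultdict(list))
    for j in judgments:
        if j["score"] == -1:
            continue  # skip judgments whose score could not be parsed
        if model_list is not None and j["model"] not in model_list:
            continue
        by_model[j["model"]][j["turn"]].append(j["score"])

    summary = {}
    for model, turns in by_model.items():
        all_scores = [s for scores in turns.values() for s in scores]
        summary[model] = {
            "turn_1": mean(turns[1]) if turns.get(1) else None,
            "turn_2": mean(turns[2]) if turns.get(2) else None,
            "average": mean(all_scores),
        }
    return summary


# Made-up sample judgments for two hypothetical models.
judgments = [
    {"model": "model-a", "turn": 1, "score": 8.0},
    {"model": "model-a", "turn": 2, "score": 6.0},
    {"model": "model-a", "turn": 1, "score": -1},
    {"model": "model-b", "turn": 1, "score": 9.0},
    {"model": "model-b", "turn": 2, "score": 7.0},
]
summary = summarize_scores(judgments)
print(summary["model-a"])  # {'turn_1': 8.0, 'turn_2': 6.0, 'average': 7.0}
```

In practice the judgments list would be loaded from the JSONL judgment file, one json.loads call per line.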
