Principle:Lm_sys_FastChat_MT_Bench_Result_Display
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | MT Bench Result Display |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source code analysis of fastchat/llm_judge/show_result.py |
| Domains | LLM Evaluation, Data Analysis, Result Aggregation |
| Last Updated | 2026-02-07 14:00 GMT |
| Implemented By | Implementation:Lm_sys_FastChat_Show_Result |
Overview
MT-Bench Result Display is the principle governing how evaluation results from the LLM judge are aggregated, analyzed, and presented to the user. This is the final stage of the MT-Bench evaluation pipeline, transforming raw JSONL judgment records into human-readable summary tables. The system supports two distinct display modes corresponding to the two evaluation paradigms: single-answer score summaries and pairwise win/loss/tie rate tables.
Description
Single-Mode: Mean Scores Per Model by Turn
In single-answer evaluation mode, each model receives a numerical score (1-10) per question per turn from the judge. The result display aggregates these scores by computing the arithmetic mean for each model, broken down by turn:
- First-turn scores: the mean of all scores where `turn == 1`, grouped by model. This measures the model's ability to handle initial questions across all 8 categories.
- Second-turn scores: the mean of all scores where `turn == 2`, grouped by model. This measures multi-turn capability: how well the model handles follow-up questions that build on the first exchange.
- Average scores: the overall mean across both turns, providing a single summary metric per model.
Scores of -1 (indicating extraction failures) are filtered out before aggregation. Models are ranked by descending score in each table.
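The aggregation above can be sketched with pandas. This is a minimal illustration, not FastChat's actual code: the record fields (`model`, `turn`, `score`) follow the description in this section, and the data is invented.

```python
import pandas as pd

# Hypothetical single-mode judgment records (illustrative data only).
records = [
    {"model": "model-a", "turn": 1, "score": 8},
    {"model": "model-a", "turn": 2, "score": 7},
    {"model": "model-a", "turn": 1, "score": -1},  # extraction failure
    {"model": "model-b", "turn": 1, "score": 6},
    {"model": "model-b", "turn": 2, "score": 8},
]

df = pd.DataFrame(records)
df = df[df["score"] != -1]  # drop extraction failures before averaging

# Mean score per model per turn, plus the overall average across both turns,
# ranked by descending score.
per_turn = df.groupby(["model", "turn"])["score"].mean().unstack()
overall = df.groupby("model")["score"].mean().sort_values(ascending=False)

print(per_turn)
print(overall)
```

Here model-a averages 7.5 overall (the -1 record is excluded) and ranks above model-b at 7.0.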
Pairwise Mode: Win/Loss/Tie Rates
In pairwise evaluation mode, each comparison between two models yields a winner (or tie) based on the judge's assessment across two games (with swapped positions). The result display computes:
- Win count: Number of comparisons where the model was judged better
- Loss count: Number of comparisons where the model was judged worse
- Tie count: Number of comparisons where the result was a tie (either explicit tie, or disagreement between the two swapped games)
- Win rate: `wins / (wins + losses + ties)`
- Loss rate: `losses / (wins + losses + ties)`
Adjusted Win Rate
A key metric in pairwise evaluation is the adjusted win rate, which accounts for ties more fairly than the raw win rate. The formula is:
`adjusted_win_rate = (wins + 0.5 * ties) / (wins + losses + ties)`
This treats each tie as half a win and half a loss, providing a more balanced ranking that does not penalize models simply for having many close matchups. The adjusted win rate ranges from 0.0 (all losses) to 1.0 (all wins), with 0.5 representing a perfectly even record. Models are ranked by descending adjusted win rate.
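The rate formulas above reduce to a few lines of arithmetic. A small sketch with made-up tallies:

```python
# Illustrative win/loss/tie tallies for one model (counts are invented).
wins, losses, ties = 30, 10, 20
total = wins + losses + ties

win_rate = wins / total
loss_rate = losses / total
# Each tie counts as half a win and half a loss.
adjusted_win_rate = (wins + 0.5 * ties) / total

print(f"win={win_rate:.3f} loss={loss_rate:.3f} adjusted={adjusted_win_rate:.3f}")
```

With 30 wins, 10 losses, and 20 ties, the raw win rate is 0.5 while the adjusted win rate is about 0.667, reflecting that the ties were not losses.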
Per-Category Analysis
While the current display functions aggregate across all categories, the underlying data supports per-category breakdown. Each judgment record retains the question's category, enabling users to filter or group results to understand model performance in specific domains (e.g., coding vs. writing). The framework supports this through the model_list filter parameter.
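A per-category breakdown of this kind can be sketched by grouping on the retained `category` field. The record shape and data below are hypothetical:

```python
from collections import defaultdict

# Hypothetical single-mode records carrying the question category.
records = [
    {"model": "model-a", "category": "coding", "score": 6},
    {"model": "model-a", "category": "writing", "score": 9},
    {"model": "model-b", "category": "coding", "score": 8},
    {"model": "model-b", "category": "coding", "score": 7},
]

# (model, category) -> [score total, record count]
sums = defaultdict(lambda: [0, 0])
for r in records:
    key = (r["model"], r["category"])
    sums[key][0] += r["score"]
    sums[key][1] += 1

# Mean score per model per category.
per_category = {k: total / count for k, (total, count) in sums.items()}
print(per_category)
```

This makes domain-specific gaps visible, e.g. a model that scores well on writing but poorly on coding.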
Filtering by Model List
Both single and pairwise display modes support an optional model list filter. When provided, only the specified models are included in the output tables. This is useful for:
- Comparing a specific subset of models without noise from others
- Focusing on newly added models against established baselines
- Generating targeted comparison reports
In pairwise mode, additional filtering by baseline_model restricts results to comparisons involving that specific baseline.
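The two filters compose as simple predicates over the judgment records. The function name, record fields, and data below are illustrative, not FastChat's actual API:

```python
# Hypothetical pairwise judgment records (illustrative data only).
records = [
    {"model": "model-a", "baseline": "gpt-3.5", "winner": "model"},
    {"model": "model-b", "baseline": "gpt-3.5", "winner": "baseline"},
    {"model": "model-a", "baseline": "claude-v1", "winner": "tie"},
]

def filter_records(records, model_list=None, baseline_model=None):
    """Keep only records matching the optional model and baseline filters."""
    out = records
    if model_list is not None:
        out = [r for r in out if r["model"] in model_list]
    if baseline_model is not None:
        out = [r for r in out if r["baseline"] == baseline_model]
    return out

# Only model-a's comparisons against the gpt-3.5 baseline remain.
subset = filter_records(records, model_list=["model-a"], baseline_model="gpt-3.5")
print(len(subset))
```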
Usage
The result display principle is applied in the final phase of the MT-Bench workflow:
- Complete judgment generation: Ensure that LLM judge evaluation has been run for all models of interest.
- Select display mode: Choose `single` for absolute score summaries, or `pairwise-baseline`/`pairwise-all` for relative comparisons.
- Run the display script: The system reads the JSONL judgment files, aggregates the data using pandas, and prints formatted tables to stdout.
- Interpret results: For single mode, higher mean scores indicate better performance. For pairwise mode, higher adjusted win rates indicate stronger relative performance.
Theoretical Basis
The result display design is grounded in several statistical and evaluation principles:
- Per-turn decomposition: Separating first-turn and second-turn scores reveals whether a model's strength lies in initial comprehension or in maintaining conversational coherence. A model with high first-turn but low second-turn scores may struggle with context retention.
- Adjusted win rate as a robust metric: The naive win rate ignores ties, which can distort rankings when tie rates vary significantly between models. The adjusted win rate (equivalent to the Bradley-Terry model's win probability estimate under certain assumptions) provides a more robust comparison by treating ties as partial evidence for both models.
- Position-bias-aware tie classification: In pairwise mode, a "tie" can arise from two sources: (a) the judge explicitly declares a tie, or (b) the two games (with swapped positions) disagree on the winner. The latter is treated as a tie because the disagreement likely reflects position bias rather than a genuine quality difference.
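The tie classification rule described in the third bullet can be sketched as a small decision function. This is a simplified illustration of the rule, not FastChat's actual implementation; each argument is the judge's verdict for one of the two position-swapped games:

```python
def classify(game1_winner: str, game2_winner: str) -> str:
    """Combine two position-swapped game verdicts into win/loss/tie.

    Each argument is "model", "baseline", or "tie" (a simplified sketch
    of the rule described above).
    """
    if game1_winner == game2_winner == "model":
        return "win"
    if game1_winner == game2_winner == "baseline":
        return "loss"
    # Explicit ties, or disagreement between the swapped games, both count
    # as a tie: disagreement likely reflects position bias, not quality.
    return "tie"

print(classify("model", "model"))     # win: both games agree
print(classify("model", "baseline"))  # tie: the swapped games disagree
```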
- Error filtering: Judgment extraction failures (score == -1 in single mode, "error" winners in pairwise mode) are excluded from aggregation to prevent corrupted data from skewing results.
Related Pages
- Implementation:Lm_sys_FastChat_Show_Result -- The implementation that realizes this principle
- Principle:Lm_sys_FastChat_LLM_Judge_Evaluation -- The preceding phase that generates the judgment data
- Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation -- The first phase of the MT-Bench pipeline