
Principle:Lm sys FastChat MT Bench Result Display

From Leeroopedia


Field Value
Page Type Principle
Title MT Bench Result Display
Repository lm-sys/FastChat
Knowledge Sources Source code analysis of fastchat/llm_judge/show_result.py
Domains LLM Evaluation, Data Analysis, Result Aggregation
Last Updated 2026-02-07 14:00 GMT
Implemented By Implementation:Lm_sys_FastChat_Show_Result

Overview

MT-Bench Result Display is the principle governing how evaluation results from the LLM judge are aggregated, analyzed, and presented to the user. This is the final stage of the MT-Bench evaluation pipeline, transforming raw JSONL judgment records into human-readable summary tables. The system supports two distinct display modes corresponding to the two evaluation paradigms: single-answer score summaries and pairwise win/loss/tie rate tables.

Description

Single-Mode: Mean Scores Per Model by Turn

In single-answer evaluation mode, each model receives a numerical score (1-10) per question per turn from the judge. The result display aggregates these scores by computing the arithmetic mean for each model, broken down by turn:

  • First turn scores: Mean of all scores where turn == 1, grouped by model. This measures the model's ability to handle initial questions across all 8 categories.
  • Second turn scores: Mean of all scores where turn == 2, grouped by model. This measures multi-turn capability: how well the model handles follow-up questions that build on the first exchange.
  • Average scores: The overall mean across both turns, providing a single summary metric per model.

Scores of -1 (indicating extraction failures) are filtered out before aggregation. Models are ranked by descending score in each table.
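The single-mode aggregation described above can be sketched with pandas. This is a minimal illustration under assumed column names (`model`, `turn`, `score`), not FastChat's actual code; the function name is hypothetical:

```python
import pandas as pd

def summarize_single(df: pd.DataFrame) -> pd.DataFrame:
    """Mean score per model, broken down by turn, plus an overall average.

    Hypothetical sketch of the single-mode display logic: filter out
    failed extractions (score == -1), then group and average.
    """
    df = df[df["score"] != -1]  # drop extraction failures before aggregation
    per_turn = df.groupby(["model", "turn"])["score"].mean().unstack()
    per_turn["average"] = df.groupby("model")["score"].mean()
    return per_turn.sort_values("average", ascending=False)

# Toy judgment records (illustrative, not real benchmark data).
judgments = pd.DataFrame([
    {"model": "model-a", "turn": 1, "score": 8},
    {"model": "model-a", "turn": 2, "score": 6},
    {"model": "model-b", "turn": 1, "score": 7},
    {"model": "model-b", "turn": 2, "score": -1},  # failed extraction, filtered
])
print(summarize_single(judgments))
```

With these toy records, model-a averages (8 + 6) / 2 = 7.0, and model-b's -1 is dropped so its average is 7.0 from the first turn alone.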

Pairwise Mode: Win/Loss/Tie Rates

In pairwise evaluation mode, each comparison between two models yields a winner (or tie) based on the judge's assessment across two games (with swapped positions). The result display computes:

  • Win count: Number of comparisons where the model was judged better
  • Loss count: Number of comparisons where the model was judged worse
  • Tie count: Number of comparisons where the result was a tie (either explicit tie, or disagreement between the two swapped games)
  • Win rate: wins / (wins + losses + ties)
  • Loss rate: losses / (wins + losses + ties)
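These counts and rates reduce to simple tallying once each comparison has been resolved to a win, loss, or tie for the model in question. The following is an illustrative sketch, not the actual show_result.py code:

```python
from collections import Counter

def pairwise_rates(outcomes):
    """Win/loss/tie counts and rates for one model.

    `outcomes` is a list of per-comparison results for that model:
    each element is 'win', 'loss', or 'tie'. (Hypothetical helper.)
    """
    c = Counter(outcomes)
    total = c["win"] + c["loss"] + c["tie"]
    return {
        "win": c["win"],
        "loss": c["loss"],
        "tie": c["tie"],
        "win_rate": c["win"] / total,
        "loss_rate": c["loss"] / total,
    }

print(pairwise_rates(["win", "win", "tie", "loss"]))
```

For the example list, win rate is 2/4 = 0.5 and loss rate is 1/4 = 0.25.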

Adjusted Win Rate

A key metric in pairwise evaluation is the adjusted win rate, which accounts for ties more fairly than the raw win rate. The formula is:

adjusted_win_rate = (wins + 0.5 * ties) / (wins + losses + ties)

This treats each tie as half a win and half a loss, providing a more balanced ranking that does not penalize models simply for having many close matchups. The adjusted win rate ranges from 0.0 (all losses) to 1.0 (all wins), with 0.5 representing a perfectly even record. Models are ranked by descending adjusted win rate.
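As a worked example of the formula (the function itself is hypothetical):

```python
def adjusted_win_rate(wins: int, losses: int, ties: int) -> float:
    """Each tie counts as half a win: (wins + 0.5 * ties) / total."""
    return (wins + 0.5 * ties) / (wins + losses + ties)

# A model with 10 wins, 6 losses, and 4 ties:
# (10 + 0.5 * 4) / 20 = 12 / 20 = 0.6
print(adjusted_win_rate(10, 6, 4))
```

Note that a perfectly even record, whatever the mix of wins, losses, and ties, lands exactly at 0.5 (e.g. 5 wins and 5 losses, or 10 ties).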

Per-Category Analysis

While the current display functions aggregate across all categories, the underlying data supports per-category breakdown. Each judgment record retains the question's category, enabling users to filter or group results to understand model performance in specific domains (e.g., coding vs. writing). The framework supports this through the model_list filter parameter.
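Since each record retains its category, a per-category breakdown is a straightforward extra grouping key. A pandas sketch with hypothetical records:

```python
import pandas as pd

# Illustrative judgment records; `category` is retained per record.
df = pd.DataFrame([
    {"model": "model-a", "category": "coding",  "score": 6},
    {"model": "model-a", "category": "writing", "score": 9},
    {"model": "model-b", "category": "coding",  "score": 8},
    {"model": "model-b", "category": "writing", "score": 7},
])

# Mean score per category per model: one row per category, one column per model.
by_category = df.groupby(["category", "model"])["score"].mean().unstack()
print(by_category)
```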

Filtering by Model List

Both single and pairwise display modes support an optional model list filter. When provided, only the specified models are included in the output tables. This is useful for:

  • Comparing a specific subset of models without noise from others
  • Focusing on newly added models against established baselines
  • Generating targeted comparison reports

In pairwise mode, additional filtering by baseline_model restricts results to comparisons involving that specific baseline.
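The model-list filter itself amounts to a membership test before aggregation. A sketch with assumed column names:

```python
import pandas as pd

results = pd.DataFrame({
    "model": ["model-a", "model-b", "model-c"],
    "score": [7.2, 6.8, 8.1],
})

model_list = ["model-a", "model-c"]  # hypothetical subset of interest

# When no list is given, keep everything; otherwise keep only listed models.
filtered = results if model_list is None else results[results["model"].isin(model_list)]
print(filtered)
```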

Usage

The result display principle is applied in the final phase of the MT-Bench workflow:

  1. Complete judgment generation: Ensure that LLM judge evaluation has been run for all models of interest.
  2. Select display mode: Choose single for absolute score summaries or pairwise-baseline/pairwise-all for relative comparisons.
  3. Run the display script: The system reads the JSONL judgment files, aggregates the data using pandas, and prints formatted tables to stdout.
  4. Interpret results: For single mode, higher mean scores indicate better performance. For pairwise mode, higher adjusted win rates indicate stronger relative performance.
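Step 3's input format, one JSON judgment record per line, can be read with a few lines of standard-library Python. This is a minimal sketch assuming records carry `model`, `turn`, and `score` fields, not the actual script:

```python
import io
import json

# Stand-in for an open judgment JSONL file (one JSON object per line).
sample = io.StringIO(
    '{"model": "model-a", "turn": 1, "score": 8}\n'
    '{"model": "model-a", "turn": 2, "score": 7}\n'
)

# Parse each non-empty line into a dict, then aggregate.
records = [json.loads(line) for line in sample if line.strip()]
mean_score = sum(r["score"] for r in records) / len(records)
print(mean_score)  # (8 + 7) / 2 = 7.5
```

In practice the script loads such records into a pandas DataFrame and applies the groupby aggregations described above.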

Theoretical Basis

The result display design is grounded in several statistical and evaluation principles:

  • Per-turn decomposition: Separating first-turn and second-turn scores reveals whether a model's strength lies in initial comprehension or in maintaining conversational coherence. A model with high first-turn but low second-turn scores may struggle with context retention.
  • Adjusted win rate as a robust metric: The naive win rate ignores ties, which can distort rankings when tie rates vary significantly between models. The adjusted win rate (equivalent to the Bradley-Terry model's win probability estimate under certain assumptions) provides a more robust comparison by treating ties as partial evidence for both models.
  • Position-bias-aware tie classification: In pairwise mode, a "tie" can arise from two sources: (a) the judge explicitly declares a tie, or (b) the two games (with swapped positions) disagree on the winner. The latter is treated as a tie because the disagreement likely reflects position bias rather than a genuine quality difference.
  • Error filtering: Judgment extraction failures (score == -1 in single mode, "error" winners in pairwise mode) are excluded from aggregation to prevent corrupted data from skewing results.
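The position-bias-aware tie classification and error filtering described above can be sketched as follows. This assumes both game winners are already expressed relative to the same model; the helper name and labels are illustrative:

```python
def classify_pair(game1_winner: str, game2_winner: str):
    """Combine two swapped-position games into a single outcome for model_a.

    Each game winner is 'model_a', 'model_b', 'tie', or 'error'.
    Sketch of the logic described above, not FastChat's actual code.
    """
    if "error" in (game1_winner, game2_winner):
        return None  # drop corrupted judgments entirely
    if game1_winner == game2_winner:
        if game1_winner == "tie":
            return "tie"  # explicit tie in both games
        return "win" if game1_winner == "model_a" else "loss"
    # The two games disagree: likely position bias, so count it as a tie.
    return "tie"

print(classify_pair("model_a", "model_a"))  # consistent win
print(classify_pair("model_a", "model_b"))  # disagreement -> tie
```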
