
Implementation:Lm sys FastChat Show Result

From Leeroopedia


Page Type: Implementation
Title: Show Result
Repository: lm-sys/FastChat
Knowledge Sources: Source code analysis of fastchat/llm_judge/show_result.py
Domains: LLM Evaluation, Data Analysis, Result Aggregation
Last Updated: 2026-02-07 14:00 GMT
Implements: Principle:Lm_sys_FastChat_MT_Bench_Result_Display

Overview

This implementation provides two display functions -- display_result_single and display_result_pairwise -- that read JSONL judgment files, aggregate scores or win/loss/tie counts using pandas, and print formatted summary tables to stdout. The module serves as the final reporting step in the MT-Bench evaluation pipeline.

Description

display_result_single

This function handles single-answer grading results:

  1. Reads the JSONL judgment file into a pandas DataFrame using pd.read_json(input_file, lines=True).
  2. Selects the model, score, and turn columns.
  3. Filters out rows where score == -1 (extraction failures).
  4. Optionally filters to only models in args.model_list.
  5. Prints three tables (for mt_bench):
    • First turn: Filters to turn == 1, groups by (model, turn), computes mean score, sorts descending.
    • Second turn: Filters to turn == 2, groups by (model, turn), computes mean score, sorts descending.
    • Average: Groups by model only, computes mean score across both turns, sorts descending.
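The grouping steps above can be sketched with pandas (a self-contained illustration of the same aggregation logic, not the verbatim FastChat source; the sample records are invented):

```python
import pandas as pd

# Sample judgment records, mimicking the single-mode JSONL schema described above.
df = pd.DataFrame([
    {"model": "model-a", "score": 7.0, "turn": 1},
    {"model": "model-a", "score": 6.0, "turn": 2},
    {"model": "model-b", "score": 5.0, "turn": 1},
    {"model": "model-b", "score": -1, "turn": 2},  # extraction failure
])

# Step 3: drop rows where score extraction failed.
df = df[df["score"] != -1]

# First-turn table: mean score per (model, turn), sorted descending.
first = (df[df["turn"] == 1]
         .groupby(["model", "turn"]).mean()
         .sort_values(by="score", ascending=False))

# Average table: mean over both turns, grouped by model only.
avg = (df[["model", "score"]]
       .groupby("model").mean()
       .sort_values(by="score", ascending=False))

print(first)
print(avg)
```

The second-turn table follows the same pattern with `turn == 2`.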

display_result_pairwise

This function handles pairwise comparison results:

  1. Reads the JSONL judgment file into a pandas DataFrame.
  2. Filters out rows where g1_winner or g2_winner is "error".
  3. Iterates row by row to classify each comparison:
    • If g1_winner == "tie" or g1_winner != g2_winner (games disagree): both models get a tie.
    • If both games agree: the winner gets a win, the loser gets a loss.
  4. Groups by model and sums win/loss/tie counts.
  5. Removes the baseline model from the output (if filtering by baseline).
  6. Computes:
    • win_rate = win / (win + loss + tie)
    • loss_rate = loss / (win + loss + tie)
    • win_rate_adjusted = (win + 0.5 * tie) / (win + loss + tie)
  7. Prints the table sorted by win_rate_adjusted descending.
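The classification and rate computation above can be sketched as follows (a minimal standalone version of the described logic, with invented sample rows; the real module iterates a pandas DataFrame rather than plain dicts):

```python
from collections import defaultdict
import pandas as pd

# Sample pairwise judgments; g1 and g2 are the two game orders (positions swapped).
rows = [
    {"model_1": "a", "model_2": "b", "g1_winner": "model_1", "g2_winner": "model_1"},
    {"model_1": "a", "model_2": "b", "g1_winner": "model_1", "g2_winner": "model_2"},  # games disagree -> tie
    {"model_1": "a", "model_2": "b", "g1_winner": "tie",     "g2_winner": "tie"},
    {"model_1": "a", "model_2": "b", "g1_winner": "model_2", "g2_winner": "model_2"},
]

counts = defaultdict(lambda: {"win": 0, "loss": 0, "tie": 0})
for r in rows:
    if r["g1_winner"] == "tie" or r["g1_winner"] != r["g2_winner"]:
        counts[r["model_1"]]["tie"] += 1
        counts[r["model_2"]]["tie"] += 1
    else:
        winner = r[r["g1_winner"]]  # map "model_1"/"model_2" to the actual name
        loser = r["model_2"] if r["g1_winner"] == "model_1" else r["model_1"]
        counts[winner]["win"] += 1
        counts[loser]["loss"] += 1

table = pd.DataFrame(counts).T
total = table.sum(axis=1)  # win + loss + tie, computed before adding rate columns
table["win_rate"] = table["win"] / total
table["loss_rate"] = table["loss"] / total
table["win_rate_adjusted"] = (table["win"] + 0.5 * table["tie"]) / total
print(table.sort_values(by="win_rate_adjusted", ascending=False))
```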

Usage

Command-Line Interface

# Display single-answer grading results
python3 -m fastchat.llm_judge.show_result --mode single

# Display pairwise comparison results against baseline
python3 -m fastchat.llm_judge.show_result --mode pairwise-baseline

# Display pairwise comparison results among all pairs
python3 -m fastchat.llm_judge.show_result --mode pairwise-all

# Filter to specific models
python3 -m fastchat.llm_judge.show_result --mode single --model-list vicuna-7b-v1.5 llama-2-7b-chat

# Use a custom input file
python3 -m fastchat.llm_judge.show_result --mode single --input-file /path/to/judgments.jsonl

CLI Parameters

  • --bench-name (str, default "mt_bench"): Name of the benchmark, used to locate default input files.
  • --input-file (str, default None): Custom input file path; auto-generated from bench-name and judge-model if not set.
  • --judge-model (str, default "gpt-4"): Judge model, used to locate default input files.
  • --baseline-model (str, default "gpt-3.5-turbo"): Baseline model for pairwise filtering.
  • --model-list (list[str], default None): Optional list of models to include in the results.
  • --mode (str, default "single"): Display mode: "single", "pairwise-baseline", or "pairwise-all".
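When --input-file is not given, the path is derived from bench-name, judge-model, and mode. A sketch consistent with the file locations listed in the I/O Contract (default_input_file is a hypothetical helper; the real script builds this path inline):

```python
def default_input_file(bench_name: str, judge_model: str, mode: str) -> str:
    # Single-answer judgments and pairwise judgments live in separate files;
    # both pairwise modes share the "_pair" file.
    suffix = "single" if mode == "single" else "pair"
    return f"data/{bench_name}/model_judgment/{judge_model}_{suffix}.jsonl"

print(default_input_file("mt_bench", "gpt-4", "single"))
print(default_input_file("mt_bench", "gpt-4", "pairwise-baseline"))
```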

Programmatic Usage

from fastchat.llm_judge.show_result import display_result_single, display_result_pairwise
import argparse

# Create args namespace for single-mode display
args = argparse.Namespace(
    bench_name="mt_bench",
    input_file=None,
    judge_model="gpt-4",
    model_list=["vicuna-7b-v1.5", "llama-2-7b-chat"],
)
display_result_single(args)

# Create args namespace for pairwise-mode display
args = argparse.Namespace(
    bench_name="mt_bench",
    input_file=None,
    judge_model="gpt-4",
    baseline_model="gpt-3.5-turbo",
    model_list=None,
)
display_result_pairwise(args)

Code Reference

Source Location

Function File Lines
display_result_single fastchat/llm_judge/show_result.py L9-36
display_result_pairwise fastchat/llm_judge/show_result.py L39-92

Signature

def display_result_single(args) -> None:
    """Read JSONL judgment file, group by model and turn, compute mean scores, print tables.

    Args:
        args: Namespace with bench_name, input_file, judge_model, model_list attributes.
    """
    ...

def display_result_pairwise(args) -> None:
    """Read JSONL judgment file, compute win/loss/tie rates and adjusted win rate, print table.

    Args:
        args: Namespace with bench_name, input_file, judge_model, baseline_model, model_list attributes.
    """
    ...

Import

from fastchat.llm_judge.show_result import display_result_single, display_result_pairwise

I/O Contract

Inputs

  • Single-mode judgment file: JSONL at data/mt_bench/model_judgment/{judge_model}_single.jsonl; each line contains model, score (float), turn (int), and other metadata.
  • Pairwise-mode judgment file: JSONL at data/mt_bench/model_judgment/{judge_model}_pair.jsonl; each line contains model_1, model_2, g1_winner, g2_winner, and other metadata.
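A minimal way to produce a file in the single-mode input format for testing (the exact extra metadata fields, such as question_id, are assumptions beyond the fields listed above):

```python
import json
import os
import tempfile

import pandas as pd

# Illustrative judgment records with the fields the single-mode display reads.
records = [
    {"question_id": 81, "model": "vicuna-7b-v1.5", "score": 7.0, "turn": 1},
    {"question_id": 81, "model": "vicuna-7b-v1.5", "score": 6.0, "turn": 2},
]

path = os.path.join(tempfile.mkdtemp(), "gpt-4_single.jsonl")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")  # one JSON object per line

# Round-trip check: pandas parses one record per line, as the module does.
df = pd.read_json(path, lines=True)
print(df)
```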

Outputs

Output Format Description
Single-mode tables Printed pandas DataFrames to stdout Three tables: first turn scores, second turn scores, and average scores per model
Pairwise-mode table Printed pandas DataFrame to stdout One table with win, loss, tie counts, win_rate, loss_rate, and win_rate_adjusted per model

Example single-mode output:

########## First turn ##########
                          score
model              turn
vicuna-7b-v1.5     1      6.85
llama-2-7b-chat    1      6.20

########## Second turn ##########
                          score
model              turn
vicuna-7b-v1.5     2      5.95
llama-2-7b-chat    2      5.40

########## Average ##########
                    score
model
vicuna-7b-v1.5      6.40
llama-2-7b-chat     5.80

Example pairwise-mode output:

                    win  loss  tie  win_rate  loss_rate  win_rate_adjusted
model
vicuna-7b-v1.5       45    25   10    0.5625     0.3125             0.6250
llama-2-7b-chat      30    35   15    0.3750     0.4375             0.4688

Usage Examples

Quick Model Comparison (Single Mode)

# After running gen_judgment.py in single mode:
python3 -m fastchat.llm_judge.show_result \
    --mode single \
    --judge-model gpt-4 \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat

Pairwise Leaderboard Against GPT-3.5-Turbo

# After running gen_judgment.py in pairwise-baseline mode:
python3 -m fastchat.llm_judge.show_result \
    --mode pairwise-baseline \
    --judge-model gpt-4 \
    --baseline-model gpt-3.5-turbo

All-Pairs Comparison

# After running gen_judgment.py in pairwise-all mode:
python3 -m fastchat.llm_judge.show_result \
    --mode pairwise-all \
    --judge-model gpt-4

In pairwise-all mode, the baseline_model is automatically set to None, and all model pairs are included in the results.

Related Pages

  • Principle:Lm_sys_FastChat_MT_Bench_Result_Display