Implementation:Lm_sys_FastChat_Show_Result
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Show Result |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source code analysis of fastchat/llm_judge/show_result.py |
| Domains | LLM Evaluation, Data Analysis, Result Aggregation |
| Last Updated | 2026-02-07 14:00 GMT |
| Implements | Principle:Lm_sys_FastChat_MT_Bench_Result_Display |
Overview
This implementation provides two display functions, `display_result_single` and `display_result_pairwise`, that read JSONL judgment files, aggregate scores or win/loss/tie counts with pandas, and print formatted summary tables to stdout. The module is the final reporting step in the MT-Bench evaluation pipeline.
Description
display_result_single
This function handles single-answer grading results:
- Reads the JSONL judgment file into a pandas DataFrame using `pd.read_json(input_file, lines=True)`.
- Selects the `model`, `score`, and `turn` columns.
- Filters out rows where `score == -1` (extraction failures).
- Optionally filters to only the models in `args.model_list`.
- Prints three tables (for `mt_bench`):
  - First turn: filters to `turn == 1`, groups by `(model, turn)`, computes the mean score, sorts descending.
  - Second turn: filters to `turn == 2`, groups by `(model, turn)`, computes the mean score, sorts descending.
  - Average: groups by `model` only, computes the mean score across both turns, sorts descending.
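The filter-then-aggregate steps above can be sketched with pandas. The records below are made-up toy data (not real judgments); only the `model`, `score`, and `turn` fields from the schema are used:

```python
import pandas as pd

# Toy records following the single-mode judgment schema (values invented).
records = [
    {"model": "model-a", "score": 7.0, "turn": 1},
    {"model": "model-a", "score": 6.0, "turn": 2},
    {"model": "model-b", "score": -1.0, "turn": 1},  # extraction failure, dropped below
    {"model": "model-b", "score": 5.0, "turn": 1},
    {"model": "model-b", "score": 4.0, "turn": 2},
]
df = pd.DataFrame(records)[["model", "score", "turn"]]
df = df[df["score"] != -1]  # drop extraction failures

# First-turn table: mean score per (model, turn), best model first
df_1 = (
    df[df["turn"] == 1]
    .groupby(["model", "turn"])
    .mean()
    .sort_values(by="score", ascending=False)
)
print(df_1)

# Average table: mean score per model across both turns
df_all = (
    df[["model", "score"]]
    .groupby("model")
    .mean()
    .sort_values(by="score", ascending=False)
)
print(df_all)
```

The second-turn table is identical to the first-turn one with `turn == 2` substituted.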
display_result_pairwise
This function handles pairwise comparison results:
- Reads the JSONL judgment file into a pandas DataFrame.
- Filters out rows where `g1_winner` or `g2_winner` is `"error"`.
- Iterates row by row to classify each comparison:
  - If `g1_winner == "tie"` or `g1_winner != g2_winner` (the two games disagree): both models get a tie.
  - If both games agree on a winner: the winner gets a win, the loser gets a loss.
- Groups by model and sums win/loss/tie counts.
- Removes the baseline model from the output (when filtering by baseline).
- Computes:
  - `win_rate = win / (win + loss + tie)`
  - `loss_rate = loss / (win + loss + tie)`
  - `win_rate_adjusted = (win + 0.5 * tie) / (win + loss + tie)`
- Prints the table sorted by `win_rate_adjusted`, descending.
Usage
Command-Line Interface
```shell
# Display single-answer grading results
python3 -m fastchat.llm_judge.show_result --mode single

# Display pairwise comparison results against a baseline
python3 -m fastchat.llm_judge.show_result --mode pairwise-baseline

# Display pairwise comparison results among all pairs
python3 -m fastchat.llm_judge.show_result --mode pairwise-all

# Filter to specific models
python3 -m fastchat.llm_judge.show_result --mode single --model-list vicuna-7b-v1.5 llama-2-7b-chat

# Use a custom input file
python3 -m fastchat.llm_judge.show_result --mode single --input-file /path/to/judgments.jsonl
```
CLI Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--bench-name` | str | `"mt_bench"` | Name of the benchmark (used to locate default input files) |
| `--input-file` | str | None | Custom input file path (auto-generated from bench-name and judge-model if not set) |
| `--judge-model` | str | `"gpt-4"` | Judge model (used to locate default input files) |
| `--baseline-model` | str | `"gpt-3.5-turbo"` | Baseline model for pairwise filtering |
| `--model-list` | list[str] | None | Optional list of models to include in the results |
| `--mode` | str | `"single"` | Display mode: `"single"`, `"pairwise-baseline"`, or `"pairwise-all"` |
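When `--input-file` is not set, the module locates the judgment file from the other parameters. The hypothetical helper below illustrates the path pattern given in the I/O contract; it is not the actual code in show_result.py:

```python
# Hypothetical helper (not part of show_result.py) showing how the default
# input path is derived from --bench-name, --judge-model, and --mode.
def default_input_file(bench_name: str, judge_model: str, mode: str) -> str:
    suffix = "single" if mode == "single" else "pair"
    return f"data/{bench_name}/model_judgment/{judge_model}_{suffix}.jsonl"

print(default_input_file("mt_bench", "gpt-4", "single"))
```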
Programmatic Usage
```python
from fastchat.llm_judge.show_result import display_result_single, display_result_pairwise
import argparse

# Create args namespace for single-mode display
args = argparse.Namespace(
    bench_name="mt_bench",
    input_file=None,
    judge_model="gpt-4",
    model_list=["vicuna-7b-v1.5", "llama-2-7b-chat"],
)
display_result_single(args)

# Create args namespace for pairwise-mode display
args = argparse.Namespace(
    bench_name="mt_bench",
    input_file=None,
    judge_model="gpt-4",
    baseline_model="gpt-3.5-turbo",
    model_list=None,
)
display_result_pairwise(args)
```
Code Reference
Source Location
| Function | File | Lines |
|---|---|---|
| `display_result_single` | `fastchat/llm_judge/show_result.py` | L9-36 |
| `display_result_pairwise` | `fastchat/llm_judge/show_result.py` | L39-92 |
Signature
```python
def display_result_single(args) -> None:
    """Read JSONL judgment file, group by model and turn, compute mean scores, print tables.

    Args:
        args: Namespace with bench_name, input_file, judge_model, model_list attributes.
    """
    ...


def display_result_pairwise(args) -> None:
    """Read JSONL judgment file, compute win/loss/tie rates and adjusted win rate, print table.

    Args:
        args: Namespace with bench_name, input_file, judge_model, baseline_model, model_list attributes.
    """
    ...
```
Import
```python
from fastchat.llm_judge.show_result import display_result_single, display_result_pairwise
```
I/O Contract
Inputs
| Input | Format | Description |
|---|---|---|
| Single-mode judgment file | JSONL (`data/mt_bench/model_judgment/{judge_model}_single.jsonl`) | Each line contains `model`, `score` (float), `turn` (int), and other metadata |
| Pairwise-mode judgment file | JSONL (`data/mt_bench/model_judgment/{judge_model}_pair.jsonl`) | Each line contains `model_1`, `model_2`, `g1_winner`, `g2_winner`, and other metadata |
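For illustration, a couple of hand-written single-mode lines can be loaded the same way the module reads its input (`pd.read_json(..., lines=True)`). Only the minimal fields are shown; real judgment files carry additional metadata:

```python
import io
import pandas as pd

# Two invented single-mode judgment lines (minimal fields only).
jsonl = (
    '{"model": "model-a", "score": 7.5, "turn": 1}\n'
    '{"model": "model-a", "score": 6.0, "turn": 2}\n'
)
df = pd.read_json(io.StringIO(jsonl), lines=True)
print(df)
```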
Outputs
| Output | Format | Description |
|---|---|---|
| Single-mode tables | Printed pandas DataFrames to stdout | Three tables: first turn scores, second turn scores, and average scores per model |
| Pairwise-mode table | Printed pandas DataFrame to stdout | One table with win, loss, tie counts, win_rate, loss_rate, and win_rate_adjusted per model |
Example single-mode output:
```
########## First turn ##########
                        score
model            turn
vicuna-7b-v1.5   1       6.85
llama-2-7b-chat  1       6.20

########## Second turn ##########
                        score
model            turn
vicuna-7b-v1.5   2       5.95
llama-2-7b-chat  2       5.40

########## Average ##########
                  score
model
vicuna-7b-v1.5     6.40
llama-2-7b-chat    5.80
```
Example pairwise-mode output:
```
                 win  loss  tie  win_rate  loss_rate  win_rate_adjusted
model
vicuna-7b-v1.5    45    25   10    0.5625     0.3125             0.6250
llama-2-7b-chat   30    35   15    0.3750     0.4375             0.4688
```
Usage Examples
Quick Model Comparison (Single Mode)
```shell
# After running gen_judgment.py in single mode:
python3 -m fastchat.llm_judge.show_result \
    --mode single \
    --judge-model gpt-4 \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat
```
Pairwise Leaderboard Against GPT-3.5-Turbo
```shell
# After running gen_judgment.py in pairwise-baseline mode:
python3 -m fastchat.llm_judge.show_result \
    --mode pairwise-baseline \
    --judge-model gpt-4 \
    --baseline-model gpt-3.5-turbo
```
All-Pairs Comparison
```shell
# After running gen_judgment.py in pairwise-all mode:
python3 -m fastchat.llm_judge.show_result \
    --mode pairwise-all \
    --judge-model gpt-4
```
In pairwise-all mode, `baseline_model` is automatically set to `None`, and all model pairs are included in the results.
Related Pages
- Principle:Lm_sys_FastChat_MT_Bench_Result_Display -- The principle this implementation realizes
- Implementation:Lm_sys_FastChat_Gen_Judgment -- The preceding step: generating judge evaluations
- Implementation:Lm_sys_FastChat_Gen_Model_Answer -- The first step: generating model answers