
Implementation:Lm sys FastChat Show Result

From Leeroopedia


Page Type: Implementation
Title: Show Result
Repository: lm-sys/FastChat
Knowledge Sources: Source code analysis of fastchat/llm_judge/show_result.py
Domains: LLM Evaluation, Data Analysis, Result Aggregation
Last Updated: 2026-02-07 14:00 GMT
Implements: Principle:Lm_sys_FastChat_MT_Bench_Result_Display

Overview

This implementation provides two display functions -- display_result_single and display_result_pairwise -- that read JSONL judgment files, aggregate scores or win/loss/tie counts using pandas, and print formatted summary tables to stdout. The module serves as the final reporting step in the MT-Bench evaluation pipeline.

Description

display_result_single

This function handles single-answer grading results:

  1. Reads the JSONL judgment file into a pandas DataFrame using pd.read_json(input_file, lines=True).
  2. Selects the model, score, and turn columns.
  3. Filters out rows where score == -1 (extraction failures).
  4. Optionally filters to only models in args.model_list.
  5. Prints three tables (for mt_bench):
    • First turn: Filters to turn == 1, groups by (model, turn), computes mean score, sorts descending.
    • Second turn: Filters to turn == 2, groups by (model, turn), computes mean score, sorts descending.
    • Average: Groups by model only, computes mean score across both turns, sorts descending.
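The grouping steps above can be sketched with pandas (a self-contained illustration of the same aggregation logic, not the verbatim FastChat source; the sample records are invented):

```python
import pandas as pd

# Sample judgment records, mimicking the single-mode JSONL schema described above.
df = pd.DataFrame([
    {"model": "model-a", "score": 7.0, "turn": 1},
    {"model": "model-a", "score": 6.0, "turn": 2},
    {"model": "model-b", "score": 5.0, "turn": 1},
    {"model": "model-b", "score": -1, "turn": 2},  # extraction failure
])

# Step 3: drop rows where score extraction failed.
df = df[df["score"] != -1]

# First-turn table: mean score per (model, turn), sorted descending.
first = (df[df["turn"] == 1]
         .groupby(["model", "turn"]).mean()
         .sort_values(by="score", ascending=False))

# Average table: mean over both turns, grouped by model only.
avg = (df[["model", "score"]]
       .groupby("model").mean()
       .sort_values(by="score", ascending=False))

print(first)
print(avg)
```

The second-turn table follows the same pattern with `turn == 2`.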

display_result_pairwise

This function handles pairwise comparison results:

  1. Reads the JSONL judgment file into a pandas DataFrame.
  2. Filters out rows where g1_winner or g2_winner is "error".
  3. Iterates row by row to classify each comparison:
    • If g1_winner == "tie" or g1_winner != g2_winner (games disagree): both models get a tie.
    • If both games agree: the winner gets a win, the loser gets a loss.
  4. Groups by model and sums win/loss/tie counts.
  5. Removes the baseline model from the output (if filtering by baseline).
  6. Computes:
    • win_rate = win / (win + loss + tie)
    • loss_rate = loss / (win + loss + tie)
    • win_rate_adjusted = (win + 0.5 * tie) / (win + loss + tie)
  7. Prints the table sorted by win_rate_adjusted descending.
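The classification and rate computation above can be sketched as follows (a minimal standalone version of the described logic, with invented sample rows; the real module iterates a pandas DataFrame rather than plain dicts):

```python
from collections import defaultdict
import pandas as pd

# Sample pairwise judgments; g1 and g2 are the two game orders (positions swapped).
rows = [
    {"model_1": "a", "model_2": "b", "g1_winner": "model_1", "g2_winner": "model_1"},
    {"model_1": "a", "model_2": "b", "g1_winner": "model_1", "g2_winner": "model_2"},  # games disagree -> tie
    {"model_1": "a", "model_2": "b", "g1_winner": "tie",     "g2_winner": "tie"},
    {"model_1": "a", "model_2": "b", "g1_winner": "model_2", "g2_winner": "model_2"},
]

counts = defaultdict(lambda: {"win": 0, "loss": 0, "tie": 0})
for r in rows:
    if r["g1_winner"] == "tie" or r["g1_winner"] != r["g2_winner"]:
        counts[r["model_1"]]["tie"] += 1
        counts[r["model_2"]]["tie"] += 1
    else:
        winner = r[r["g1_winner"]]  # map "model_1"/"model_2" to the actual name
        loser = r["model_2"] if r["g1_winner"] == "model_1" else r["model_1"]
        counts[winner]["win"] += 1
        counts[loser]["loss"] += 1

table = pd.DataFrame(counts).T
total = table.sum(axis=1)  # win + loss + tie, computed before adding rate columns
table["win_rate"] = table["win"] / total
table["loss_rate"] = table["loss"] / total
table["win_rate_adjusted"] = (table["win"] + 0.5 * table["tie"]) / total
print(table.sort_values(by="win_rate_adjusted", ascending=False))
```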

Usage

Command-Line Interface

# Display single-answer grading results
python3 -m fastchat.llm_judge.show_result --mode single

# Display pairwise comparison results against baseline
python3 -m fastchat.llm_judge.show_result --mode pairwise-baseline

# Display pairwise comparison results among all pairs
python3 -m fastchat.llm_judge.show_result --mode pairwise-all

# Filter to specific models
python3 -m fastchat.llm_judge.show_result --mode single --model-list vicuna-7b-v1.5 llama-2-7b-chat

# Use a custom input file
python3 -m fastchat.llm_judge.show_result --mode single --input-file /path/to/judgments.jsonl

CLI Parameters

  • --bench-name (str, default "mt_bench"): Name of the benchmark, used to locate default input files.
  • --input-file (str, default None): Custom input file path; auto-generated from bench-name and judge-model if not set.
  • --judge-model (str, default "gpt-4"): Judge model, used to locate default input files.
  • --baseline-model (str, default "gpt-3.5-turbo"): Baseline model for pairwise filtering.
  • --model-list (list[str], default None): Optional list of models to include in the results.
  • --mode (str, default "single"): Display mode: "single", "pairwise-baseline", or "pairwise-all".
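When --input-file is not given, the path is derived from bench-name, judge-model, and mode. A sketch consistent with the file locations listed in the I/O Contract (default_input_file is a hypothetical helper; the real script builds this path inline):

```python
def default_input_file(bench_name: str, judge_model: str, mode: str) -> str:
    # Single-answer judgments and pairwise judgments live in separate files;
    # both pairwise modes share the "_pair" file.
    suffix = "single" if mode == "single" else "pair"
    return f"data/{bench_name}/model_judgment/{judge_model}_{suffix}.jsonl"

print(default_input_file("mt_bench", "gpt-4", "single"))
print(default_input_file("mt_bench", "gpt-4", "pairwise-baseline"))
```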

Programmatic Usage

from fastchat.llm_judge.show_result import display_result_single, display_result_pairwise
import argparse

# Create args namespace for single-mode display
args = argparse.Namespace(
    bench_name="mt_bench",
    input_file=None,
    judge_model="gpt-4",
    model_list=["vicuna-7b-v1.5", "llama-2-7b-chat"],
)
display_result_single(args)

# Create args namespace for pairwise-mode display
args = argparse.Namespace(
    bench_name="mt_bench",
    input_file=None,
    judge_model="gpt-4",
    baseline_model="gpt-3.5-turbo",
    model_list=None,
)
display_result_pairwise(args)

Code Reference

Source Location

Function File Lines
display_result_single fastchat/llm_judge/show_result.py L9-36
display_result_pairwise fastchat/llm_judge/show_result.py L39-92

Signature

def display_result_single(args) -> None:
    """Read JSONL judgment file, group by model and turn, compute mean scores, print tables.

    Args:
        args: Namespace with bench_name, input_file, judge_model, model_list attributes.
    """
    ...

def display_result_pairwise(args) -> None:
    """Read JSONL judgment file, compute win/loss/tie rates and adjusted win rate, print table.

    Args:
        args: Namespace with bench_name, input_file, judge_model, baseline_model, model_list attributes.
    """
    ...

Import

from fastchat.llm_judge.show_result import display_result_single, display_result_pairwise

I/O Contract

Inputs

  • Single-mode judgment file: JSONL at data/mt_bench/model_judgment/{judge_model}_single.jsonl; each line contains model, score (float), turn (int), and other metadata.
  • Pairwise-mode judgment file: JSONL at data/mt_bench/model_judgment/{judge_model}_pair.jsonl; each line contains model_1, model_2, g1_winner, g2_winner, and other metadata.
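A minimal way to produce a file in the single-mode input format for testing (the exact extra metadata fields, such as question_id, are assumptions beyond the fields listed above):

```python
import json
import os
import tempfile

import pandas as pd

# Illustrative judgment records with the fields the single-mode display reads.
records = [
    {"question_id": 81, "model": "vicuna-7b-v1.5", "score": 7.0, "turn": 1},
    {"question_id": 81, "model": "vicuna-7b-v1.5", "score": 6.0, "turn": 2},
]

path = os.path.join(tempfile.mkdtemp(), "gpt-4_single.jsonl")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")  # one JSON object per line

# Round-trip check: pandas parses one record per line, as the module does.
df = pd.read_json(path, lines=True)
print(df)
```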

Outputs

Output Format Description
Single-mode tables Printed pandas DataFrames to stdout Three tables: first turn scores, second turn scores, and average scores per model
Pairwise-mode table Printed pandas DataFrame to stdout One table with win, loss, tie counts, win_rate, loss_rate, and win_rate_adjusted per model

Example single-mode output:

########## First turn ##########
                          score
model              turn
vicuna-7b-v1.5     1      6.85
llama-2-7b-chat    1      6.20

########## Second turn ##########
                          score
model              turn
vicuna-7b-v1.5     2      5.95
llama-2-7b-chat    2      5.40

########## Average ##########
                    score
model
vicuna-7b-v1.5      6.40
llama-2-7b-chat     5.80

Example pairwise-mode output:

                    win  loss  tie  win_rate  loss_rate  win_rate_adjusted
model
vicuna-7b-v1.5       45    25   10    0.5625     0.3125             0.6250
llama-2-7b-chat      30    35   15    0.3750     0.4375             0.4688

Usage Examples

Quick Model Comparison (Single Mode)

# After running gen_judgment.py in single mode:
python3 -m fastchat.llm_judge.show_result \
    --mode single \
    --judge-model gpt-4 \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat

Pairwise Leaderboard Against GPT-3.5-Turbo

# After running gen_judgment.py in pairwise-baseline mode:
python3 -m fastchat.llm_judge.show_result \
    --mode pairwise-baseline \
    --judge-model gpt-4 \
    --baseline-model gpt-3.5-turbo

All-Pairs Comparison

# After running gen_judgment.py in pairwise-all mode:
python3 -m fastchat.llm_judge.show_result \
    --mode pairwise-all \
    --judge-model gpt-4

In pairwise-all mode, the baseline_model is automatically set to None, and all model pairs are included in the results.

Related Pages

  • Principle:Lm_sys_FastChat_MT_Bench_Result_Display