
Implementation:Lm sys FastChat Gen Judgment

From Leeroopedia


| Field | Value |
| --- | --- |
| Page Type | Implementation |
| Title | Gen Judgment |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source code analysis of fastchat/llm_judge/gen_judgment.py, fastchat/llm_judge/common.py |
| Domains | LLM Evaluation, Automated Grading, LLM-as-a-Judge, API Integration |
| Last Updated | 2026-02-07 14:00 GMT |
| Implements | Principle:Lm_sys_FastChat_LLM_Judge_Evaluation |

Overview

This implementation provides the functions for generating LLM judge evaluations of model answers. The code is split across two files: gen_judgment.py contains the match-making and judge configuration logic, while common.py contains the data classes, match execution functions, score extraction, and API communication. Together they support single-answer grading, pairwise-baseline comparison, and pairwise-all comparison modes.

Description

The judgment generation pipeline works as follows:

  1. The main script loads questions, model answers, reference answers, and judge prompt templates.
  2. Based on the evaluation mode, it creates Judge objects (one per category/turn combination) using make_judge_single or make_judge_pairwise.
  3. Match objects (MatchSingle or MatchPair) are created by pairing each question with each model's answer (and optionally a baseline model's answer and reference answer).
  4. Questions in NEED_REF_CATS (math, reasoning, coding, arena-hard-200) are routed to reference-based judges; all others use the default judge.
  5. Matches are executed via play_a_match_single or play_a_match_pair, which call run_judge_single or run_judge_pair respectively.
  6. The judge functions format the prompt, call the LLM API (OpenAI or Anthropic), extract the score or winner via regex, and return the result.
  7. Results are appended to a JSONL output file. Parallel execution is supported via ThreadPoolExecutor.
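The seven steps above can be sketched as a minimal driver loop. This is a simplified illustration, not the actual gen_judgment.py code: the routing and parallelism logic mirror the description, but the function bodies are stand-ins.

```python
# Simplified sketch of the judgment pipeline (illustration only;
# the real logic lives in fastchat/llm_judge/gen_judgment.py).
from concurrent.futures import ThreadPoolExecutor

NEED_REF_CATS = ["math", "reasoning", "coding", "arena-hard-200"]

def route_questions(questions):
    """Step 4: split questions into default and reference-based groups."""
    default = [q for q in questions if q["category"] not in NEED_REF_CATS]
    need_ref = [q for q in questions if q["category"] in NEED_REF_CATS]
    return default, need_ref

def run_matches(matches, play_fn, parallel=1):
    """Steps 5-7: execute each match, optionally via a thread pool."""
    if parallel == 1:
        return [play_fn(m) for m in matches]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(play_fn, matches))
```

In the real script, `play_fn` would be `play_a_match_single` or `play_a_match_pair` with the output file bound in, and each call appends its result to the JSONL file as it completes.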

Usage

Command-Line Interface

# Single-answer grading
python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --mode single \
    --parallel 4

# Pairwise comparison against baseline
python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --baseline-model gpt-3.5-turbo \
    --mode pairwise-baseline \
    --parallel 4

# Pairwise comparison among all model pairs
python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat alpaca-7b \
    --judge-model gpt-4 \
    --mode pairwise-all \
    --parallel 4

CLI Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --bench-name | str | "mt_bench" | Name of the benchmark question set |
| --judge-file | str | "data/judge_prompts.jsonl" | Path to the judge prompt templates file |
| --judge-model | str | "gpt-4" | The LLM to use as the judge |
| --baseline-model | str | "gpt-3.5-turbo" | Baseline model for pairwise-baseline mode |
| --mode | str | "single" | Evaluation mode: "single", "pairwise-baseline", or "pairwise-all" |
| --model-list | list[str] | None | List of model IDs to evaluate (auto-detected from the answer directory if not set) |
| --parallel | int | 1 | Number of concurrent API calls |
| --first-n | int | None | Debug option: only run the first N judgments |

Programmatic Usage

from fastchat.llm_judge.gen_judgment import make_match_single, make_judge_single
from fastchat.llm_judge.common import (
    load_questions,
    load_model_answers,
    load_judge_prompts,
    play_a_match_single,
    Judge,
    MatchSingle,
)

# Load data
questions = load_questions("data/mt_bench/question.jsonl", None, None)
model_answers = load_model_answers("data/mt_bench/model_answer")
ref_answers = load_model_answers("data/mt_bench/reference_answer")
judge_prompts = load_judge_prompts("data/judge_prompts.jsonl")

# Create judges
judges = make_judge_single("gpt-4", judge_prompts)

# Create matches for default (non-math) questions
default_questions = [q for q in questions if q["category"] not in ["math", "reasoning", "coding"]]
matches = make_match_single(default_questions, ["vicuna-7b-v1.5"], model_answers, judges["default"])

# Execute matches
for match in matches:
    result = play_a_match_single(match, output_file="data/mt_bench/model_judgment/gpt-4_single.jsonl")
    print(f"Question {result['question_id']}: score={result['score']}")

Code Reference

Source Location

| Function / Class | File | Lines |
| --- | --- | --- |
| make_match_single | fastchat/llm_judge/gen_judgment.py | L108-134 |
| make_match (pairwise-baseline) | fastchat/llm_judge/gen_judgment.py | L27-65 |
| make_match_all_pairs | fastchat/llm_judge/gen_judgment.py | L68-105 |
| make_judge_single | fastchat/llm_judge/gen_judgment.py | L153-166 |
| make_judge_pairwise | fastchat/llm_judge/gen_judgment.py | L137-150 |
| Judge (dataclass) | fastchat/llm_judge/common.py | L58-63 |
| MatchSingle (dataclass) | fastchat/llm_judge/common.py | L66-73 |
| MatchPair (dataclass) | fastchat/llm_judge/common.py | L76-85 |
| play_a_match_single | fastchat/llm_judge/common.py | L192-232 |
| play_a_match_pair | fastchat/llm_judge/common.py | L313-404 |
| run_judge_single | fastchat/llm_judge/common.py | L135-189 |
| run_judge_pair | fastchat/llm_judge/common.py | L235-310 |
| chat_completion_openai | fastchat/llm_judge/common.py | L407-428 |
| chat_completion_anthropic | fastchat/llm_judge/common.py | L470-493 |

Signature

def make_match_single(
    questions,
    models,
    model_answers,
    judge,
    baseline_model=None,
    ref_answers=None,
    multi_turn=False,
) -> list[MatchSingle]:
    ...
def make_judge_single(judge_model, judge_prompts) -> dict[str, Judge]:
    ...
def play_a_match_single(match: MatchSingle, output_file: str) -> dict:
    ...
def play_a_match_pair(match: MatchPair, output_file: str) -> dict:
    ...
def run_judge_single(question, answer, judge, ref_answer, multi_turn=False) -> tuple[float, str, str]:
    # Returns (rating, user_prompt, judgment)
    ...
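The regex-based score extraction behind run_judge_single works roughly like this. The patterns are approximated from common.py; treat the exact expressions as an illustration of the technique rather than a verbatim copy.

```python
import re

# The judge is instructed to emit a "[[8]]"-style rating; a single-bracket
# fallback and a -1 sentinel on failure match the behavior described above.
one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")

def extract_score(judgment: str) -> float:
    """Pull the numeric rating out of the judge's free-text response."""
    match = one_score_pattern.search(judgment) or one_score_pattern_backup.search(judgment)
    return float(match.group(1)) if match else -1.0
```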

Data classes:

@dataclasses.dataclass
class Judge:
    model_name: str
    prompt_template: dict
    ref_based: bool = False
    multi_turn: bool = False

@dataclasses.dataclass
class MatchSingle:
    question: dict
    model: str
    answer: dict
    judge: Judge
    ref_answer: dict = None
    multi_turn: bool = False

@dataclasses.dataclass
class MatchPair:
    question: dict
    model_1: str
    model_2: str
    answer_1: dict
    answer_2: dict
    judge: Judge
    ref_answer: dict = None
    multi_turn: bool = False
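As a quick illustration of how these pieces fit together, a MatchSingle can be assembled by hand. The dataclass definitions repeat the ones above; the question and answer values are hypothetical stand-ins shaped like the MT-bench JSONL records.

```python
import dataclasses

@dataclasses.dataclass
class Judge:
    model_name: str
    prompt_template: dict
    ref_based: bool = False
    multi_turn: bool = False

@dataclasses.dataclass
class MatchSingle:
    question: dict
    model: str
    answer: dict
    judge: Judge
    ref_answer: dict = None
    multi_turn: bool = False

# Hypothetical data shaped like the MT-bench question/answer files.
judge = Judge("gpt-4", {"name": "single-v1"})
match = MatchSingle(
    question={"question_id": 81, "category": "writing", "turns": ["..."]},
    model="vicuna-7b-v1.5",
    answer={"question_id": 81, "model_id": "vicuna-7b-v1.5",
            "choices": [{"index": 0, "turns": ["..."]}]},
    judge=judge,
)
```

Reference-based matches differ only in carrying a non-None `ref_answer` and a judge with `ref_based=True`.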

Import

from fastchat.llm_judge.gen_judgment import make_match_single, make_judge_single
from fastchat.llm_judge.gen_judgment import make_match, make_match_all_pairs, make_judge_pairwise
from fastchat.llm_judge.common import (
    play_a_match_single,
    play_a_match_pair,
    run_judge_single,
    run_judge_pair,
    Judge,
    MatchSingle,
    MatchPair,
    NEED_REF_CATS,
    load_questions,
    load_model_answers,
    load_judge_prompts,
    check_data,
    get_model_list,
)

I/O Contract

Inputs

| Input | Format | Description |
| --- | --- | --- |
| Question file | JSONL (data/mt_bench/question.jsonl) | 80 multi-turn questions with question_id, category, turns |
| Model answer files | JSONL directory (data/mt_bench/model_answer/*.jsonl) | One file per model; each line contains question_id, model_id, choices |
| Reference answer files | JSONL directory (data/mt_bench/reference_answer/*.jsonl) | GPT-4 reference answers for math/reasoning/coding categories |
| Judge prompt templates | JSONL (data/judge_prompts.jsonl) | Prompt configurations with name, type, system_prompt, prompt_template, output_format |
| API credentials | Environment variables | OPENAI_API_KEY for OpenAI models, ANTHROPIC_API_KEY for Anthropic models |
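For orientation, a judge prompt template line has roughly the shape shown below, and templates are keyed by their name field. The field values here are abridged and hypothetical, and the minimal loader is an equivalent sketch, not the FastChat load_judge_prompts function itself.

```python
import json

# Hypothetical, abridged "single-v1" template line; the real file ships
# with FastChat as data/judge_prompts.jsonl.
template_line = json.dumps({
    "name": "single-v1",
    "type": "single",
    "system_prompt": "You are a helpful assistant.",
    "prompt_template": "[Question]\n{question}\n\n[Answer]\n{answer}",
    "output_format": "[[rating]]",
})

def load_judge_prompts_from_lines(lines):
    """Sketch of the loader: one JSON object per line, keyed by "name"."""
    prompts = {}
    for line in lines:
        if line.strip():
            record = json.loads(line)
            prompts[record["name"]] = record
    return prompts
```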

Outputs

Single-Answer Mode

Output file: data/mt_bench/model_judgment/{judge_model}_single.jsonl

| Field | Type | Description |
| --- | --- | --- |
| question_id | int | The question identifier |
| model | str | The evaluated model's identifier |
| judge | tuple[str, str] | The judge model name and prompt template name |
| user_prompt | str | The prompt sent to the judge |
| judgment | str | The judge's full text response |
| score | float | Extracted score (1-10), or -1 on extraction failure |
| turn | int | 1 for first turn, 2 for multi-turn evaluation |
| tstamp | float | Unix timestamp |
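Putting these fields together, one line of the single-answer output file might look like the following (all values hypothetical; the judge tuple serializes as a JSON array):

```python
import json

# Hypothetical judgment record with the fields listed above.
line = json.dumps({
    "question_id": 81,
    "model": "vicuna-7b-v1.5",
    "judge": ["gpt-4", "single-v1"],
    "user_prompt": "[Question] ...",
    "judgment": "The answer is clear and helpful. Rating: [[8]]",
    "score": 8.0,
    "turn": 1,
    "tstamp": 1700000000.0,
})

record = json.loads(line)
# Failed extractions are stored with score -1, so filter before averaging.
valid = [record] if record["score"] != -1 else []
```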

Pairwise Mode

Output file: data/mt_bench/model_judgment/{judge_model}_pair.jsonl

| Field | Type | Description |
| --- | --- | --- |
| question_id | int | The question identifier |
| model_1 | str | First model identifier |
| model_2 | str | Second model identifier |
| g1_winner | str | Winner from game 1 ("model_1", "model_2", "tie", or "error") |
| g2_winner | str | Winner from game 2 (positions swapped) |
| judge | tuple[str, str] | The judge model name and prompt template name |
| g1_user_prompt | str | Prompt for game 1 |
| g1_judgment | str | Judge's response for game 1 |
| g2_user_prompt | str | Prompt for game 2 |
| g2_judgment | str | Judge's response for game 2 |
| turn | int | 1 for first turn, 2 for multi-turn evaluation |
| tstamp | float | Unix timestamp |

Usage Examples

Single-Answer Grading with Parallel API Calls

python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --mode single \
    --parallel 8

This evaluates two models using GPT-4 as a single-answer judge with 8 concurrent API calls. For 80 questions, 2 models, and 2 turns (first + multi-turn), this creates 320 total matches.
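The match-count arithmetic generalizes across the three modes. The helper below is a hypothetical illustration; it assumes every question yields both a first-turn and a multi-turn match, and that pairwise-all counts unordered model pairs.

```python
from itertools import combinations

def count_matches(n_questions, models, mode, turns=2):
    """Illustrative match counts per evaluation mode (not FastChat code)."""
    if mode == "single":
        return n_questions * len(models) * turns
    if mode == "pairwise-baseline":
        # Each listed model is compared against the single baseline model.
        return n_questions * len(models) * turns
    if mode == "pairwise-all":
        n_pairs = len(list(combinations(models, 2)))
        return n_questions * n_pairs * turns
    raise ValueError(f"unknown mode: {mode}")
```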

Pairwise Baseline Comparison

python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --baseline-model gpt-3.5-turbo \
    --mode pairwise-baseline \
    --parallel 4

Each model is compared against the GPT-3.5-Turbo baseline. Each comparison runs two games (swapped positions) to mitigate position bias.
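Reducing the two position-swapped games to a single verdict commonly follows the convention that agreement decides the winner and cross-position disagreement counts as a tie. The function below illustrates that convention; it is a sketch, not the exact FastChat aggregation code.

```python
# Sketch: collapse two position-swapped games into one verdict.
# Convention (illustrative): agreement wins, disagreement is a tie,
# and any API/extraction error propagates.
def combine_games(g1_winner: str, g2_winner: str) -> str:
    if "error" in (g1_winner, g2_winner):
        return "error"
    if g1_winner == g2_winner:
        return g1_winner  # both orderings agree on the winner (or on a tie)
    return "tie"          # the judge flipped with position, so call it a tie
```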

Programmatic Access to Judge Results

from fastchat.llm_judge.common import load_single_model_judgments

judgments = load_single_model_judgments("data/mt_bench/model_judgment/gpt-4_single.jsonl")
for judge_key, results in judgments.items():
    print(f"Judge: {judge_key}")
    for game_key, result in results.items():
        qid, model = game_key
        print(f"  Q{qid} ({model}): score={result['score']}")
