
Implementation:Lm sys FastChat Gen Judgment

From Leeroopedia


| Field | Value |
| --- | --- |
| Page Type | Implementation |
| Title | Gen Judgment |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source code analysis of fastchat/llm_judge/gen_judgment.py, fastchat/llm_judge/common.py |
| Domains | LLM Evaluation, Automated Grading, LLM-as-a-Judge, API Integration |
| Last Updated | 2026-02-07 14:00 GMT |
| Implements | Principle:Lm_sys_FastChat_LLM_Judge_Evaluation |

Overview

This implementation provides the functions for generating LLM judge evaluations of model answers. The code is split across two files: gen_judgment.py contains the match-making and judge configuration logic, while common.py contains the data classes, match execution functions, score extraction, and API communication. Together they support single-answer grading, pairwise-baseline comparison, and pairwise-all comparison modes.

Description

The judgment generation pipeline works as follows:

  1. The main script loads questions, model answers, reference answers, and judge prompt templates.
  2. Based on the evaluation mode, it creates Judge objects (one per category/turn combination) using make_judge_single or make_judge_pairwise.
  3. Match objects (MatchSingle or MatchPair) are created by pairing each question with each model's answer (and optionally a baseline model's answer and reference answer).
  4. Questions in NEED_REF_CATS (math, reasoning, coding, arena-hard-200) are routed to reference-based judges; all others use the default judge.
  5. Matches are executed via play_a_match_single or play_a_match_pair, which call run_judge_single or run_judge_pair respectively.
  6. The judge functions format the prompt, call the LLM API (OpenAI or Anthropic), extract the score or winner via regex, and return the result.
  7. Results are appended to a JSONL output file. Parallel execution is supported via ThreadPoolExecutor.
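The seven steps above can be sketched as a minimal driver loop. This is a simplified illustration, not the actual gen_judgment.py code: the routing and parallelism logic mirror the description, but the function bodies are stand-ins.

```python
# Simplified sketch of the judgment pipeline (illustration only;
# the real logic lives in fastchat/llm_judge/gen_judgment.py).
from concurrent.futures import ThreadPoolExecutor

NEED_REF_CATS = ["math", "reasoning", "coding", "arena-hard-200"]

def route_questions(questions):
    """Step 4: split questions into default and reference-based groups."""
    default = [q for q in questions if q["category"] not in NEED_REF_CATS]
    need_ref = [q for q in questions if q["category"] in NEED_REF_CATS]
    return default, need_ref

def run_matches(matches, play_fn, parallel=1):
    """Steps 5-7: execute each match, optionally via a thread pool."""
    if parallel == 1:
        return [play_fn(m) for m in matches]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(play_fn, matches))
```

In the real script, `play_fn` would be `play_a_match_single` or `play_a_match_pair` with the output file bound in, and each call appends its result to the JSONL file as it completes.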

Usage

Command-Line Interface

# Single-answer grading
python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --mode single \
    --parallel 4

# Pairwise comparison against baseline
python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --baseline-model gpt-3.5-turbo \
    --mode pairwise-baseline \
    --parallel 4

# Pairwise comparison among all model pairs
python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat alpaca-7b \
    --judge-model gpt-4 \
    --mode pairwise-all \
    --parallel 4

CLI Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| --bench-name | str | "mt_bench" | Name of the benchmark question set |
| --judge-file | str | "data/judge_prompts.jsonl" | Path to the judge prompt templates file |
| --judge-model | str | "gpt-4" | The LLM to use as the judge |
| --baseline-model | str | "gpt-3.5-turbo" | Baseline model for pairwise-baseline mode |
| --mode | str | "single" | Evaluation mode: "single", "pairwise-baseline", or "pairwise-all" |
| --model-list | list[str] | None | List of model IDs to evaluate (auto-detected from the answer directory if not set) |
| --parallel | int | 1 | Number of concurrent API calls |
| --first-n | int | None | Debug option: only run the first N judgments |

Programmatic Usage

from fastchat.llm_judge.gen_judgment import make_match_single, make_judge_single
from fastchat.llm_judge.common import (
    load_questions,
    load_model_answers,
    load_judge_prompts,
    play_a_match_single,
    Judge,
    MatchSingle,
)

# Load data
questions = load_questions("data/mt_bench/question.jsonl", None, None)
model_answers = load_model_answers("data/mt_bench/model_answer")
ref_answers = load_model_answers("data/mt_bench/reference_answer")
judge_prompts = load_judge_prompts("data/judge_prompts.jsonl")

# Create judges
judges = make_judge_single("gpt-4", judge_prompts)

# Create matches for default (non-math) questions
default_questions = [q for q in questions if q["category"] not in ["math", "reasoning", "coding"]]
matches = make_match_single(default_questions, ["vicuna-7b-v1.5"], model_answers, judges["default"])

# Execute matches
for match in matches:
    result = play_a_match_single(match, output_file="data/mt_bench/model_judgment/gpt-4_single.jsonl")
    print(f"Question {result['question_id']}: score={result['score']}")

Code Reference

Source Location

| Function / Class | File | Lines |
| --- | --- | --- |
| make_match_single | fastchat/llm_judge/gen_judgment.py | L108-134 |
| make_match (pairwise-baseline) | fastchat/llm_judge/gen_judgment.py | L27-65 |
| make_match_all_pairs | fastchat/llm_judge/gen_judgment.py | L68-105 |
| make_judge_single | fastchat/llm_judge/gen_judgment.py | L153-166 |
| make_judge_pairwise | fastchat/llm_judge/gen_judgment.py | L137-150 |
| Judge (dataclass) | fastchat/llm_judge/common.py | L58-63 |
| MatchSingle (dataclass) | fastchat/llm_judge/common.py | L66-73 |
| MatchPair (dataclass) | fastchat/llm_judge/common.py | L76-85 |
| play_a_match_single | fastchat/llm_judge/common.py | L192-232 |
| play_a_match_pair | fastchat/llm_judge/common.py | L313-404 |
| run_judge_single | fastchat/llm_judge/common.py | L135-189 |
| run_judge_pair | fastchat/llm_judge/common.py | L235-310 |
| chat_completion_openai | fastchat/llm_judge/common.py | L407-428 |
| chat_completion_anthropic | fastchat/llm_judge/common.py | L470-493 |

Signature

def make_match_single(
    questions,
    models,
    model_answers,
    judge,
    baseline_model=None,
    ref_answers=None,
    multi_turn=False,
) -> list[MatchSingle]:
    ...
def make_judge_single(judge_model, judge_prompts) -> dict[str, Judge]:
    ...
def play_a_match_single(match: MatchSingle, output_file: str) -> dict:
    ...
def play_a_match_pair(match: MatchPair, output_file: str) -> dict:
    ...
def run_judge_single(question, answer, judge, ref_answer, multi_turn=False) -> tuple[float, str, str]:
    # Returns (rating, user_prompt, judgment)
    ...
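The regex-based score extraction behind run_judge_single works roughly like this. The patterns are approximated from common.py; treat the exact expressions as an illustration of the technique rather than a verbatim copy.

```python
import re

# The judge is instructed to emit a "[[8]]"-style rating; a single-bracket
# fallback and a -1 sentinel on failure match the behavior described above.
one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")

def extract_score(judgment: str) -> float:
    """Pull the numeric rating out of the judge's free-text response."""
    match = one_score_pattern.search(judgment) or one_score_pattern_backup.search(judgment)
    return float(match.group(1)) if match else -1.0
```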

Data classes:

@dataclasses.dataclass
class Judge:
    model_name: str
    prompt_template: dict
    ref_based: bool = False
    multi_turn: bool = False

@dataclasses.dataclass
class MatchSingle:
    question: dict
    model: str
    answer: dict
    judge: Judge
    ref_answer: dict = None
    multi_turn: bool = False

@dataclasses.dataclass
class MatchPair:
    question: dict
    model_1: str
    model_2: str
    answer_1: dict
    answer_2: dict
    judge: Judge
    ref_answer: dict = None
    multi_turn: bool = False
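As a quick illustration of how these pieces fit together, a MatchSingle can be assembled by hand. The dataclass definitions repeat the ones above; the question and answer values are hypothetical stand-ins shaped like the MT-bench JSONL records.

```python
import dataclasses

@dataclasses.dataclass
class Judge:
    model_name: str
    prompt_template: dict
    ref_based: bool = False
    multi_turn: bool = False

@dataclasses.dataclass
class MatchSingle:
    question: dict
    model: str
    answer: dict
    judge: Judge
    ref_answer: dict = None
    multi_turn: bool = False

# Hypothetical data shaped like the MT-bench question/answer files.
judge = Judge("gpt-4", {"name": "single-v1"})
match = MatchSingle(
    question={"question_id": 81, "category": "writing", "turns": ["..."]},
    model="vicuna-7b-v1.5",
    answer={"question_id": 81, "model_id": "vicuna-7b-v1.5",
            "choices": [{"index": 0, "turns": ["..."]}]},
    judge=judge,
)
```

Reference-based matches differ only in carrying a non-None `ref_answer` and a judge with `ref_based=True`.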

Import

from fastchat.llm_judge.gen_judgment import make_match_single, make_judge_single
from fastchat.llm_judge.gen_judgment import make_match, make_match_all_pairs, make_judge_pairwise
from fastchat.llm_judge.common import (
    play_a_match_single,
    play_a_match_pair,
    run_judge_single,
    run_judge_pair,
    Judge,
    MatchSingle,
    MatchPair,
    NEED_REF_CATS,
    load_questions,
    load_model_answers,
    load_judge_prompts,
    check_data,
    get_model_list,
)

I/O Contract

Inputs

| Input | Format | Description |
| --- | --- | --- |
| Question file | JSONL (data/mt_bench/question.jsonl) | 80 multi-turn questions with question_id, category, turns |
| Model answer files | JSONL directory (data/mt_bench/model_answer/*.jsonl) | One file per model; each line contains question_id, model_id, choices |
| Reference answer files | JSONL directory (data/mt_bench/reference_answer/*.jsonl) | GPT-4 reference answers for math/reasoning/coding categories |
| Judge prompt templates | JSONL (data/judge_prompts.jsonl) | Prompt configurations with name, type, system_prompt, prompt_template, output_format |
| API credentials | Environment variables | OPENAI_API_KEY for OpenAI models, ANTHROPIC_API_KEY for Anthropic models |
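For orientation, a judge prompt template line has roughly the shape shown below, and templates are keyed by their name field. The field values here are abridged and hypothetical, and the minimal loader is an equivalent sketch, not the FastChat load_judge_prompts function itself.

```python
import json

# Hypothetical, abridged "single-v1" template line; the real file ships
# with FastChat as data/judge_prompts.jsonl.
template_line = json.dumps({
    "name": "single-v1",
    "type": "single",
    "system_prompt": "You are a helpful assistant.",
    "prompt_template": "[Question]\n{question}\n\n[Answer]\n{answer}",
    "output_format": "[[rating]]",
})

def load_judge_prompts_from_lines(lines):
    """Sketch of the loader: one JSON object per line, keyed by "name"."""
    prompts = {}
    for line in lines:
        if line.strip():
            record = json.loads(line)
            prompts[record["name"]] = record
    return prompts
```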

Outputs

Single-Answer Mode

Output file: data/mt_bench/model_judgment/{judge_model}_single.jsonl

| Field | Type | Description |
| --- | --- | --- |
| question_id | int | The question identifier |
| model | str | The evaluated model's identifier |
| judge | tuple[str, str] | The judge model name and prompt template name |
| user_prompt | str | The prompt sent to the judge |
| judgment | str | The judge's full text response |
| score | float | Extracted score (1-10), or -1 on extraction failure |
| turn | int | 1 for first turn, 2 for multi-turn evaluation |
| tstamp | float | Unix timestamp |
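Putting these fields together, one line of the single-answer output file might look like the following (all values hypothetical; the judge tuple serializes as a JSON array):

```python
import json

# Hypothetical judgment record with the fields listed above.
line = json.dumps({
    "question_id": 81,
    "model": "vicuna-7b-v1.5",
    "judge": ["gpt-4", "single-v1"],
    "user_prompt": "[Question] ...",
    "judgment": "The answer is clear and helpful. Rating: [[8]]",
    "score": 8.0,
    "turn": 1,
    "tstamp": 1700000000.0,
})

record = json.loads(line)
# Failed extractions are stored with score -1, so filter before averaging.
valid = [record] if record["score"] != -1 else []
```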

Pairwise Mode

Output file: data/mt_bench/model_judgment/{judge_model}_pair.jsonl

| Field | Type | Description |
| --- | --- | --- |
| question_id | int | The question identifier |
| model_1 | str | First model identifier |
| model_2 | str | Second model identifier |
| g1_winner | str | Winner from game 1 ("model_1", "model_2", "tie", or "error") |
| g2_winner | str | Winner from game 2 (positions swapped) |
| judge | tuple[str, str] | The judge model name and prompt template name |
| g1_user_prompt | str | Prompt for game 1 |
| g1_judgment | str | Judge's response for game 1 |
| g2_user_prompt | str | Prompt for game 2 |
| g2_judgment | str | Judge's response for game 2 |
| turn | int | 1 for first turn, 2 for multi-turn evaluation |
| tstamp | float | Unix timestamp |

Usage Examples

Single-Answer Grading with Parallel API Calls

python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --mode single \
    --parallel 8

This evaluates two models using GPT-4 as a single-answer judge with 8 concurrent API calls. For 80 questions, 2 models, and 2 turns (first + multi-turn), this creates 320 total matches.
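The match-count arithmetic generalizes across the three modes. The helper below is a hypothetical illustration; it assumes every question yields both a first-turn and a multi-turn match, and that pairwise-all counts unordered model pairs.

```python
from itertools import combinations

def count_matches(n_questions, models, mode, turns=2):
    """Illustrative match counts per evaluation mode (not FastChat code)."""
    if mode == "single":
        return n_questions * len(models) * turns
    if mode == "pairwise-baseline":
        # Each listed model is compared against the single baseline model.
        return n_questions * len(models) * turns
    if mode == "pairwise-all":
        n_pairs = len(list(combinations(models, 2)))
        return n_questions * n_pairs * turns
    raise ValueError(f"unknown mode: {mode}")
```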

Pairwise Baseline Comparison

python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --baseline-model gpt-3.5-turbo \
    --mode pairwise-baseline \
    --parallel 4

Each model is compared against the GPT-3.5-Turbo baseline. Each comparison runs two games (swapped positions) to mitigate position bias.
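Reducing the two position-swapped games to a single verdict commonly follows the convention that agreement decides the winner and cross-position disagreement counts as a tie. The function below illustrates that convention; it is a sketch, not the exact FastChat aggregation code.

```python
# Sketch: collapse two position-swapped games into one verdict.
# Convention (illustrative): agreement wins, disagreement is a tie,
# and any API/extraction error propagates.
def combine_games(g1_winner: str, g2_winner: str) -> str:
    if "error" in (g1_winner, g2_winner):
        return "error"
    if g1_winner == g2_winner:
        return g1_winner  # both orderings agree on the winner (or on a tie)
    return "tie"          # the judge flipped with position, so call it a tie
```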

Programmatic Access to Judge Results

from fastchat.llm_judge.common import load_single_model_judgments

judgments = load_single_model_judgments("data/mt_bench/model_judgment/gpt-4_single.jsonl")
for judge_key, results in judgments.items():
    print(f"Judge: {judge_key}")
    for game_key, result in results.items():
        qid, model = game_key
        print(f"  Q{qid} ({model}): score={result['score']}")
