Implementation:Lm_sys_FastChat_Gen_Judgment
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Gen Judgment |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source code analysis of fastchat/llm_judge/gen_judgment.py, fastchat/llm_judge/common.py |
| Domains | LLM Evaluation, Automated Grading, LLM-as-a-Judge, API Integration |
| Last Updated | 2026-02-07 14:00 GMT |
| Implements | Principle:Lm_sys_FastChat_LLM_Judge_Evaluation |
Overview
This implementation provides the functions for generating LLM judge evaluations of model answers. The code is split across two files: gen_judgment.py contains the match-making and judge configuration logic, while common.py contains the data classes, match execution functions, score extraction, and API communication. Together they support single-answer grading, pairwise-baseline comparison, and pairwise-all comparison modes.
Description
The judgment generation pipeline works as follows:
- The main script loads questions, model answers, reference answers, and judge prompt templates.
- Based on the evaluation mode, it creates Judge objects (one per category/turn combination) using `make_judge_single` or `make_judge_pairwise`.
- Match objects (`MatchSingle` or `MatchPair`) are created by pairing each question with each model's answer (and optionally a baseline model's answer and a reference answer).
- Questions in `NEED_REF_CATS` (math, reasoning, coding, arena-hard-200) are routed to reference-based judges; all others use the default judge.
- Matches are executed via `play_a_match_single` or `play_a_match_pair`, which call `run_judge_single` or `run_judge_pair` respectively.
- The judge functions format the prompt, call the LLM API (OpenAI or Anthropic), extract the score or winner via regex, and return the result.
- Results are appended to a JSONL output file. Parallel execution is supported via `ThreadPoolExecutor`.
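The parallel path in the last step can be sketched as follows; `run_matches` and `fake_play` are hypothetical stand-ins for the script's loop and `play_a_match_single`, but the I/O-bound thread-pool pattern mirrors the one described above.

```python
from concurrent.futures import ThreadPoolExecutor

def run_matches(matches, play_fn, parallel=1):
    """Execute matches sequentially, or with a pool of API-worker threads."""
    if parallel == 1:
        return [play_fn(m) for m in matches]
    with ThreadPoolExecutor(max_workers=parallel) as executor:
        # Threads suffice here: the workload is I/O-bound API calls,
        # and executor.map preserves the input order of matches.
        return list(executor.map(play_fn, matches))

# Toy stand-in for play_a_match_single: "judges" a match dict.
def fake_play(match):
    return {"question_id": match["question_id"], "score": 7.0}

matches = [{"question_id": i} for i in range(4)]
results = run_matches(matches, fake_play, parallel=2)
```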
Usage
Command-Line Interface
```shell
# Single-answer grading
python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --mode single \
    --parallel 4

# Pairwise comparison against baseline
python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --baseline-model gpt-3.5-turbo \
    --mode pairwise-baseline \
    --parallel 4

# Pairwise comparison among all model pairs
python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat alpaca-7b \
    --judge-model gpt-4 \
    --mode pairwise-all \
    --parallel 4
```
CLI Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--bench-name` | str | `"mt_bench"` | Name of the benchmark question set |
| `--judge-file` | str | `"data/judge_prompts.jsonl"` | Path to the judge prompt templates file |
| `--judge-model` | str | `"gpt-4"` | The LLM to use as the judge |
| `--baseline-model` | str | `"gpt-3.5-turbo"` | Baseline model for pairwise-baseline mode |
| `--mode` | str | `"single"` | Evaluation mode: `"single"`, `"pairwise-baseline"`, or `"pairwise-all"` |
| `--model-list` | list[str] | None | List of model IDs to evaluate (auto-detected from the answer directory if not set) |
| `--parallel` | int | 1 | Number of concurrent API calls |
| `--first-n` | int | None | Debug option: only run the first N judgments |
Programmatic Usage
```python
from fastchat.llm_judge.gen_judgment import make_match_single, make_judge_single
from fastchat.llm_judge.common import (
    load_questions,
    load_model_answers,
    load_judge_prompts,
    play_a_match_single,
    Judge,
    MatchSingle,
)

# Load data
questions = load_questions("data/mt_bench/question.jsonl", None, None)
model_answers = load_model_answers("data/mt_bench/model_answer")
ref_answers = load_model_answers("data/mt_bench/reference_answer")
judge_prompts = load_judge_prompts("data/judge_prompts.jsonl")

# Create judges
judges = make_judge_single("gpt-4", judge_prompts)

# Create matches for default (non-math) questions
default_questions = [q for q in questions if q["category"] not in ["math", "reasoning", "coding"]]
matches = make_match_single(default_questions, ["vicuna-7b-v1.5"], model_answers, judges["default"])

# Execute matches
for match in matches:
    result = play_a_match_single(match, output_file="data/mt_bench/model_judgment/gpt-4_single.jsonl")
    print(f"Question {result['question_id']}: score={result['score']}")
```
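The `score` in the returned result is pulled out of the judge's free-text response by regex. A minimal sketch of that extraction, with patterns modeled on (but not copied verbatim from) common.py's single-answer format, where the judge is instructed to emit a rating like `[[8]]`:

```python
import re

# The judge is asked to wrap its rating in double brackets, e.g.
# "Rating: [[8]]"; the single-bracket fallback catches judges that
# drop one pair of brackets.
ONE_SCORE = re.compile(r"\[\[(\d+\.?\d*)\]\]")
ONE_SCORE_BACKUP = re.compile(r"\[(\d+\.?\d*)\]")

def extract_score(judgment: str) -> float:
    """Return the rating found in the judgment text, or -1 on failure."""
    match = ONE_SCORE.search(judgment) or ONE_SCORE_BACKUP.search(judgment)
    return float(match.group(1)) if match else -1
```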
Code Reference
Source Location
| Function / Class | File | Lines |
|---|---|---|
| `make_match_single` | fastchat/llm_judge/gen_judgment.py | L108-134 |
| `make_match` (pairwise-baseline) | fastchat/llm_judge/gen_judgment.py | L27-65 |
| `make_match_all_pairs` | fastchat/llm_judge/gen_judgment.py | L68-105 |
| `make_judge_single` | fastchat/llm_judge/gen_judgment.py | L153-166 |
| `make_judge_pairwise` | fastchat/llm_judge/gen_judgment.py | L137-150 |
| `Judge` (dataclass) | fastchat/llm_judge/common.py | L58-63 |
| `MatchSingle` (dataclass) | fastchat/llm_judge/common.py | L66-73 |
| `MatchPair` (dataclass) | fastchat/llm_judge/common.py | L76-85 |
| `play_a_match_single` | fastchat/llm_judge/common.py | L192-232 |
| `play_a_match_pair` | fastchat/llm_judge/common.py | L313-404 |
| `run_judge_single` | fastchat/llm_judge/common.py | L135-189 |
| `run_judge_pair` | fastchat/llm_judge/common.py | L235-310 |
| `chat_completion_openai` | fastchat/llm_judge/common.py | L407-428 |
| `chat_completion_anthropic` | fastchat/llm_judge/common.py | L470-493 |
Signature
```python
def make_match_single(
    questions,
    models,
    model_answers,
    judge,
    baseline_model=None,
    ref_answers=None,
    multi_turn=False,
) -> list[MatchSingle]:
    ...

def make_judge_single(judge_model, judge_prompts) -> dict[str, Judge]:
    ...

def play_a_match_single(match: MatchSingle, output_file: str) -> dict:
    ...

def play_a_match_pair(match: MatchPair, output_file: str) -> dict:
    ...

def run_judge_single(question, answer, judge, ref_answer, multi_turn=False) -> tuple[float, str, str]:
    # Returns (rating, user_prompt, judgment)
    ...
```
Data classes:
```python
import dataclasses

@dataclasses.dataclass
class Judge:
    model_name: str
    prompt_template: dict
    ref_based: bool = False
    multi_turn: bool = False

@dataclasses.dataclass
class MatchSingle:
    question: dict
    model: str
    answer: dict
    judge: Judge
    ref_answer: dict = None
    multi_turn: bool = False

@dataclasses.dataclass
class MatchPair:
    question: dict
    model_1: str
    model_2: str
    answer_1: dict
    answer_2: dict
    judge: Judge
    ref_answer: dict = None
    multi_turn: bool = False
```
Import
```python
from fastchat.llm_judge.gen_judgment import make_match_single, make_judge_single
from fastchat.llm_judge.gen_judgment import make_match, make_match_all_pairs, make_judge_pairwise
from fastchat.llm_judge.common import (
    play_a_match_single,
    play_a_match_pair,
    run_judge_single,
    run_judge_pair,
    Judge,
    MatchSingle,
    MatchPair,
    NEED_REF_CATS,
    load_questions,
    load_model_answers,
    load_judge_prompts,
    check_data,
    get_model_list,
)
```
I/O Contract
Inputs
| Input | Format | Description |
|---|---|---|
| Question file | JSONL (data/mt_bench/question.jsonl) | 80 multi-turn questions with `question_id`, `category`, `turns` |
| Model answer files | JSONL directory (data/mt_bench/model_answer/*.jsonl) | One file per model; each line contains `question_id`, `model_id`, `choices` |
| Reference answer files | JSONL directory (data/mt_bench/reference_answer/*.jsonl) | GPT-4 reference answers for math/reasoning/coding categories |
| Judge prompt templates | JSONL (data/judge_prompts.jsonl) | Prompt configurations with `name`, `type`, `system_prompt`, `prompt_template`, `output_format` |
| API credentials | Environment variables | `OPENAI_API_KEY` for OpenAI models, `ANTHROPIC_API_KEY` for Anthropic models |
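All of these data files share the JSONL convention of one JSON object per line. A hedged sketch of the loading pattern — `load_jsonl` is a hypothetical helper, and the two sample lines are fabricated mt_bench-style records:

```python
import io
import json

def load_jsonl(fp):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in fp if line.strip()]

# Two fabricated mt_bench-style question lines for illustration.
sample = io.StringIO(
    '{"question_id": 81, "category": "writing", "turns": ["Compose a travel blog post.", "Rewrite it as a limerick."]}\n'
    '{"question_id": 101, "category": "math", "turns": ["Solve x + 2 = 7.", "Now solve 2x = 10."]}\n'
)
questions = load_jsonl(sample)
```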
Outputs
Single-Answer Mode
Output file: `data/mt_bench/model_judgment/{judge_model}_single.jsonl`

| Field | Type | Description |
|---|---|---|
| `question_id` | int | The question identifier |
| `model` | str | The evaluated model's identifier |
| `judge` | tuple[str, str] | The judge model name and prompt template name |
| `user_prompt` | str | The prompt sent to the judge |
| `judgment` | str | The judge's full text response |
| `score` | float | Extracted score (1-10), or -1 on extraction failure |
| `turn` | int | 1 for first turn, 2 for multi-turn evaluation |
| `tstamp` | float | Unix timestamp |
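Records with this shape are easy to aggregate downstream; the sketch below averages `score` per model while skipping the -1 extraction failures (`mean_scores` and the sample records are illustrative, not part of FastChat):

```python
from collections import defaultdict

def mean_scores(records):
    """Average judgment scores per model, ignoring -1 extraction failures."""
    by_model = defaultdict(list)
    for rec in records:
        if rec["score"] != -1:
            by_model[rec["model"]].append(rec["score"])
    return {model: sum(s) / len(s) for model, s in by_model.items()}

# Fabricated judgment records in the single-answer output format.
records = [
    {"model": "vicuna-7b-v1.5", "score": 8.0},
    {"model": "vicuna-7b-v1.5", "score": 6.0},
    {"model": "llama-2-7b-chat", "score": -1},  # extraction failure, skipped
    {"model": "llama-2-7b-chat", "score": 5.0},
]
averages = mean_scores(records)
```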
Pairwise Mode
Output file: `data/mt_bench/model_judgment/{judge_model}_pair.jsonl`

| Field | Type | Description |
|---|---|---|
| `question_id` | int | The question identifier |
| `model_1` | str | First model identifier |
| `model_2` | str | Second model identifier |
| `g1_winner` | str | Winner from game 1 (`"model_1"`, `"model_2"`, `"tie"`, or `"error"`) |
| `g2_winner` | str | Winner from game 2 (positions swapped) |
| `judge` | tuple[str, str] | The judge model name and prompt template name |
| `g1_user_prompt` | str | Prompt for game 1 |
| `g1_judgment` | str | Judge's response for game 1 |
| `g2_user_prompt` | str | Prompt for game 2 |
| `g2_judgment` | str | Judge's response for game 2 |
| `turn` | int | 1 for first turn, 2 for multi-turn evaluation |
| `tstamp` | float | Unix timestamp |
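Since the two games swap positions to mitigate position bias, consumers of this file must reconcile `g1_winner` and `g2_winner`. One reasonable policy, shown here as an illustration rather than as FastChat's own aggregation rule, is to award a win only when both games agree:

```python
def resolve_pair(g1_winner: str, g2_winner: str) -> str:
    """Declare a winner only when both position-swapped games agree."""
    if "error" in (g1_winner, g2_winner):
        return "error"
    if g1_winner == g2_winner:
        return g1_winner  # consistent verdict across both orderings
    return "tie"          # disagreement suggests position bias; call it a tie
```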
Usage Examples
Single-Answer Grading with Parallel API Calls
```shell
python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --mode single \
    --parallel 8
```
This evaluates two models using GPT-4 as a single-answer judge with 8 concurrent API calls. For 80 questions, 2 models, and 2 turns (first + multi-turn), this creates 320 total matches.
Pairwise Baseline Comparison
```shell
python3 -m fastchat.llm_judge.gen_judgment \
    --model-list vicuna-7b-v1.5 llama-2-7b-chat \
    --judge-model gpt-4 \
    --baseline-model gpt-3.5-turbo \
    --mode pairwise-baseline \
    --parallel 4
```
Each model is compared against the GPT-3.5-Turbo baseline. Each comparison runs two games (swapped positions) to mitigate position bias.
Programmatic Access to Judge Results
```python
from fastchat.llm_judge.common import load_single_model_judgments

judgments = load_single_model_judgments("data/mt_bench/model_judgment/gpt-4_single.jsonl")
for judge_key, results in judgments.items():
    print(f"Judge: {judge_key}")
    for game_key, result in results.items():
        qid, model = game_key
        print(f"  Q{qid} ({model}): score={result['score']}")
```
Related Pages
- Principle:Lm_sys_FastChat_LLM_Judge_Evaluation -- The principle this implementation realizes
- Implementation:Lm_sys_FastChat_Gen_Model_Answer -- The preceding step: generating model answers
- Implementation:Lm_sys_FastChat_Show_Result -- The subsequent step: displaying aggregated results
- Environment:Lm_sys_FastChat_API_Keys_And_Credentials