
Principle:Lm sys FastChat LLM Judge Evaluation

From Leeroopedia


Page Type: Principle
Title: LLM Judge Evaluation
Repository: lm-sys/FastChat
Knowledge Sources: Source code analysis of fastchat/llm_judge/gen_judgment.py, fastchat/llm_judge/common.py
Domains: LLM Evaluation, Automated Grading, LLM-as-a-Judge
Last Updated: 2026-02-07 14:00 GMT
Implemented By: Implementation:Lm_sys_FastChat_Gen_Judgment

Overview

LLM Judge Evaluation is the principle governing the use of a strong language model (the "judge") to automatically assess the quality of answers produced by candidate models. Rather than relying on human annotators, this approach leverages an LLM -- typically GPT-4 -- to grade model outputs using structured prompt templates. The system supports two evaluation paradigms: single-answer grading (absolute scoring on a 1-10 scale) and pairwise comparison (relative A/B/tie judgments between two models). This principle is the core innovation of the MT-Bench evaluation framework.

Description

Single-Answer Grading

In single-answer mode, the judge LLM receives a question-answer pair and produces an absolute score from 1 to 10. The judge prompt template instructs the model to:

  • Evaluate the quality, helpfulness, relevance, accuracy, depth, and creativity of the response
  • Provide a brief explanation of the rating
  • Output the score in a structured format: [[rating]]

The score is extracted programmatically using regex patterns (one_score_pattern and one_score_pattern_backup). If the regex fails to match, a score of -1 is assigned to indicate an extraction error.

Pairwise Comparison

In pairwise mode, the judge receives the same question but two different model answers (labeled A and B) and must determine which is better. The output format supports:

  • A: Model A's answer is better
  • B: Model B's answer is better
  • C: The answers are tied

Alternatively, some prompt templates use a dual-score format ([[rating_a,rating_b]]), where the judge assigns individual scores and the winner is determined by comparing them (with a tie delta of 0.1).
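The dual-score resolution described above can be sketched as follows; the constant name and function are illustrative, not FastChat's exact API, but the 0.1 tie delta matches the behavior described:

```python
# Sketch of dual-score winner resolution, assuming a 0.1 tie delta
# (TIE_DELTA and the function name are illustrative, not FastChat's API).
TIE_DELTA = 0.1

def resolve_dual_score(rating_a: float, rating_b: float) -> str:
    """Return 'A', 'B', or 'tie' by comparing the two judge scores."""
    if abs(rating_a - rating_b) <= TIE_DELTA:
        return "tie"
    return "A" if rating_a > rating_b else "B"
```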

Judge Prompt Templates

The system uses a structured prompt template system loaded from a JSONL file (data/judge_prompts.jsonl). Each template contains:

  • system_prompt: Sets the judge's role and evaluation criteria
  • prompt_template: The user-facing prompt with placeholders for questions, answers, and optional reference answers
  • output_format: Specifies the expected output structure ([[rating]] for single scores or [[A]] for pairwise verdicts)
  • type: Either "single" or "pairwise"
  • name: Template identifier (e.g., single-v1, pair-v2, single-math-v1)
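A minimal sketch of loading such templates, keyed by the name field. The sample record below is illustrative and not copied from data/judge_prompts.jsonl, but its fields mirror the template structure listed above:

```python
import json

# Sample JSONL record (illustrative; fields mirror the template structure).
sample_jsonl = (
    '{"name": "single-v1", "type": "single", '
    '"system_prompt": "You are a helpful assistant.", '
    '"prompt_template": "[Question]\\n{question}\\n[Answer]\\n{answer}", '
    '"output_format": "[[rating]]"}\n'
)

def load_judge_prompts(lines):
    """Parse JSONL lines into a dict mapping template name -> template."""
    prompts = {}
    for line in lines:
        if line.strip():
            record = json.loads(line)
            prompts[record["name"]] = record
    return prompts

templates = load_judge_prompts(sample_jsonl.splitlines())
```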

The system maintains separate templates for:

  • Default questions vs. math/reasoning questions (which require reference answers)
  • Single-turn vs. multi-turn evaluation
  • Single-answer vs. pairwise grading

This yields four judge configurations per mode: default, math, default-mt, and math-mt.

Reference-Based Grading for Math/Reasoning

Categories that have objectively verifiable answers -- specifically math, reasoning, coding, and arena-hard-200 (defined in NEED_REF_CATS) -- use reference-based grading. In this mode, the judge prompt includes a reference answer (typically from GPT-4) alongside the model's response. This allows the judge to evaluate factual correctness against a known-good solution, rather than relying solely on subjective quality assessment.
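The category check can be sketched like this; NEED_REF_CATS matches the categories named above, and the math template name follows the naming convention listed earlier (e.g., single-math-v1), though the helper function itself is illustrative:

```python
# Categories with objectively verifiable answers use reference-based grading.
NEED_REF_CATS = ["math", "reasoning", "coding", "arena-hard-200"]

def pick_template(category: str, base: str = "single-v1") -> str:
    """Use the reference-based math template for verifiable categories."""
    if category in NEED_REF_CATS:
        return "single-math-v1"
    return base
```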

Position Bias Mitigation

A well-documented issue with LLM-as-a-judge is position bias: the tendency for judges to favor whichever answer appears first (position A). The MT-Bench pairwise evaluation mitigates this by running two games per comparison:

  1. Game 1: Model 1's answer is placed at position A, Model 2's at position B
  2. Game 2: The positions are swapped -- Model 2's answer is at position A, Model 1's at position B

The final winner is determined by comparing both games:

  • If both games agree on the winner (after mapping back to the original model identities), that model wins
  • If the games disagree (each favoring the model in position A), the result is treated as a tie

This symmetric evaluation design significantly reduces position bias artifacts.
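The two-game aggregation rule above reduces to a simple comparison once each game's verdict is mapped back to the original model identities; this sketch assumes that mapping has already been done:

```python
# Sketch of two-game position-bias aggregation. g1_winner is the verdict
# with model 1 at position A; g2_winner is the verdict with positions
# swapped. Both are expressed in terms of the original model identities.
def combine_two_games(g1_winner: str, g2_winner: str) -> str:
    """Return the final winner, treating any disagreement as a tie."""
    if g1_winner == g2_winner:
        return g1_winner  # both games agree on the same model (or on a tie)
    return "tie"          # disagreement (e.g., position bias) -> tie
```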

Score Extraction via Regex

Scores and winners are extracted from the judge's free-text judgment using regular expressions:

  • Single-answer: \[\[(\d+\.?\d*)\]\] matches scores like [[7]] or [[8.5]]
  • Pairwise (dual-score): \[\[(\d+\.?\d*),\s?(\d+\.?\d*)\]\] matches score pairs like [[8, 7]]
  • Pairwise (winner): Direct string matching for [[A]], [[B]], or [[C]]

Backup patterns without double brackets are tried if the primary patterns fail.
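The primary-then-backup extraction can be sketched as follows; the pattern strings are paraphrased from the description above (a single-bracket backup is assumed), and -1 signals an extraction failure as described:

```python
import re

# Primary pattern requires double brackets; the backup accepts single
# brackets (assumed form, mirroring one_score_pattern / its backup).
one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")

def extract_score(judgment: str) -> float:
    """Return the judge's score, or -1 if neither pattern matches."""
    match = one_score_pattern.search(judgment)
    if not match:
        match = one_score_pattern_backup.search(judgment)
    return float(match.group(1)) if match else -1
```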

Multi-Turn Evaluation

MT-Bench evaluates both turns of the conversation. The system generates separate judgments for:

  • Turn 1 only: The judge sees only the first question and first answer
  • Both turns (multi-turn): The judge sees both question-answer pairs and evaluates the second turn in context of the first

This is controlled via the multi_turn flag, which selects the appropriate prompt template (e.g., single-v1-multi-turn instead of single-v1).
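Following the naming convention above (a -multi-turn suffix on the base template name), the selection reduces to a one-line helper; the function name is illustrative:

```python
# Select the judge prompt template based on the multi_turn flag,
# following the "-multi-turn" suffix convention described above.
def select_template(base: str, multi_turn: bool) -> str:
    return f"{base}-multi-turn" if multi_turn else base
```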

Usage

The LLM Judge Evaluation principle is applied in the second phase of the MT-Bench workflow:

  1. Generate answers first: Complete the answer generation phase (Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation) for all models to be evaluated.
  2. Configure the judge: Select the judge model (default: GPT-4), evaluation mode (single, pairwise-baseline, or pairwise-all), and optionally a baseline model for pairwise comparisons.
  3. Run judgment generation: The system creates match objects pairing questions with model answers (and optionally reference answers), then calls the judge API for each match.
  4. Review outputs: Judgment files contain the raw judge text, extracted scores/winners, and metadata. These feed into the result display phase (Principle:Lm_sys_FastChat_MT_Bench_Result_Display).
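Step 3 above (creating match objects) can be sketched for single-answer mode as follows; the field names and data shapes are assumptions for illustration, not FastChat's exact internal structures:

```python
# Sketch of building single-answer match objects: each match pairs a
# question with one model's answer and, for categories that need it,
# a reference answer. Field names are illustrative assumptions.
def make_single_matches(questions, answers, references, need_ref_cats):
    matches = []
    for q in questions:
        ref = (references.get(q["question_id"])
               if q["category"] in need_ref_cats else None)
        for model, model_answers in answers.items():
            matches.append({
                "question": q,
                "model": model,
                "answer": model_answers[q["question_id"]],
                "ref_answer": ref,
            })
    return matches
```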

Theoretical Basis

The LLM-as-a-Judge approach is grounded in several key research insights:

  • Scalability over human evaluation: Human evaluation is the gold standard but is expensive, slow, and difficult to reproduce. Using a strong LLM as a proxy judge enables rapid, repeatable evaluation at scale.
  • High correlation with human preferences: Research (Zheng et al., 2023) has demonstrated that GPT-4 judgments achieve over 80% agreement with human preferences, comparable to inter-annotator agreement among humans.
  • Position bias is a known confound: LLMs exhibit systematic position bias in pairwise evaluations. The two-game swapping strategy is a principled debiasing technique that treats disagreements as ties, reducing false positives.
  • Reference-grounded evaluation: For tasks with objectively correct answers (math, coding), providing reference answers to the judge significantly improves evaluation accuracy by anchoring the assessment to ground truth.
  • Structured output extraction: Requiring the judge to produce scores in a specific format (e.g., [[rating]]) and extracting them via regex is more robust than parsing free-text judgments, though fallback patterns are needed to handle format deviations.
