# Principle:Lm_sys_FastChat_LLM_Judge_Evaluation
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | LLM Judge Evaluation |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source code analysis of fastchat/llm_judge/gen_judgment.py, fastchat/llm_judge/common.py |
| Domains | LLM Evaluation, Automated Grading, LLM-as-a-Judge |
| Last Updated | 2026-02-07 14:00 GMT |
| Implemented By | Implementation:Lm_sys_FastChat_Gen_Judgment |
## Overview
LLM Judge Evaluation is the principle governing the use of a strong language model (the "judge") to automatically assess the quality of answers produced by candidate models. Rather than relying on human annotators, this approach leverages an LLM -- typically GPT-4 -- to grade model outputs using structured prompt templates. The system supports two evaluation paradigms: single-answer grading (absolute scoring on a 1-10 scale) and pairwise comparison (relative A/B/tie judgments between two models). This principle is the core innovation of the MT-Bench evaluation framework.
## Description
### Single-Answer Grading
In single-answer mode, the judge LLM receives a question-answer pair and produces an absolute score from 1 to 10. The judge prompt template instructs the model to:
- Evaluate the quality, helpfulness, relevance, accuracy, depth, and creativity of the response
- Provide a brief explanation of the rating
- Output the score in a structured format: `[[rating]]` (e.g., `[[8]]`)
The score is extracted programmatically using regex patterns (`one_score_pattern` and `one_score_pattern_backup`). If neither regex matches, a score of -1 is assigned to indicate an extraction error.
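The extraction step can be sketched as follows. The two pattern names mirror those in fastchat/llm_judge/common.py; the wrapper function `extract_single_score` is illustrative, not the library's exact code.

```python
import re

# Primary pattern expects double brackets, e.g. "[[8.5]]"; the backup
# tolerates single brackets, e.g. "[8]". Names follow common.py.
one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")

def extract_single_score(judgment: str) -> float:
    """Return the judge's 1-10 rating, or -1 when extraction fails."""
    match = one_score_pattern.search(judgment)
    if match is None:
        match = one_score_pattern_backup.search(judgment)
    if match is None:
        return -1  # sentinel value signalling an extraction error
    return float(match.group(1))
```

The -1 sentinel lets downstream aggregation filter out judgments whose format deviated from the template.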
### Pairwise Comparison
In pairwise mode, the judge receives the same question but two different model answers (labeled A and B) and must determine which is better. The output format supports three verdicts: `[[A]]` (assistant A is better), `[[B]]` (assistant B is better), or `[[C]]` (tie).
Alternatively, some prompt templates use a dual-score format (`[[rating_a,rating_b]]`), where the judge assigns individual scores and the winner is determined by comparing them (with a tie delta of 0.1).
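The dual-score rule can be sketched as below. The function name and the exact boundary comparison are assumptions; only the 0.1 tie delta comes from the source.

```python
TIE_DELTA = 0.1  # score gap at or below which the result counts as a tie (per the text above)

def winner_from_scores(rating_a: float, rating_b: float) -> str:
    """Map a dual-score judgment to a verdict: "A", "B", or "tie". Illustrative sketch."""
    if abs(rating_a - rating_b) <= TIE_DELTA:
        return "tie"
    return "A" if rating_a > rating_b else "B"
```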
### Judge Prompt Templates
The system uses a structured prompt template system loaded from a JSONL file (data/judge_prompts.jsonl). Each template contains:
- `system_prompt`: Sets the judge's role and evaluation criteria
- `prompt_template`: The user-facing prompt with placeholders for questions, answers, and optional reference answers
- `output_format`: Specifies the expected output structure (`[[rating]]` or `[[A]]`)
- `type`: Either `"single"` or `"pairwise"`
- `name`: Template identifier (e.g., `single-v1`, `pair-v2`, `single-math-v1`)
The system maintains separate templates for:
- Default questions vs. math/reasoning questions (which require reference answers)
- Single-turn vs. multi-turn evaluation
- Single-answer vs. pairwise grading
This yields four judge configurations per mode: default, math, default-mt, and math-mt.
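A loader for such a JSONL template file might look like the following sketch, which indexes templates by their `name` field; the real loader in common.py may differ in detail.

```python
import json

def load_judge_prompts(path: str) -> dict:
    """Load judge prompt templates from a JSONL file, keyed by template name.

    Illustrative sketch: one JSON object per line, skipping blank lines.
    """
    prompts = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                template = json.loads(line)
                prompts[template["name"]] = template
    return prompts
```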
### Reference-Based Grading for Math/Reasoning
Categories that have objectively verifiable answers -- specifically `math`, `reasoning`, `coding`, and `arena-hard-200` (defined in `NEED_REF_CATS`) -- use reference-based grading. In this mode, the judge prompt includes a reference answer (typically from GPT-4) alongside the model's response. This allows the judge to evaluate factual correctness against a known-good solution, rather than relying solely on subjective quality assessment.
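A hypothetical math-grading prompt skeleton, following the placeholder convention described earlier (the section markers and the `ref_answer_1` field name are assumptions for illustration):

```python
# Hypothetical template: the reference answer is injected between the
# question and the candidate answer so the judge can check correctness.
math_prompt_template = (
    "[Question]\n{question}\n\n"
    "[Reference Answer]\n{ref_answer_1}\n\n"
    "[Assistant's Answer]\n{answer}"
)

prompt = math_prompt_template.format(
    question="What is 13 * 7?",
    ref_answer_1="13 * 7 = 91.",
    answer="The product is 91.",
)
```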
### Position Bias Mitigation
A well-documented issue with LLM-as-a-judge is position bias: the tendency for judges to favor whichever answer appears first (position A). The MT-Bench pairwise evaluation mitigates this by running two games per comparison:
- Game 1: Model 1's answer is placed at position A, Model 2's at position B
- Game 2: The positions are swapped -- Model 2's answer is at position A, Model 1's at position B
The final winner is determined by comparing both games:
- If both games agree on the winner (after mapping back to the original model identities), that model wins
- If the games disagree (each favoring the model in position A), the result is treated as a tie
This symmetric evaluation design significantly reduces position bias artifacts.
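The two-game resolution rule can be sketched as follows; the function name is illustrative, and the exact aggregation in FastChat may differ in detail.

```python
def resolve_two_games(game1_winner: str, game2_winner: str) -> str:
    """Combine two position-swapped games into a final verdict.

    Game 1 places model_1 at position A; Game 2 swaps the positions.
    Each input is "A", "B", or "tie". Illustrative sketch of the rule above.
    """
    # Map each game's positional winner back to a model identity.
    map_game1 = {"A": "model_1", "B": "model_2", "tie": "tie"}
    map_game2 = {"A": "model_2", "B": "model_1", "tie": "tie"}
    w1, w2 = map_game1[game1_winner], map_game2[game2_winner]
    if w1 == w2:
        return w1   # both games agree (possibly both ties)
    return "tie"    # disagreement is treated as a tie
```

Note that a judge which always favors position A produces ("A", "A"), which maps to (model_1, model_2) and therefore resolves to a tie rather than a spurious win.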
### Score Extraction via Regex
Scores and winners are extracted from the judge's free-text judgment using regular expressions:
- Single-answer: `\[\[(\d+\.?\d*)\]\]` matches scores like `[[7]]` or `[[8.5]]`
- Pairwise (dual-score): `\[\[(\d+\.?\d*),\s?(\d+\.?\d*)\]\]` matches score pairs like `[[8, 7]]`
- Pairwise (winner): Direct string matching for `[[A]]`, `[[B]]`, or `[[C]]`
Backup patterns without double brackets are tried if the primary patterns fail.
### Multi-Turn Evaluation
MT-Bench evaluates both turns of the conversation. The system generates separate judgments for:
- Turn 1 only: The judge sees only the first question and first answer
- Both turns (multi-turn): The judge sees both question-answer pairs and evaluates the second turn in context of the first
This is controlled via the `multi_turn` flag, which selects the appropriate prompt template (e.g., `single-v1-multi-turn` instead of `single-v1`).
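The template selection reduces to a small helper like the sketch below; the `-multi-turn` suffix convention is taken from the example names in this section, and the helper itself is illustrative.

```python
def select_template_name(base_name: str, multi_turn: bool) -> str:
    """Pick the turn-appropriate judge template for the multi_turn flag."""
    return base_name + "-multi-turn" if multi_turn else base_name
```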
## Usage
The LLM Judge Evaluation principle is applied in the second phase of the MT-Bench workflow:
- Generate answers first: Complete the answer generation phase (Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation) for all models to be evaluated.
- Configure the judge: Select the judge model (default: GPT-4), evaluation mode (single, pairwise-baseline, or pairwise-all), and optionally a baseline model for pairwise comparisons.
- Run judgment generation: The system creates match objects pairing questions with model answers (and optionally reference answers), then calls the judge API for each match.
- Review outputs: Judgment files contain the raw judge text, extracted scores/winners, and metadata. These feed into the result display phase (Principle:Lm_sys_FastChat_MT_Bench_Result_Display).
## Theoretical Basis
The LLM-as-a-Judge approach is grounded in several key research insights:
- Scalability over human evaluation: Human evaluation is the gold standard but is expensive, slow, and difficult to reproduce. Using a strong LLM as a proxy judge enables rapid, repeatable evaluation at scale.
- High correlation with human preferences: Research (Zheng et al., 2023) has demonstrated that GPT-4 judgments achieve over 80% agreement with human preferences, comparable to inter-annotator agreement among humans.
- Position bias is a known confound: LLMs exhibit systematic position bias in pairwise evaluations. The two-game swapping strategy is a principled debiasing technique that treats disagreements as ties, reducing false positives.
- Reference-grounded evaluation: For tasks with objectively correct answers (math, coding), providing reference answers to the judge significantly improves evaluation accuracy by anchoring the assessment to ground truth.
- Structured output extraction: Requiring the judge to produce scores in a specific format (e.g., `[[rating]]`) and extracting them via regex is more robust than parsing free-text judgments, though fallback patterns are needed to handle format deviations.
## Related Pages
- Implementation:Lm_sys_FastChat_Gen_Judgment -- The implementation that realizes this principle
- Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation -- The preceding phase that generates the answers to be judged
- Principle:Lm_sys_FastChat_MT_Bench_Result_Display -- The subsequent phase that aggregates and displays judgment results