
Principle:Lm sys FastChat LLM Judge Evaluation

From Leeroopedia


Page Type: Principle
Title: LLM Judge Evaluation
Repository: lm-sys/FastChat
Knowledge Sources: Source code analysis of fastchat/llm_judge/gen_judgment.py, fastchat/llm_judge/common.py
Domains: LLM Evaluation, Automated Grading, LLM-as-a-Judge
Last Updated: 2026-02-07 14:00 GMT
Implemented By: Implementation:Lm_sys_FastChat_Gen_Judgment

Overview

LLM Judge Evaluation is the principle governing the use of a strong language model (the "judge") to automatically assess the quality of answers produced by candidate models. Rather than relying on human annotators, this approach leverages an LLM -- typically GPT-4 -- to grade model outputs using structured prompt templates. The system supports two evaluation paradigms: single-answer grading (absolute scoring on a 1-10 scale) and pairwise comparison (relative A/B/tie judgments between two models). This principle is the core innovation of the MT-Bench evaluation framework.

Description

Single-Answer Grading

In single-answer mode, the judge LLM receives a question-answer pair and produces an absolute score from 1 to 10. The judge prompt template instructs the model to:

  • Evaluate the quality, helpfulness, relevance, accuracy, depth, and creativity of the response
  • Provide a brief explanation of the rating
  • Output the score in a structured format: [[rating]]

The score is extracted programmatically using regex patterns (one_score_pattern and one_score_pattern_backup). If the regex fails to match, a score of -1 is assigned to indicate an extraction error.

Pairwise Comparison

In pairwise mode, the judge receives the same question but two different model answers (labeled A and B) and must determine which is better. The output format supports:

  • A: Model A's answer is better
  • B: Model B's answer is better
  • C: The answers are tied

Alternatively, some prompt templates use a dual-score format ([[rating_a,rating_b]]), where the judge assigns individual scores and the winner is determined by comparing them (with a tie delta of 0.1).
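The dual-score resolution described above can be sketched as follows; the constant name and function are illustrative, not FastChat's exact API, but the 0.1 tie delta matches the behavior described:

```python
# Sketch of dual-score winner resolution, assuming a 0.1 tie delta
# (TIE_DELTA and the function name are illustrative, not FastChat's API).
TIE_DELTA = 0.1

def resolve_dual_score(rating_a: float, rating_b: float) -> str:
    """Return 'A', 'B', or 'tie' by comparing the two judge scores."""
    if abs(rating_a - rating_b) <= TIE_DELTA:
        return "tie"
    return "A" if rating_a > rating_b else "B"
```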

Judge Prompt Templates

The system uses a structured prompt template system loaded from a JSONL file (data/judge_prompts.jsonl). Each template contains:

  • system_prompt: Sets the judge's role and evaluation criteria
  • prompt_template: The user-facing prompt with placeholders for questions, answers, and optional reference answers
  • output_format: Specifies the expected output structure ([[rating]] for single scores or [[A]] for pairwise verdicts)
  • type: Either "single" or "pairwise"
  • name: Template identifier (e.g., single-v1, pair-v2, single-math-v1)
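A minimal sketch of loading such templates, keyed by the name field. The sample record below is illustrative and not copied from data/judge_prompts.jsonl, but its fields mirror the template structure listed above:

```python
import json

# Sample JSONL record (illustrative; fields mirror the template structure).
sample_jsonl = (
    '{"name": "single-v1", "type": "single", '
    '"system_prompt": "You are a helpful assistant.", '
    '"prompt_template": "[Question]\\n{question}\\n[Answer]\\n{answer}", '
    '"output_format": "[[rating]]"}\n'
)

def load_judge_prompts(lines):
    """Parse JSONL lines into a dict mapping template name -> template."""
    prompts = {}
    for line in lines:
        if line.strip():
            record = json.loads(line)
            prompts[record["name"]] = record
    return prompts

templates = load_judge_prompts(sample_jsonl.splitlines())
```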

The system maintains separate templates for:

  • Default questions vs. math/reasoning questions (which require reference answers)
  • Single-turn vs. multi-turn evaluation
  • Single-answer vs. pairwise grading

This yields four judge configurations per mode: default, math, default-mt, and math-mt.

Reference-Based Grading for Math/Reasoning

Categories that have objectively verifiable answers -- specifically math, reasoning, coding, and arena-hard-200 (defined in NEED_REF_CATS) -- use reference-based grading. In this mode, the judge prompt includes a reference answer (typically from GPT-4) alongside the model's response. This allows the judge to evaluate factual correctness against a known-good solution, rather than relying solely on subjective quality assessment.
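The category check can be sketched like this; NEED_REF_CATS matches the categories named above, and the math template name follows the naming convention listed earlier (e.g., single-math-v1), though the helper function itself is illustrative:

```python
# Categories with objectively verifiable answers use reference-based grading.
NEED_REF_CATS = ["math", "reasoning", "coding", "arena-hard-200"]

def pick_template(category: str, base: str = "single-v1") -> str:
    """Use the reference-based math template for verifiable categories."""
    if category in NEED_REF_CATS:
        return "single-math-v1"
    return base
```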

Position Bias Mitigation

A well-documented issue with LLM-as-a-judge is position bias: the tendency for judges to favor whichever answer appears first (position A). The MT-Bench pairwise evaluation mitigates this by running two games per comparison:

  1. Game 1: Model 1's answer is placed at position A, Model 2's at position B
  2. Game 2: The positions are swapped -- Model 2's answer is at position A, Model 1's at position B

The final winner is determined by comparing both games:

  • If both games agree on the winner (after mapping back to the original model identities), that model wins
  • If the games disagree (each favoring the model in position A), the result is treated as a tie

This symmetric evaluation design significantly reduces position bias artifacts.
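The two-game aggregation rule above reduces to a simple comparison once each game's verdict is mapped back to the original model identities; this sketch assumes that mapping has already been done:

```python
# Sketch of two-game position-bias aggregation. g1_winner is the verdict
# with model 1 at position A; g2_winner is the verdict with positions
# swapped. Both are expressed in terms of the original model identities.
def combine_two_games(g1_winner: str, g2_winner: str) -> str:
    """Return the final winner, treating any disagreement as a tie."""
    if g1_winner == g2_winner:
        return g1_winner  # both games agree on the same model (or on a tie)
    return "tie"          # disagreement (e.g., position bias) -> tie
```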

Score Extraction via Regex

Scores and winners are extracted from the judge's free-text judgment using regular expressions:

  • Single-answer: \[\[(\d+\.?\d*)\]\] matches scores like [[7]] or [[8.5]]
  • Pairwise (dual-score): \[\[(\d+\.?\d*),\s?(\d+\.?\d*)\]\] matches score pairs like [[8, 7]]
  • Pairwise (winner): Direct string matching for [[A]], [[B]], or [[C]]

Backup patterns without double brackets are tried if the primary patterns fail.
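The primary-then-backup extraction can be sketched as follows; the pattern strings are paraphrased from the description above (a single-bracket backup is assumed), and -1 signals an extraction failure as described:

```python
import re

# Primary pattern requires double brackets; the backup accepts single
# brackets (assumed form, mirroring one_score_pattern / its backup).
one_score_pattern = re.compile(r"\[\[(\d+\.?\d*)\]\]")
one_score_pattern_backup = re.compile(r"\[(\d+\.?\d*)\]")

def extract_score(judgment: str) -> float:
    """Return the judge's score, or -1 if neither pattern matches."""
    match = one_score_pattern.search(judgment)
    if not match:
        match = one_score_pattern_backup.search(judgment)
    return float(match.group(1)) if match else -1
```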

Multi-Turn Evaluation

MT-Bench evaluates both turns of the conversation. The system generates separate judgments for:

  • Turn 1 only: The judge sees only the first question and first answer
  • Both turns (multi-turn): The judge sees both question-answer pairs and evaluates the second turn in context of the first

This is controlled via the multi_turn flag, which selects the appropriate prompt template (e.g., single-v1-multi-turn instead of single-v1).
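Following the naming convention above (a -multi-turn suffix on the base template name), the selection reduces to a one-line helper; the function name is illustrative:

```python
# Select the judge prompt template based on the multi_turn flag,
# following the "-multi-turn" suffix convention described above.
def select_template(base: str, multi_turn: bool) -> str:
    return f"{base}-multi-turn" if multi_turn else base
```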

Usage

The LLM Judge Evaluation principle is applied in the second phase of the MT-Bench workflow:

  1. Generate answers first: Complete the answer generation phase (Principle:Lm_sys_FastChat_MT_Bench_Answer_Generation) for all models to be evaluated.
  2. Configure the judge: Select the judge model (default: GPT-4), evaluation mode (single, pairwise-baseline, or pairwise-all), and optionally a baseline model for pairwise comparisons.
  3. Run judgment generation: The system creates match objects pairing questions with model answers (and optionally reference answers), then calls the judge API for each match.
  4. Review outputs: Judgment files contain the raw judge text, extracted scores/winners, and metadata. These feed into the result display phase (Principle:Lm_sys_FastChat_MT_Bench_Result_Display).
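Step 3 above (creating match objects) can be sketched for single-answer mode as follows; the field names and data shapes are assumptions for illustration, not FastChat's exact internal structures:

```python
# Sketch of building single-answer match objects: each match pairs a
# question with one model's answer and, for categories that need it,
# a reference answer. Field names are illustrative assumptions.
def make_single_matches(questions, answers, references, need_ref_cats):
    matches = []
    for q in questions:
        ref = (references.get(q["question_id"])
               if q["category"] in need_ref_cats else None)
        for model, model_answers in answers.items():
            matches.append({
                "question": q,
                "model": model,
                "answer": model_answers[q["question_id"]],
                "ref_answer": ref,
            })
    return matches
```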

Theoretical Basis

The LLM-as-a-Judge approach is grounded in several key research insights:

  • Scalability over human evaluation: Human evaluation is the gold standard but is expensive, slow, and difficult to reproduce. Using a strong LLM as a proxy judge enables rapid, repeatable evaluation at scale.
  • High correlation with human preferences: Research (Zheng et al., 2023) has demonstrated that GPT-4 judgments achieve over 80% agreement with human preferences, comparable to inter-annotator agreement among humans.
  • Position bias is a known confound: LLMs exhibit systematic position bias in pairwise evaluations. The two-game swapping strategy is a principled debiasing technique that treats disagreements as ties, reducing false positives.
  • Reference-grounded evaluation: For tasks with objectively correct answers (math, coding), providing reference answers to the judge significantly improves evaluation accuracy by anchoring the assessment to ground truth.
  • Structured output extraction: Requiring the judge to produce scores in a specific format (e.g., [[rating]]) and extracting them via regex is more robust than parsing free-text judgments, though fallback patterns are needed to handle format deviations.
