Principle: Ragas LLM as Judge Metric
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| explodinggradients/ragas | LLM Evaluation, Metric Design | 2026-02-10 |
Overview
LLM as Judge Metric is the principle of using Large Language Models as evaluators that classify text outputs into discrete categories based on structured evaluation rubrics defined through prompt templates.
Description
Traditional software metrics rely on deterministic computations, but evaluating LLM outputs often requires subjective judgment about quality, correctness, or relevance. The LLM-as-Judge pattern addresses this by delegating evaluation to an LLM itself, constrained by structured output schemas that force classification into predefined categories.
Discrete Classification: The metric constrains the LLM judge to output one of a predefined set of allowed values (such as ["pass", "fail"], ["excellent", "good", "poor"], or any custom set of categories). This discrete classification approach converts subjective LLM judgment into quantifiable, comparable results. The constraint is enforced through a dynamically generated Pydantic response model that uses Literal types to restrict possible output values.
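The dynamic constraint described above can be sketched with the standard library alone. This is an illustrative stand-in, not ragas's actual implementation: ragas builds a Pydantic response model, whereas here a `typing.Literal` type is generated at runtime and checked by hand.

```python
from typing import Literal, get_args

def make_judgment_type(allowed_values):
    """Dynamically build a Literal type restricted to the allowed
    categories (ragas achieves the same constraint through a
    generated Pydantic response model)."""
    return Literal[tuple(allowed_values)]

def check_judgment(value, judgment_type):
    """Reject any judge output that falls outside the allowed set."""
    allowed = get_args(judgment_type)
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of {allowed}")
    return value

# A binary pass/fail rubric; any custom category set works the same way.
PassFail = make_judgment_type(["pass", "fail"])
```

Subscripting `Literal` with a tuple at runtime is equivalent to listing the values directly, which is what makes the category set configurable per metric.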
Prompt Templates with Placeholders: Evaluation criteria are defined through prompt templates containing {placeholder} variables. At scoring time, these placeholders are filled with the actual evaluation inputs (such as the user query, the LLM response, and reference answers). This design allows a single metric definition to be applied across many different evaluation scenarios simply by changing the prompt text.
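Placeholder filling is plain string formatting. The rubric text below is a hypothetical example, not a prompt shipped by ragas:

```python
# A rubric template with {placeholder} variables, filled at scoring time.
template = (
    "Check if the response answers the user's query.\n"
    "Query: {query}\n"
    "Response: {response}\n"
    "Answer 'pass' or 'fail'."
)

# The same metric definition applies to any scenario; only the
# inputs substituted into the placeholders change.
prompt = template.format(
    query="What is Python?",
    response="Python is a programming language.",
)
```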
Structured Output with Reasoning: The LLM judge returns not just a classification value but also a reasoning explanation. The response model always includes both a value field (the discrete classification) and a reason field (the LLM's explanation for its judgment). This dual output provides both a machine-readable score and a human-readable justification, enabling debugging and quality assurance of the evaluation process.
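The dual output can be modeled as a small record type. The class name here is hypothetical; the `value`/`reason` field names follow the convention described above:

```python
from dataclasses import dataclass

@dataclass
class JudgmentResult:
    """The judge's dual output: a machine-readable classification
    plus a human-readable justification."""
    value: str   # discrete classification, e.g. "pass" or "fail"
    reason: str  # the judge's explanation, useful for debugging

result = JudgmentResult(
    value="fail",
    reason="The response does not address the user's query.",
)
```

Keeping the reason alongside the value is what makes spot-checking the judge's behavior practical: a surprising score can be traced to the stated justification.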
Correlation with Human Judgment: The metric provides a get_correlation() method that computes Cohen's Kappa score between gold-standard human labels and the LLM's predictions. This enables quantitative validation of whether the LLM judge agrees with human evaluators, which is essential for establishing trust in automated evaluation.
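Cohen's Kappa corrects raw agreement for the agreement expected by chance. A from-scratch sketch of the statistic that `get_correlation()` reports (the production implementation may differ):

```python
from collections import Counter

def cohen_kappa(gold, pred):
    """Inter-rater agreement between gold labels and judge
    predictions, corrected for chance agreement."""
    n = len(gold)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(g == p for g, p in zip(gold, pred)) / n
    # Expected agreement: chance overlap given each rater's label frequencies.
    gold_freq, pred_freq = Counter(gold), Counter(pred)
    expected = sum(
        (gold_freq[label] / n) * (pred_freq[label] / n)
        for label in set(gold) | set(pred)
    )
    return (observed - expected) / (1 - expected)
```

For example, with `gold = ["pass", "pass", "fail", "fail"]` and `pred = ["pass", "pass", "fail", "pass"]`, observed agreement is 0.75 and chance agreement is 0.5, giving a Kappa of 0.5.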
Persistence and Loading: Metric configurations (including prompts, allowed values, and response model schemas) can be serialized to JSON files and loaded back, enabling metric sharing across teams and reproducible evaluation setups.
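A minimal round-trip sketch of this idea. The config layout below is hypothetical; the actual on-disk schema is ragas's own:

```python
import json
import os
import tempfile

# Assumed fields only: the real serialized form may include more
# (e.g. the response model schema).
metric_config = {
    "name": "response_correctness",
    "prompt": "Evaluate the response: {response}. Answer 'pass' or 'fail'.",
    "allowed_values": ["pass", "fail"],
}

def save_metric(config, path):
    """Serialize a metric definition so teams can share and version it."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)

def load_metric(path):
    """Restore a metric definition for a reproducible evaluation setup."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "response_correctness.json")
save_metric(metric_config, path)
loaded = load_metric(path)
```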
Usage
Use the LLM as Judge Metric principle when:
- Evaluating LLM outputs on subjective criteria like correctness, relevance, or quality
- Needing binary pass/fail assessments or multi-category classifications
- Requiring both a score and an explanation for each evaluation
- Validating LLM judge accuracy against human-labeled gold standards
- Building reusable evaluation rubrics that can be shared and versioned
Theoretical Basis
The theoretical foundation combines constrained LLM generation with inter-rater reliability measurement:
PROCEDURE llm_as_judge(prompt_template, allowed_values, llm, **inputs):
    1. Define the response schema:
       Create a Pydantic model with:
         value: Literal[allowed_values]  -- constrained to predefined categories
         reason: str                     -- explanation for the judgment
    2. Prepare the evaluation prompt:
       Fill prompt_template placeholders with input values
       Example: "Evaluate the response: {response}" becomes
                "Evaluate the response: Python is a language."
    3. Generate the structured judgment:
       Call llm.generate(prompt, response_model=ResponseSchema)
       The LLM is constrained to return only valid values
    4. Package the result:
       Return MetricResult(value=response.value, reason=response.reason)

PROCEDURE validate_judge(gold_labels, predictions):
    1. Compute Cohen's Kappa between gold_labels and predictions
    2. Return the inter-rater agreement score
       Kappa > 0.6 indicates substantial agreement
       Kappa > 0.8 indicates almost perfect agreement
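The first procedure can be exercised end to end with a stub in place of a real LLM. Everything here is a sketch: `stub_judge` and its keyword-matching logic are stand-ins for an actual constrained `llm.generate` call.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    value: str
    reason: str

def llm_as_judge(prompt_template, allowed_values, llm, **inputs):
    """Sketch of the procedure above; `llm` is any callable that
    takes a prompt and returns a (value, reason) pair."""
    prompt = prompt_template.format(**inputs)            # step 2
    value, reason = llm(prompt)                          # step 3
    if value not in allowed_values:                      # step 1's constraint
        raise ValueError(f"judge returned {value!r}, "
                         f"expected one of {allowed_values}")
    return MetricResult(value=value, reason=reason)      # step 4

def stub_judge(prompt):
    # Stand-in for a real LLM call: passes anything mentioning "Python".
    if "Python" in prompt:
        return "pass", "The response mentions the expected topic."
    return "fail", "The response is off-topic."

result = llm_as_judge(
    "Evaluate the response: {response}",
    ["pass", "fail"],
    stub_judge,
    response="Python is a language.",
)
```

Swapping `stub_judge` for a real model call leaves the rest of the flow unchanged, which is what makes the pattern testable without an LLM in the loop.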
This approach transforms subjective evaluation into a structured, reproducible process where the LLM judge's quality can be quantitatively measured against human baselines.