Principle: Ragas LLM as Judge Metric
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| explodinggradients/ragas | LLM Evaluation, Metric Design | 2026-02-10 |
Overview
LLM as Judge Metric is the principle of using Large Language Models as evaluators that classify text outputs into discrete categories based on structured evaluation rubrics defined through prompt templates.
Description
Traditional software metrics rely on deterministic computations, but evaluating LLM outputs often requires subjective judgment about quality, correctness, or relevance. The LLM-as-Judge pattern addresses this by delegating evaluation to an LLM itself, constrained by structured output schemas that force classification into predefined categories.
Discrete Classification: The metric constrains the LLM judge to output one of a predefined set of allowed values (such as ["pass", "fail"], ["excellent", "good", "poor"], or any custom set of categories). This discrete classification approach converts subjective LLM judgment into quantifiable, comparable results. The constraint is enforced through a dynamically generated Pydantic response model that uses Literal types to restrict possible output values.
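The dynamic constraint described above can be sketched with the standard library alone. This is an illustrative stand-in, not ragas's actual implementation: ragas builds a Pydantic response model, whereas here a `typing.Literal` type is generated at runtime and checked by hand.

```python
from typing import Literal, get_args

def make_judgment_type(allowed_values):
    """Dynamically build a Literal type restricted to the allowed
    categories (ragas achieves the same constraint through a
    generated Pydantic response model)."""
    return Literal[tuple(allowed_values)]

def check_judgment(value, judgment_type):
    """Reject any judge output that falls outside the allowed set."""
    allowed = get_args(judgment_type)
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of {allowed}")
    return value

# A binary pass/fail rubric; any custom category set works the same way.
PassFail = make_judgment_type(["pass", "fail"])
```

Subscripting `Literal` with a tuple at runtime is equivalent to listing the values directly, which is what makes the category set configurable per metric.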
Prompt Templates with Placeholders: Evaluation criteria are defined through prompt templates containing {placeholder} variables. At scoring time, these placeholders are filled with the actual evaluation inputs (such as the user query, the LLM response, and reference answers). This design allows a single metric definition to be applied across many different evaluation scenarios simply by changing the prompt text.
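Placeholder filling is plain string formatting. The rubric text below is a hypothetical example, not a prompt shipped by ragas:

```python
# A rubric template with {placeholder} variables, filled at scoring time.
template = (
    "Check if the response answers the user's query.\n"
    "Query: {query}\n"
    "Response: {response}\n"
    "Answer 'pass' or 'fail'."
)

# The same metric definition applies to any scenario; only the
# inputs substituted into the placeholders change.
prompt = template.format(
    query="What is Python?",
    response="Python is a programming language.",
)
```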
Structured Output with Reasoning: The LLM judge returns not just a classification value but also a reasoning explanation. The response model always includes both a value field (the discrete classification) and a reason field (the LLM's explanation for its judgment). This dual output provides both a machine-readable score and a human-readable justification, enabling debugging and quality assurance of the evaluation process.
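The dual output can be modeled as a small record type. The class name here is hypothetical; the `value`/`reason` field names follow the convention described above:

```python
from dataclasses import dataclass

@dataclass
class JudgmentResult:
    """The judge's dual output: a machine-readable classification
    plus a human-readable justification."""
    value: str   # discrete classification, e.g. "pass" or "fail"
    reason: str  # the judge's explanation, useful for debugging

result = JudgmentResult(
    value="fail",
    reason="The response does not address the user's query.",
)
```

Keeping the reason alongside the value is what makes spot-checking the judge's behavior practical: a surprising score can be traced to the stated justification.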
Correlation with Human Judgment: The metric provides a get_correlation() method that computes Cohen's Kappa score between gold-standard human labels and the LLM's predictions. This enables quantitative validation of whether the LLM judge agrees with human evaluators, which is essential for establishing trust in automated evaluation.
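Cohen's Kappa corrects raw agreement for the agreement expected by chance. A from-scratch sketch of the statistic that `get_correlation()` reports (the production implementation may differ):

```python
from collections import Counter

def cohen_kappa(gold, pred):
    """Inter-rater agreement between gold labels and judge
    predictions, corrected for chance agreement."""
    n = len(gold)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(g == p for g, p in zip(gold, pred)) / n
    # Expected agreement: chance overlap given each rater's label frequencies.
    gold_freq, pred_freq = Counter(gold), Counter(pred)
    expected = sum(
        (gold_freq[label] / n) * (pred_freq[label] / n)
        for label in set(gold) | set(pred)
    )
    return (observed - expected) / (1 - expected)
```

For example, with `gold = ["pass", "pass", "fail", "fail"]` and `pred = ["pass", "pass", "fail", "pass"]`, observed agreement is 0.75 and chance agreement is 0.5, giving a Kappa of 0.5.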
Persistence and Loading: Metric configurations (including prompts, allowed values, and response model schemas) can be serialized to JSON files and loaded back, enabling metric sharing across teams and reproducible evaluation setups.
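A minimal round-trip sketch of this idea. The config layout below is hypothetical; the actual on-disk schema is ragas's own:

```python
import json
import os
import tempfile

# Assumed fields only: the real serialized form may include more
# (e.g. the response model schema).
metric_config = {
    "name": "response_correctness",
    "prompt": "Evaluate the response: {response}. Answer 'pass' or 'fail'.",
    "allowed_values": ["pass", "fail"],
}

def save_metric(config, path):
    """Serialize a metric definition so teams can share and version it."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)

def load_metric(path):
    """Restore a metric definition for a reproducible evaluation setup."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.gettempdir(), "response_correctness.json")
save_metric(metric_config, path)
loaded = load_metric(path)
```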
Usage
Use the LLM as Judge Metric principle when:
- Evaluating LLM outputs on subjective criteria like correctness, relevance, or quality
- Needing binary pass/fail assessments or multi-category classifications
- Requiring both a score and an explanation for each evaluation
- Validating LLM judge accuracy against human-labeled gold standards
- Building reusable evaluation rubrics that can be shared and versioned
Theoretical Basis
The theoretical foundation combines constrained LLM generation with inter-rater reliability measurement:
PROCEDURE llm_as_judge(prompt_template, allowed_values, llm, **inputs):
    1. Define the response schema:
       Create a Pydantic model with:
         value: Literal[allowed_values]  -- constrained to predefined categories
         reason: str                     -- explanation for the judgment
    2. Prepare the evaluation prompt:
       Fill prompt_template placeholders with input values
       Example: "Evaluate the response: {response}" becomes
                "Evaluate the response: Python is a language."
    3. Generate the structured judgment:
       Call llm.generate(prompt, response_model=ResponseSchema)
       The LLM is constrained to return only valid values
    4. Package the result:
       Return MetricResult(value=response.value, reason=response.reason)

PROCEDURE validate_judge(gold_labels, predictions):
    1. Compute Cohen's Kappa between gold_labels and predictions
    2. Return the inter-rater agreement score
       Kappa > 0.6 indicates substantial agreement
       Kappa > 0.8 indicates almost perfect agreement
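The first procedure can be exercised end to end with a stub in place of a real LLM. Everything here is a sketch: `stub_judge` and its keyword-matching logic are stand-ins for an actual constrained `llm.generate` call.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    value: str
    reason: str

def llm_as_judge(prompt_template, allowed_values, llm, **inputs):
    """Sketch of the procedure above; `llm` is any callable that
    takes a prompt and returns a (value, reason) pair."""
    prompt = prompt_template.format(**inputs)            # step 2
    value, reason = llm(prompt)                          # step 3
    if value not in allowed_values:                      # step 1's constraint
        raise ValueError(f"judge returned {value!r}, "
                         f"expected one of {allowed_values}")
    return MetricResult(value=value, reason=reason)      # step 4

def stub_judge(prompt):
    # Stand-in for a real LLM call: passes anything mentioning "Python".
    if "Python" in prompt:
        return "pass", "The response mentions the expected topic."
    return "fail", "The response is off-topic."

result = llm_as_judge(
    "Evaluate the response: {response}",
    ["pass", "fail"],
    stub_judge,
    response="Python is a language.",
)
```

Swapping `stub_judge` for a real model call leaves the rest of the flow unchanged, which is what makes the pattern testable without an LLM in the loop.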
This approach transforms subjective evaluation into a structured, reproducible process where the LLM judge's quality can be quantitatively measured against human baselines.