Principle:Evidentlyai Evidently LLM Judge Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, NLP, AI_Safety |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
An LLM-as-judge evaluation mechanism that uses large language models to assess text quality attributes at the row level.
Description
LLM Judge Evaluation uses an external LLM (e.g., GPT-4o-mini) to evaluate specific quality attributes of text data on a per-row basis. Unlike rule-based descriptors that use pattern matching or statistical models, LLM judges apply natural language understanding to assess nuanced properties:
- Negativity: Detects negative sentiment, hostility, or toxicity in text
- Decline: Detects when an LLM refuses or declines to answer a question
- PII Detection: Identifies personally identifiable information
- Bias Detection: Identifies biased or discriminatory content
- Toxicity: Detects toxic or harmful language
LLM judges send each row to an external LLM API with a structured evaluation prompt and parse the response into a category label and optional score. This enables monitoring of LLM system outputs for safety, quality, and compliance.
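The parsing step described above can be sketched as follows. This is a minimal illustration, not Evidently's actual implementation: it assumes the judge replies with a JSON object, and the key names (`category`, `score`, `reasoning`), the `JudgeResult` type, and the `UNKNOWN` fallback label are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeResult:
    label: str                 # category chosen by the judge
    score: Optional[float]     # optional numeric score
    reasoning: Optional[str]   # optional free-text explanation


def parse_judge_response(raw: str, allowed_labels: set[str],
                         fallback: str = "UNKNOWN") -> JudgeResult:
    """Parse a JSON reply from the judge (assumed format, see lead-in).

    Malformed replies or labels outside `allowed_labels` map to `fallback`
    so that one bad response does not abort a whole evaluation run.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return JudgeResult(fallback, None, None)
    if not isinstance(data, dict):
        return JudgeResult(fallback, None, None)

    label = str(data.get("category", "")).strip().upper()
    if label not in allowed_labels:
        return JudgeResult(fallback, None, None)

    # Accept only real numbers as a score; bool is excluded explicitly
    # because it is a subclass of int in Python.
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        score = None
    else:
        score = float(score)

    return JudgeResult(label, score, data.get("reasoning"))
```

Mapping unparseable replies to a sentinel label rather than raising keeps the per-row output aligned with the input rows, which matters when the results are joined back onto the dataset.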
Usage
Use this principle when evaluating LLM-powered system outputs for safety, quality, or compliance properties that cannot be reliably measured with rule-based approaches. It requires an API key for the LLM provider (e.g., OpenAI) and incurs API costs that grow in proportion to the number of rows evaluated.
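Because cost scales with dataset size, a rough estimate before a run can be useful. The sketch below assumes one API call per row per judge and per-token pricing; the function name, its parameters, and any prices passed to it are placeholders, not real provider rates.

```python
def estimate_judge_cost(n_rows: int, n_judges: int,
                        avg_prompt_tokens: float, avg_completion_tokens: float,
                        usd_per_1k_prompt: float, usd_per_1k_completion: float):
    """Return (number of API calls, estimated cost in USD).

    Assumes one call per row per judge; token counts and prices are
    caller-supplied estimates.
    """
    calls = n_rows * n_judges
    prompt_cost = calls * avg_prompt_tokens / 1000 * usd_per_1k_prompt
    completion_cost = calls * avg_completion_tokens / 1000 * usd_per_1k_completion
    return calls, prompt_cost + completion_cost


# Example with made-up numbers: 1,000 rows, 2 judges (e.g., negativity and PII).
calls, cost = estimate_judge_cost(1000, 2, 500, 50, 0.001, 0.002)
print(f"{calls} calls, about ${cost:.2f}")
```

Doubling either the row count or the number of judges doubles both the call count and the estimate, which is the proportionality the usage note refers to.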
Theoretical Basis
LLM-as-judge follows the meta-evaluation paradigm, in which one model evaluates the output of another:
# Pseudocode: LLM judge evaluation
for row in dataset:
    prompt = format_evaluation_prompt(row[text_column], criteria="negativity")
    response = llm_api.generate(prompt)
    row["negativity_label"] = parse_category(response)
    row["negativity_score"] = parse_score(response)
    row["negativity_reasoning"] = parse_reasoning(response)
The evaluation prompt is structured to elicit consistent, parseable responses with category labels, numerical scores, and optional reasoning.
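A runnable version of the loop above might look like the following sketch. The prompt wording, the text markers, the JSON reply format, and the helper names (`build_prompt`, `judge_rows`, `fake_llm`) are illustrative assumptions rather than Evidently's actual templates or API. The LLM call is injected as a plain function so the example runs without an API key; `fake_llm` is a keyword stub standing in for a real provider call.

```python
import json
from typing import Callable

# Hypothetical prompt template: names the criterion, restricts the labels,
# and asks for a machine-readable reply.
PROMPT_TEMPLATE = """You are evaluating a text for {criteria}.
Classify the text between the markers as one of: {labels}.
Reply with JSON only, using the keys "category", "score" (0 to 1), and "reasoning".

___text_starts_here___
{text}
___text_ends_here___"""


def build_prompt(text: str, criteria: str, labels: list[str]) -> str:
    return PROMPT_TEMPLATE.format(criteria=criteria, labels=", ".join(labels), text=text)


def judge_rows(rows: list[dict], text_column: str, criteria: str,
               labels: list[str], call_llm: Callable[[str], str]) -> list[dict]:
    """Evaluate each row once and return copies with three added columns."""
    out = []
    for row in rows:
        reply = json.loads(call_llm(build_prompt(row[text_column], criteria, labels)))
        out.append({
            **row,
            f"{criteria}_label": reply["category"],
            f"{criteria}_score": reply.get("score"),
            f"{criteria}_reasoning": reply.get("reasoning"),
        })
    return out


def fake_llm(prompt: str) -> str:
    # Stand-in for a real API call: flags any text containing "hate".
    text = prompt.split("___text_starts_here___")[1]
    negative = "hate" in text.lower()
    return json.dumps({
        "category": "NEGATIVE" if negative else "POSITIVE",
        "score": 1.0 if negative else 0.0,
        "reasoning": "keyword stub, not a real judgment",
    })


rows = [{"answer": "I hate this."}, {"answer": "Thank you, very useful."}]
for r in judge_rows(rows, "answer", "negativity", ["NEGATIVE", "POSITIVE"], fake_llm):
    print(r["answer"], "->", r["negativity_label"])
```

Restricting the reply to a fixed label set and a JSON shape is what makes the output parseable. In practice the reply would also need validation, since a real model can return malformed output.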