Principle: run-llama/llama_index Evaluator Configuration

Overview

Evaluator Configuration defines how LLM-as-judge evaluation metrics are set up to assess the quality of RAG pipeline outputs. LlamaIndex provides three core evaluators — FaithfulnessEvaluator, RelevancyEvaluator, and CorrectnessEvaluator — each targeting a distinct dimension of response quality. Configuring these evaluators correctly is essential for building a comprehensive, automated evaluation pipeline that catches hallucinations, irrelevant retrievals, and incorrect answers.

The LLM-as-judge paradigm uses a (typically stronger) LLM to evaluate the outputs of another LLM, replacing or augmenting expensive human evaluation with scalable, consistent automated assessment.


LLM-as-Judge Evaluation Metrics

The three core evaluators form a complementary evaluation suite, each measuring a different failure mode:

Evaluator	Measures	Failure Mode Detected	Output Type
FaithfulnessEvaluator	Whether the response is supported by the retrieved context	Hallucination — the model generates claims not present in the source	Boolean (passing/failing)
RelevancyEvaluator	Whether the retrieved context and response are relevant to the query	Irrelevant retrieval — the system retrieves off-topic content	Boolean (passing/failing)
CorrectnessEvaluator	Whether the response matches a reference answer in quality and accuracy	Incorrect answer — the model produces a wrong or low-quality answer	Score (1.0–5.0) with threshold

Faithfulness: Hallucination Detection

Faithfulness evaluation answers the question: "Is every claim in the response supported by the retrieved context?" This is often the most critical evaluator for production RAG systems, because hallucination (generating plausible but unsupported information) is a primary risk of LLM-based generation.

The evaluator works by:

Taking the response and the source contexts as input
Asking the judge LLM to verify each claim in the response against the provided contexts
Returning a binary pass/fail verdict with explanatory feedback
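
The three steps above can be sketched in plain Python. This is a simplified stand-in, not LlamaIndex's implementation: `judge` is a hypothetical callable that takes a prompt string and returns the judge LLM's text, `Verdict` is a made-up result type, and the YES/NO prompt wording is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    """Hypothetical result type: binary verdict plus the judge's explanation."""
    passing: bool
    feedback: str


def check_faithfulness(
    response: str,
    contexts: List[str],
    judge: Callable[[str], str],
    raise_error: bool = False,
) -> Verdict:
    # Step 1: take the response and the source contexts as input.
    context_block = "\n\n".join(contexts)
    prompt = (
        "Is every claim in the information supported by the context? "
        "Answer YES or NO, then explain.\n"
        f"Information: {response}\n"
        f"Context: {context_block}\n"
        "Answer: "
    )
    # Step 2: ask the judge LLM to verify the response against the contexts.
    raw = judge(prompt)
    # Step 3: reduce the judge's text to pass/fail and keep the text as feedback.
    passing = raw.strip().lower().startswith("yes")
    if not passing and raise_error:
        raise ValueError("Response is not supported by the retrieved context")
    return Verdict(passing=passing, feedback=raw)


# Stub judge for demonstration; a real judge would call an LLM.
def stub_judge(prompt: str) -> str:
    return "NO. The context does not mention a release date."


verdict = check_faithfulness(
    response="The library was released in 2019.",
    contexts=["The library provides three evaluators."],
    judge=stub_judge,
)
print(verdict.passing, "-", verdict.feedback)
```

Passing the judge in as a callable keeps the sketch testable without network access; swapping in a real model only changes what `judge` does.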

Key configuration considerations:

eval_template: controls how the faithfulness check is framed to the judge LLM; the default template works well for most cases
refine_template: used when the context is too long for a single evaluation call, so the verdict is refined iteratively across context chunks
raise_error: when True, raises an exception if the response fails the check instead of returning a failing result; useful for strict pipeline enforcement

Relevancy: Query-Context-Response Alignment

Relevancy evaluation answers: "Is the retrieved context relevant to the query, and does the response actually address the query using that context?" This evaluator catches cases where the retrieval step returns tangentially related or completely off-topic content.

The evaluation considers the three-way alignment between query, context, and response:

The context must be relevant to the query (retrieval quality)
The response must address the query (generation quality)
The response must draw from the provided context (grounding quality)

Configuration mirrors FaithfulnessEvaluator: the same eval_template, refine_template, and raise_error parameters apply.

Correctness: Answer Quality Scoring

Correctness evaluation answers: "How well does the generated response match a known reference answer?" Unlike faithfulness and relevancy, which produce binary verdicts, correctness produces a numerical score (typically 1.0 to 5.0), enabling nuanced quality assessment.

Key configuration considerations:

score_threshold — the minimum score (default 4.0) for a response to be considered "passing"; allows tuning evaluation strictness
parser_function — a callable that extracts the score from the judge LLM's output; can be customized for different scoring schemes
eval_template — the prompt template that instructs the judge how to compare the response against the reference answer
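
A minimal sketch of how `parser_function` and `score_threshold` interact. It assumes the parser receives the judge's raw output as a string and returns a `(score, reasoning)` pair, and that the judge was prompted to put the score on the first line. `first_line_score_parser` and `is_passing` are hypothetical names for illustration, not part of LlamaIndex.

```python
from typing import Optional, Tuple


def first_line_score_parser(judge_output: str) -> Tuple[Optional[float], Optional[str]]:
    """Read a numeric score from the first line; treat the rest as reasoning."""
    text = judge_output.strip()
    if not text:
        return None, "No response from judge"
    first_line, _, rest = text.partition("\n")
    try:
        score = float(first_line.strip())
    except ValueError:
        # The judge did not follow the expected format; keep its text as feedback.
        return None, text
    return score, rest.strip()


def is_passing(score: Optional[float], score_threshold: float = 4.0) -> bool:
    """A response passes when its score meets or exceeds the threshold."""
    return score is not None and score >= score_threshold


score, reasoning = first_line_score_parser("4.5\nMatches the reference; minor omission.")
print(score, is_passing(score), is_passing(score, score_threshold=5.0))
```

A function with this shape could be supplied as `parser_function` when the judge prompt uses a different output layout; raising `score_threshold` makes the same score fail, which is how strictness is tuned.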

Correctness evaluation compares the response against a reference answer (ground truth), so meaningful scores depend on having an evaluation dataset. This distinguishes it from faithfulness and relevancy, which can operate without ground truth.

Choosing the Right Judge LLM

The quality of evaluation depends heavily on the judge LLM:

Stronger models produce better evaluations: highly capable models such as GPT-4 or Claude are generally preferred over smaller models for judging
Temperature should be 0: this makes the judge's verdicts as deterministic and repeatable as the model allows
The judge should differ from the generator: using the same model to generate and evaluate can bias the judge toward its own outputs
Cost trade-off: stronger judge models cost more per evaluation call but provide more reliable results
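
One way to wire a single judge LLM into all three evaluators is sketched below. It assumes a recent LlamaIndex release that uses the `llama_index.core` package layout, with the separate OpenAI integration package installed. `build_evaluators` and `make_judge_llm` are hypothetical helper names, and the model name is only an example.

```python
from typing import Any, Dict


def build_evaluators(judge_llm: Any, score_threshold: float = 4.0) -> Dict[str, Any]:
    """Create the three evaluators around one shared judge LLM."""
    # Imported lazily so this module loads even where LlamaIndex is not installed.
    from llama_index.core.evaluation import (
        CorrectnessEvaluator,
        FaithfulnessEvaluator,
        RelevancyEvaluator,
    )

    return {
        "faithfulness": FaithfulnessEvaluator(llm=judge_llm, raise_error=False),
        "relevancy": RelevancyEvaluator(llm=judge_llm, raise_error=False),
        "correctness": CorrectnessEvaluator(
            llm=judge_llm, score_threshold=score_threshold
        ),
    }


def make_judge_llm() -> Any:
    """Example judge: an OpenAI model at temperature 0 (model name is illustrative)."""
    # Assumes the llama-index-llms-openai integration package is installed.
    from llama_index.llms.openai import OpenAI

    return OpenAI(model="gpt-4", temperature=0)
```

A call might then look like `build_evaluators(make_judge_llm())["faithfulness"].evaluate(query=question, response=answer, contexts=chunks)`, where `question`, `answer`, and `chunks` are placeholder variables. Keeping the judge separate from the generator's LLM object makes it easy to use a different model for each role.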

Evaluation Template Customization

All three evaluators accept custom prompt templates, enabling:

Domain-specific evaluation criteria: adding industry- or task-specific requirements to the evaluation prompt
Multilingual evaluation: adapting templates for non-English evaluation
Stricter or more lenient evaluation: modifying how strictly the judge interprets faithfulness, relevancy, or correctness
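
As an illustration of a domain-specific, stricter template, the sketch below defines a custom faithfulness prompt as a plain string and renders it locally. The placeholder names `{query_str}` (the text being checked) and `{context_str}` follow what the default faithfulness prompt appears to use; treat them as an assumption and check them against the installed version. The clinical scenario is invented for the example.

```python
# Assumption: a custom template should keep the same placeholders as the default
# prompt and still ask for a YES/NO answer so the verdict can be parsed.
STRICT_FAITHFULNESS_TEMPLATE = (
    "You are auditing answers for a clinical documentation system.\n"
    "Decide whether the information below is fully supported by the context.\n"
    "Treat any dosage, date, or figure missing from the context as unsupported.\n"
    "Answer YES or NO, then give a one-sentence reason.\n"
    "Information: {query_str}\n"
    "Context: {context_str}\n"
    "Answer: "
)

# Render the template locally to inspect exactly what the judge would see.
preview = STRICT_FAITHFULNESS_TEMPLATE.format(
    query_str="The recommended dose is 20 mg daily.",
    context_str="The label recommends 10 mg daily for adults.",
)
print(preview)
```

A string like this could then be supplied as `eval_template` when constructing the evaluator. Previewing the rendered prompt first is a cheap way to catch a missing placeholder before spending judge calls.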

The EvaluationResult Contract

All evaluators return an EvaluationResult object with a consistent interface:

passing — boolean indicating whether the response passed evaluation
score — optional numeric score (primarily used by CorrectnessEvaluator)
feedback — string explanation from the judge LLM justifying its verdict
query, response, contexts — the inputs that were evaluated

This uniform interface enables composability — results from different evaluators can be aggregated, compared, and analyzed using the same code paths.
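
The composability point can be illustrated with a small aggregation routine. `ResultLike` is a stand-in dataclass exposing only the `passing`, `score`, and `feedback` fields described above, so the sketch runs without LlamaIndex; `summarize` is a hypothetical helper, not a library function.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class ResultLike:
    """Stand-in exposing the fields this summary relies on."""
    passing: Optional[bool]
    score: Optional[float] = None
    feedback: Optional[str] = None


def summarize(
    results: Dict[str, List[ResultLike]]
) -> Dict[str, Dict[str, Optional[float]]]:
    """Compute pass rate and mean score per evaluator with one code path."""
    summary: Dict[str, Dict[str, Optional[float]]] = {}
    for name, items in results.items():
        judged = [r for r in items if r.passing is not None]
        scores = [r.score for r in items if r.score is not None]
        summary[name] = {
            "pass_rate": sum(r.passing for r in judged) / len(judged) if judged else None,
            "mean_score": mean(scores) if scores else None,
        }
    return summary


report = summarize({
    "faithfulness": [ResultLike(True), ResultLike(False, feedback="Unsupported claim")],
    "correctness": [ResultLike(True, 4.5), ResultLike(False, 3.0)],
})
print(report)
```

Because the function only reads shared fields, results from any evaluator that follows the same contract can be passed in without special cases.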

