Principle:Run llama Llama index Evaluator Configuration
Overview
Evaluator Configuration defines how LLM-as-judge evaluation metrics are set up to assess the quality of RAG pipeline outputs. LlamaIndex provides three core evaluators — FaithfulnessEvaluator, RelevancyEvaluator, and CorrectnessEvaluator — each targeting a distinct dimension of response quality. Configuring these evaluators correctly is essential for building a comprehensive, automated evaluation pipeline that catches hallucinations, irrelevant retrievals, and incorrect answers.
The LLM-as-judge paradigm uses a (typically stronger) LLM to evaluate the outputs of another LLM, replacing or augmenting expensive human evaluation with scalable, consistent automated assessment.
LLM-as-Judge Evaluation Metrics
The three core evaluators form a complementary evaluation suite, each measuring a different failure mode:
| Evaluator | Measures | Failure Mode Detected | Output Type |
|---|---|---|---|
| FaithfulnessEvaluator | Whether the response is supported by the retrieved context | Hallucination — the model generates claims not present in the source | Boolean (passing/failing) |
| RelevancyEvaluator | Whether the retrieved context and response are relevant to the query | Irrelevant retrieval — the system retrieves off-topic content | Boolean (passing/failing) |
| CorrectnessEvaluator | Whether the response matches a reference answer in quality and accuracy | Incorrect answer — the model produces a wrong or low-quality answer | Score (1.0–5.0) with threshold |
Faithfulness: Hallucination Detection
Faithfulness evaluation answers the question: "Is every claim in the response supported by the retrieved context?" This is the most critical evaluator for production RAG systems because hallucination — generating plausible but unsupported information — is the primary risk of LLM-based generation.
The evaluator works by:
- Taking the response and the source contexts as input
- Asking the judge LLM to verify each claim in the response against the provided contexts
- Returning a binary pass/fail verdict with explanatory feedback
Key configuration considerations:
- eval_template — controls how the faithfulness check is framed to the judge LLM; the default template works well for most cases
- refine_template — used when context is too long for a single evaluation call, enabling iterative refinement
- raise_error — when True, raises an exception if the response fails the check instead of returning a failing result; useful for strict pipeline enforcement
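The faithfulness flow described in this section can be sketched in plain Python. This is a simplified stand-in, not LlamaIndex's implementation: `evaluate_faithfulness`, `FAITHFULNESS_TEMPLATE`, and the lambda judges are hypothetical names, and parsing a leading YES/NO is only an assumption about how a binary verdict might be extracted from the judge's reply.

```python
from typing import Callable, Dict, List

# Hypothetical prompt; the library's real default template is not reproduced here.
FAITHFULNESS_TEMPLATE = (
    "Is the statement supported by the context? "
    "Answer YES or NO, then explain.\n"
    "Statement: {response}\n"
    "Context:\n{context}\n"
)


def evaluate_faithfulness(
    response: str,
    contexts: List[str],
    judge: Callable[[str], str],
    raise_error: bool = False,
) -> Dict[str, object]:
    """Ask a judge callable whether `response` is supported by `contexts`."""
    prompt = FAITHFULNESS_TEMPLATE.format(
        response=response, context="\n".join(contexts)
    )
    raw_verdict = judge(prompt)
    # Assumed convention: the judge's reply begins with YES or NO.
    passing = raw_verdict.strip().upper().startswith("YES")
    if raise_error and not passing:
        raise ValueError(f"Response failed faithfulness check: {raw_verdict}")
    return {"passing": passing, "feedback": raw_verdict}


# Usage with a stub judge standing in for a real LLM call.
result = evaluate_faithfulness(
    response="Paris is the capital of France.",
    contexts=["Paris is the capital and largest city of France."],
    judge=lambda _prompt: "YES. The context states this directly.",
)
```

With `raise_error=True`, the same call would raise `ValueError` for a failing verdict, which is convenient when a failed check should stop a pipeline rather than be logged.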
Relevancy: Query-Context-Response Alignment
Relevancy evaluation answers: "Is the retrieved context relevant to the query, and does the response actually address the query using that context?" This evaluator catches cases where the retrieval step returns tangentially related or completely off-topic content.
The evaluation considers the three-way alignment between query, context, and response:
- The context must be relevant to the query (retrieval quality)
- The response must address the query (generation quality)
- The response must draw from the provided context (grounding quality)
Configuration mirrors FaithfulnessEvaluator with eval_template, refine_template, and raise_error parameters.
Correctness: Answer Quality Scoring
Correctness evaluation answers: "How well does the generated response match a known reference answer?" Unlike faithfulness and relevancy, which produce binary verdicts, correctness produces a numerical score (typically 1.0 to 5.0), enabling nuanced quality assessment.
Key configuration considerations:
- score_threshold — the minimum score (default 4.0) for a response to be considered "passing"; allows tuning evaluation strictness
- parser_function — a callable that extracts the score from the judge LLM's output; can be customized for different scoring schemes
- eval_template — the prompt template that instructs the judge how to compare the response against the reference answer
Correctness evaluation requires a reference answer (ground truth), so it depends on having an evaluation dataset. This distinguishes it from faithfulness and relevancy, which can operate without ground truth.
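To make the interplay of parser_function and score_threshold concrete, here is a minimal sketch of a custom parser and the threshold check. The names `parse_score_and_reasoning` and `is_passing` are hypothetical, and the `(score, reasoning)` return shape is an assumption about what the evaluator expects from a parser; check it against the installed LlamaIndex version before plugging it in.

```python
import re
from typing import Optional, Tuple


def parse_score_and_reasoning(
    judge_output: str,
) -> Tuple[Optional[float], Optional[str]]:
    """Take the first number in the judge output as the score.

    The text after that number is returned as the reasoning.
    Returns (None, None) when no number is found.
    """
    match = re.search(r"\d+(?:\.\d+)?", judge_output)
    if match is None:
        return None, None
    score = float(match.group())
    reasoning = judge_output[match.end():].strip()
    return score, reasoning or None


def is_passing(score: Optional[float], score_threshold: float = 4.0) -> bool:
    """A response passes when its score meets or exceeds the threshold."""
    return score is not None and score >= score_threshold


score, reasoning = parse_score_and_reasoning("4.5\nMatches the reference answer.")
passed = is_passing(score)
```

Raising `score_threshold` makes the evaluation stricter without touching the prompt, which keeps strictness tuning separate from template changes.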
Choosing the Right Judge LLM
The quality of evaluation depends heavily on the judge LLM:
- Stronger models produce better evaluations — GPT-4 or Claude are preferred over smaller models for judging
- Temperature should be 0 — evaluation should be deterministic and consistent
- The judge should differ from the generator — using the same model to generate and evaluate can introduce self-preference bias, since a model tends to rate its own outputs favorably
- Cost trade-off — stronger judge models cost more per evaluation call but provide more reliable results
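The guidelines above might translate into a setup like the following sketch. It assumes a recent LlamaIndex release with the `llama-index-core` and `llama-index-llms-openai` packages installed and an OpenAI API key in the environment; `build_evaluators` is a hypothetical helper, and the model name is only an example of a stronger judge.

```python
def build_evaluators(judge_model: str = "gpt-4"):
    """Create the three evaluators around one deterministic judge LLM."""
    # Imported lazily so this module still loads where LlamaIndex is absent.
    from llama_index.core.evaluation import (
        CorrectnessEvaluator,
        FaithfulnessEvaluator,
        RelevancyEvaluator,
    )
    from llama_index.llms.openai import OpenAI

    # Temperature 0 keeps the judge's verdicts as repeatable as possible.
    judge_llm = OpenAI(model=judge_model, temperature=0)

    return {
        "faithfulness": FaithfulnessEvaluator(llm=judge_llm),
        "relevancy": RelevancyEvaluator(llm=judge_llm),
        "correctness": CorrectnessEvaluator(llm=judge_llm, score_threshold=4.0),
    }
```

Sharing one judge instance across the evaluators keeps the judging model and temperature consistent; the generator model should be configured separately so that it differs from the judge.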
Evaluation Template Customization
All three evaluators accept custom prompt templates, enabling:
- Domain-specific evaluation criteria — adding industry or task-specific requirements to the evaluation prompt
- Multilingual evaluation — adapting templates for non-English evaluation
- Stricter or more lenient evaluation — modifying how strictly the judge interprets faithfulness, relevancy, or correctness
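As an illustration of domain-specific customization, the sketch below prepends extra rules to a base prompt while leaving its placeholders intact. Everything here is hypothetical: the base prompt text, the helper `with_domain_rules`, and the placeholder names `{query_str}` and `{context_str}`, which are assumed for illustration and must match whatever the target evaluator actually fills in.

```python
from typing import List

# Hypothetical base prompt; placeholder names are illustrative assumptions.
BASE_FAITHFULNESS_PROMPT = (
    "Decide whether the information is supported by the context.\n"
    "Answer YES or NO.\n"
    "Information: {query_str}\n"
    "Context: {context_str}\n"
)


def with_domain_rules(base_prompt: str, rules: List[str]) -> str:
    """Prepend numbered domain rules, leaving the placeholders untouched.

    Rules must not contain curly braces, or later formatting would fail.
    """
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return f"Apply these additional rules when judging:\n{numbered}\n\n{base_prompt}"


clinical_prompt = with_domain_rules(
    BASE_FAITHFULNESS_PROMPT,
    [
        "Treat any dosage not stated in the context as unsupported.",
        "Ignore differences between brand and generic drug names.",
    ],
)
```

A string built this way could then be supplied as eval_template, provided the placeholders agree with the evaluator's expectations.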
The EvaluationResult Contract
All evaluators return an EvaluationResult object with a consistent interface:
- passing — boolean indicating whether the response passed evaluation
- score — optional numeric score (primarily used by CorrectnessEvaluator)
- feedback — string explanation from the judge LLM justifying its verdict
- query, response, contexts — the inputs that were evaluated
This uniform interface enables composability — results from different evaluators can be aggregated, compared, and analyzed using the same code paths.
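A small sketch of what that composability can look like in practice: because every result exposes the same fields, one function can summarize any evaluator's output. The `Result` dataclass is a stand-in that mirrors only some of the fields listed above, not the library class itself, and `pass_rates` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Result:
    """Stand-in mirroring a subset of the EvaluationResult fields."""

    passing: Optional[bool] = None
    score: Optional[float] = None
    feedback: Optional[str] = None


def pass_rates(results_by_evaluator: Dict[str, List[Result]]) -> Dict[str, float]:
    """Fraction of passing results per evaluator; empty lists map to 0.0."""
    rates: Dict[str, float] = {}
    for name, results in results_by_evaluator.items():
        passed = sum(1 for r in results if r.passing)
        rates[name] = passed / len(results) if results else 0.0
    return rates


results = {
    "faithfulness": [
        Result(passing=True),
        Result(passing=False, feedback="Claim not found in context."),
    ],
    "correctness": [Result(passing=True, score=4.5)],
}
rates = pass_rates(results)
```

The same loop works unchanged whether a result came from a binary evaluator or a scored one, since both populate `passing`.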
Knowledge Sources
- LlamaIndex Evaluation
- LlamaIndex Evaluator Modules