
Principle:Run llama Llama index Evaluator Configuration

From Leeroopedia

Overview

Evaluator Configuration defines how LLM-as-judge evaluation metrics are set up to assess the quality of RAG pipeline outputs. LlamaIndex provides three core evaluators — FaithfulnessEvaluator, RelevancyEvaluator, and CorrectnessEvaluator — each targeting a distinct dimension of response quality. Configuring these evaluators correctly is essential for building a comprehensive, automated evaluation pipeline that catches hallucinations, irrelevant retrievals, and incorrect answers.

The LLM-as-judge paradigm uses a (typically stronger) LLM to evaluate the outputs of another LLM, replacing or augmenting expensive human evaluation with scalable, consistent automated assessment.


LLM-as-Judge Evaluation Metrics

The three core evaluators form a complementary evaluation suite, each measuring a different failure mode:

| Evaluator | Measures | Failure Mode Detected | Output Type |
| --- | --- | --- | --- |
| FaithfulnessEvaluator | Whether the response is supported by the retrieved context | Hallucination: the model generates claims not present in the source | Boolean (passing/failing) |
| RelevancyEvaluator | Whether the retrieved context and response are relevant to the query | Irrelevant retrieval: the system retrieves off-topic content | Boolean (passing/failing) |
| CorrectnessEvaluator | Whether the response matches a reference answer in quality and accuracy | Incorrect answer: the model produces a wrong or low-quality answer | Score (1.0–5.0) with threshold |

Faithfulness: Hallucination Detection

Faithfulness evaluation answers the question: "Is every claim in the response supported by the retrieved context?" This is the most critical evaluator for production RAG systems because hallucination — generating plausible but unsupported information — is the primary risk of LLM-based generation.

The evaluator works by:

  • Taking the response and the source contexts as input
  • Asking the judge LLM to verify each claim in the response against the provided contexts
  • Returning a binary pass/fail verdict with explanatory feedback
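The three steps above can be sketched without any library at all. This is a minimal illustration of the flow, not LlamaIndex's implementation: the prompt wording, the `check_faithfulness` helper, and the `stub_judge` function are all hypothetical, and a real judge would be an LLM call rather than a string check.

```python
from typing import Callable, List, Tuple


def check_faithfulness(
    response: str,
    contexts: List[str],
    judge: Callable[[str], str],
) -> Tuple[bool, str]:
    """Return (passing, feedback) for a response checked against contexts."""
    # Step 1: combine the response and the source contexts into one prompt.
    prompt = (
        "Is every claim in the statement below supported by the context? "
        "Answer YES or NO, then explain.\n"
        f"Statement: {response}\n"
        "Context:\n" + "\n---\n".join(contexts) + "\nAnswer: "
    )
    # Step 2: ask the judge to verify the claims.
    feedback = judge(prompt)
    # Step 3: reduce the judge's answer to a binary verdict plus feedback.
    passing = feedback.strip().lower().startswith("yes")
    return passing, feedback


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for a judge LLM: approves only if the
    context section mentions "Paris"."""
    context_part = prompt.split("Context:\n", 1)[1]
    if "Paris" in context_part:
        return "YES: the context supports the statement."
    return "NO: the context does not support the statement."
```

Swapping `stub_judge` for a function that calls a real model keeps the rest of the flow unchanged, which is why the judge is passed in as a plain callable here.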

Key configuration considerations:

  • eval_template: controls how the faithfulness check is framed to the judge LLM; the default template is a reasonable starting point for most cases
  • refine_template: used when the context is too long for a single evaluation call, so the verdict is refined iteratively over successive chunks of context
  • raise_error: when True, raises an exception if the response fails the check instead of returning a failing result; useful for strict pipeline enforcement

Relevancy: Query-Context-Response Alignment

Relevancy evaluation answers: "Is the retrieved context relevant to the query, and does the response actually address the query using that context?" This evaluator catches cases where the retrieval step returns tangentially related or completely off-topic content.

The evaluation considers the three-way alignment between query, context, and response:

  • The context must be relevant to the query (retrieval quality)
  • The response must address the query (generation quality)
  • The response must draw from the provided context (grounding quality)

Configuration mirrors FaithfulnessEvaluator with eval_template, refine_template, and raise_error parameters.

Correctness: Answer Quality Scoring

Correctness evaluation answers: "How well does the generated response match a known reference answer?" Unlike faithfulness and relevancy, which produce binary verdicts, correctness produces a numerical score (typically 1.0 to 5.0), enabling nuanced quality assessment.

Key configuration considerations:

  • score_threshold — the minimum score (default 4.0) for a response to be considered "passing"; allows tuning evaluation strictness
  • parser_function — a callable that extracts the score from the judge LLM's output; can be customized for different scoring schemes
  • eval_template — the prompt template that instructs the judge how to compare the response against the reference answer
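To make score_threshold and parser_function concrete, here is a minimal sketch of a custom parser and the threshold check. It assumes the parser is a callable that takes the judge's raw text and returns a (score, reasoning) pair, and that a score at or above the threshold counts as passing; both assumptions should be checked against the installed LlamaIndex version. The names `parse_score` and `is_passing` are hypothetical.

```python
import re
from typing import Optional, Tuple


def parse_score(judge_output: str) -> Tuple[Optional[float], Optional[str]]:
    """Extract a leading numeric score and the remaining reasoning text.

    Accepts outputs such as "4.5\\nReason..." or "Score: 3\\nReason...".
    Returns (None, None) when no leading score is found.
    """
    match = re.match(
        r"\s*(?:score\s*[:=]\s*)?(\d+(?:\.\d+)?)",
        judge_output,
        re.IGNORECASE,
    )
    if match is None:
        return None, None
    score = float(match.group(1))
    reasoning = judge_output[match.end():].strip()
    return score, reasoning


def is_passing(score: Optional[float], score_threshold: float = 4.0) -> bool:
    """A missing score never passes; otherwise compare against the threshold."""
    return score is not None and score >= score_threshold
```

Raising the threshold toward 5.0 makes the evaluation stricter, while a more tolerant parser (as above) reduces the number of results lost to formatting drift in the judge's output.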

Correctness evaluation requires a reference answer (ground truth), making it dependent on having an evaluation dataset. This distinguishes it from faithfulness and relevancy, which can operate without ground truth.

Choosing the Right Judge LLM

The quality of evaluation depends heavily on the judge LLM:

  • Stronger models produce better evaluations — GPT-4 or Claude are preferred over smaller models for judging
  • Temperature should be 0 — evaluation should be deterministic and consistent
  • The judge should differ from the generator — using the same model to generate and evaluate introduces bias
  • Cost trade-off — stronger judge models cost more per evaluation call but provide more reliable results
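One way to apply these guidelines is to build a single deterministic judge and share it across all three evaluators. The sketch below assumes the `llama-index-core` and `llama-index-llms-openai` packages are installed and an OpenAI API key is configured; the model name is a placeholder, and `build_evaluators` is a hypothetical helper, not part of LlamaIndex.

```python
def build_evaluators(judge_model: str = "gpt-4"):
    """Create the three evaluators around one temperature-0 judge LLM."""
    # Imports are local so this module still loads when the optional
    # packages are not installed; they are only needed when called.
    from llama_index.core.evaluation import (
        CorrectnessEvaluator,
        FaithfulnessEvaluator,
        RelevancyEvaluator,
    )
    from llama_index.llms.openai import OpenAI

    # Temperature 0 keeps the judge's verdicts as repeatable as possible.
    judge = OpenAI(model=judge_model, temperature=0)

    return {
        "faithfulness": FaithfulnessEvaluator(llm=judge),
        "relevancy": RelevancyEvaluator(llm=judge),
        # 4.0 matches the default threshold described in this article.
        "correctness": CorrectnessEvaluator(llm=judge, score_threshold=4.0),
    }
```

If the generator in the pipeline is also a GPT-4-class model, passing a different `judge_model` is the simplest way to keep judge and generator distinct.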

Evaluation Template Customization

All three evaluators accept custom prompt templates, enabling:

  • Domain-specific evaluation criteria: adding industry- or task-specific requirements to the evaluation prompt
  • Multilingual evaluation: adapting templates for non-English evaluation
  • Stricter or more lenient evaluation: modifying how strictly the judge interprets faithfulness, relevancy, or correctness
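As an illustration of a stricter, domain-specific template, the sketch below defines a faithfulness prompt for a hypothetical legal-review use case. The placeholder names `{query_str}` and `{context_str}` are an assumption about what the faithfulness evaluator fills in; confirm them against the default template in the installed version before wrapping the string in a prompt template object and passing it as eval_template.

```python
# Hypothetical stricter template for a legal-review pipeline.
STRICT_FAITHFULNESS_TEMPLATE = (
    "You are reviewing answers produced by a legal research assistant.\n"
    "Decide whether the information below is fully supported by the context.\n"
    "Answer YES only if every date, party name, and obligation in the\n"
    "information appears in the context. Otherwise answer NO.\n"
    "Information: {query_str}\n"
    "Context: {context_str}\n"
    "Answer: "
)


def render(template: str, **fields: str) -> str:
    """Fill the template's placeholders, as the evaluator would at call time."""
    return template.format(**fields)


example_prompt = render(
    STRICT_FAITHFULNESS_TEMPLATE,
    query_str="The clause expires in 2030.",
    context_str="The clause expires on 1 January 2030.",
)
```

Rendering the template by hand like this is a cheap way to inspect exactly what the judge will see before spending evaluation calls on it.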

The EvaluationResult Contract

All evaluators return an EvaluationResult object with a consistent interface:

  • passing — boolean indicating whether the response passed evaluation
  • score — optional numeric score (primarily used by CorrectnessEvaluator)
  • feedback — string explanation from the judge LLM justifying its verdict
  • query, response, contexts — the inputs that were evaluated

This uniform interface enables composability — results from different evaluators can be aggregated, compared, and analyzed using the same code paths.
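Because every result exposes the same fields, one aggregation function can serve all evaluators. The sketch below uses a hypothetical `Result` dataclass as a stand-in for EvaluationResult (only the `passing`, `score`, and `feedback` fields are modeled) so it runs without LlamaIndex installed; `pass_rates` is likewise an illustrative helper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Result:
    """Hypothetical stand-in exposing a subset of the fields listed above."""
    passing: Optional[bool]
    score: Optional[float] = None
    feedback: Optional[str] = None


def pass_rates(results_by_evaluator: Dict[str, List[Result]]) -> Dict[str, float]:
    """Compute the fraction of passing results for each evaluator."""
    rates: Dict[str, float] = {}
    for name, results in results_by_evaluator.items():
        if not results:
            # No results recorded: report 0.0 rather than dividing by zero.
            rates[name] = 0.0
            continue
        passed = sum(1 for r in results if r.passing)
        rates[name] = passed / len(results)
    return rates


rates = pass_rates(
    {
        "faithfulness": [Result(True), Result(False)],
        "correctness": [
            Result(True, 4.5),
            Result(True, 4.0),
            Result(False, 2.0),
            Result(True, 5.0),
        ],
    }
)
```

The same function works unchanged whether the results come from a binary evaluator or a scored one, since it relies only on `passing`.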


2026-02-11 00:00 GMT
