

Heuristic:Run llama Llama index Evaluator LLM Selection

From Leeroopedia
Knowledge Sources
Domains Evaluation, LLMs
Last Updated 2026-02-11 19:00 GMT

Overview

Best practice for selecting the evaluator LLM in LlamaIndex's evaluation pipeline: use a stronger model than the one being evaluated.

Description

LlamaIndex's evaluation framework (FaithfulnessEvaluator, RelevancyEvaluator, CorrectnessEvaluator) uses an LLM-as-judge pattern where a language model evaluates the quality of another model's outputs. The evaluator LLM is configured independently of the RAG pipeline's LLM, and the choice of evaluator model significantly affects evaluation reliability.

Usage

Apply this heuristic when setting up the evaluation pipeline, specifically when:

  • Configuring `FaithfulnessEvaluator`, `RelevancyEvaluator`, or `CorrectnessEvaluator`
  • Choosing the LLM for the evaluators passed to `BatchEvalRunner`
  • Comparing different RAG configurations

The Insight (Rule of Thumb)

  • Action: Pass a stronger LLM to the evaluator than the one used in your RAG pipeline.
  • Value: If your pipeline uses GPT-3.5, evaluate with GPT-4. If your pipeline uses GPT-4, evaluate with GPT-4o or a specialized evaluator.
  • Default Temperature: LlamaIndex uses `DEFAULT_TEMPERATURE = 0.1`, a low value that keeps evaluation outputs close to deterministic.
  • Trade-off: Stronger evaluator models are more expensive per evaluation call, but produce more reliable quality assessments.
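The rule of thumb above can be sketched as a small helper. This is a minimal illustration, not LlamaIndex's own logic: `MODEL_STRENGTH`, `pick_evaluator_model`, and `build_evaluators` are hypothetical names, the ordering simply mirrors the GPT-3.5 / GPT-4 / GPT-4o example from the list, and the OpenAI model identifiers are assumptions about which models your account exposes.

```python
from typing import Dict

# Hypothetical ordering, weakest to strongest, taken from the rule of thumb
# above. Adjust it to the models you actually use.
MODEL_STRENGTH = ["gpt-3.5-turbo", "gpt-4", "gpt-4o"]


def pick_evaluator_model(pipeline_model: str) -> str:
    """Return the next-stronger model, or the strongest one if already at the top."""
    idx = MODEL_STRENGTH.index(pipeline_model)  # raises ValueError for unknown models
    # At the top of the list the evaluator equals the pipeline model; in that
    # case consider a specialized evaluator instead.
    return MODEL_STRENGTH[min(idx + 1, len(MODEL_STRENGTH) - 1)]


def build_evaluators(evaluator_model: str) -> Dict[str, object]:
    """Build the three evaluators with a dedicated judge LLM.

    Imports are local so pick_evaluator_model works without llama-index
    installed. Calling this function assumes llama-index and
    llama-index-llms-openai are installed and an OpenAI API key is configured.
    """
    from llama_index.core.evaluation import (
        CorrectnessEvaluator,
        FaithfulnessEvaluator,
        RelevancyEvaluator,
    )
    from llama_index.llms.openai import OpenAI

    # The judge is separate from whatever LLM the RAG pipeline uses.
    judge = OpenAI(model=evaluator_model, temperature=0.1)
    return {
        "faithfulness": FaithfulnessEvaluator(llm=judge),
        "relevancy": RelevancyEvaluator(llm=judge),
        "correctness": CorrectnessEvaluator(llm=judge),
    }
```

For example, `build_evaluators(pick_evaluator_model("gpt-3.5-turbo"))` would judge a GPT-3.5 pipeline with GPT-4, and the resulting dictionary could then be handed to `BatchEvalRunner`.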

Reasoning

Self-evaluation bias: If the same model generates and evaluates responses, it tends to rate its own outputs more favorably. A stronger model can better identify subtle factual errors, hallucinations, and relevancy issues.

Evaluator independence: The LlamaIndex evaluator classes accept an independent `llm` parameter, explicitly decoupling the evaluator from the pipeline's global LLM. This design choice reflects the best practice of using separate models.

Low temperature: The framework-wide default of 0.1 temperature reduces variance in evaluation scores, making comparisons more reliable across runs.

Code evidence from `constants.py:3`:

DEFAULT_TEMPERATURE = 0.1
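To see why a low temperature reduces variance, the sketch below applies temperature scaling to a softmax over made-up logits. It assumes the standard formulation in which logits are divided by the temperature before normalization. The logit values and the function name `softmax_with_temperature` are hypothetical, and a hosted model's exact sampling procedure may differ.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate judgement tokens.
logits = [2.0, 1.5, 0.5]

p_default = softmax_with_temperature(logits, 1.0)  # spread across candidates
p_low = softmax_with_temperature(logits, 0.1)      # mass concentrates on the top token

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_low])
```

At temperature 0.1 almost all probability sits on the highest-scoring token, so repeated evaluation runs are far more likely to produce the same verdict.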

Evaluator LLM independence from `evaluation/faithfulness.py` and `evaluation/relevancy.py`:

class FaithfulnessEvaluator(BaseEvaluator):
    def __init__(
        self,
        llm: Optional[LLM] = None,  # Independent LLM parameter
        ...
    )

Retry pattern from `evaluation/batch_runner.py:11-14`, which retries failed evaluation calls up to three times with exponential backoff:

@retry(
    reraise=True,
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
)
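The behavior of this decorator can be approximated in plain Python. The sketch below is not the library's implementation: `backoff_delay` and `call_with_retry` are hypothetical helpers, and the delay formula (multiplier times a power of two, clamped to the min and max) is an assumption about how the backoff schedule is computed.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, multiplier: float = 1, lo: float = 4, hi: float = 10) -> float:
    """Exponential delay for a 1-based attempt number, clamped to [lo, hi]."""
    return max(lo, min(multiplier * 2 ** (attempt - 1), hi))


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception up to max_attempts times in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # mirrors reraise=True: surface the original error
            sleep(backoff_delay(attempt))
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable so the schedule can be inspected without waiting; under these assumptions, three attempts involve two waits of 4 seconds each.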
