Heuristic:Run llama Llama index Evaluator LLM Selection
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, LLMs |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
Best practice for selecting the evaluator LLM in LlamaIndex's evaluation pipeline: use a stronger model than the one being evaluated.
Description
LlamaIndex's evaluation framework (FaithfulnessEvaluator, RelevancyEvaluator, CorrectnessEvaluator) uses an LLM-as-judge pattern where a language model evaluates the quality of another model's outputs. The evaluator LLM is configured independently of the RAG pipeline's LLM, and the choice of evaluator model significantly affects evaluation reliability.
Usage
Apply this heuristic when setting up the evaluation pipeline, specifically when:
- Configuring `FaithfulnessEvaluator`, `RelevancyEvaluator`, or `CorrectnessEvaluator`
- Choosing the LLM for the evaluators you pass to `BatchEvalRunner`
- Comparing different RAG configurations
The Insight (Rule of Thumb)
- Action: Pass a stronger LLM to the evaluator than the one used in your RAG pipeline.
- Value: If your pipeline uses GPT-3.5, evaluate with GPT-4. If your pipeline uses GPT-4, evaluate with GPT-4o or a specialized evaluator.
- Default Temperature: LlamaIndex defines `DEFAULT_TEMPERATURE = 0.1`, which keeps evaluation outputs low-variance (near-deterministic, not strictly deterministic).
- Trade-off: Stronger evaluator models are more expensive per evaluation call, but produce more reliable quality assessments.
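The rule of thumb above can be sketched as a small lookup. This is an illustration only: the `MODEL_TIERS` table and the `pick_evaluator_model` helper are hypothetical names, and the ranking simply mirrors the GPT-3.5, GPT-4, GPT-4o examples given here. LlamaIndex does not ship such a ranking.

```python
from typing import Dict

# Hypothetical capability tiers, for illustration only. The ordering mirrors
# the examples in the rule of thumb; it is not defined by LlamaIndex.
MODEL_TIERS: Dict[str, int] = {
    "gpt-3.5-turbo": 1,
    "gpt-4": 2,
    "gpt-4o": 3,
}


def pick_evaluator_model(pipeline_model: str) -> str:
    """Return the lowest-tier listed model that is strictly stronger than the pipeline model."""
    if pipeline_model not in MODEL_TIERS:
        raise ValueError(f"unknown model: {pipeline_model!r}")
    tier = MODEL_TIERS[pipeline_model]
    stronger = [name for name, t in MODEL_TIERS.items() if t > tier]
    if not stronger:
        # Nothing stronger is listed: fall back to a specialized evaluator
        # or a manual choice, as the rule of thumb suggests.
        raise ValueError(f"no stronger model listed for {pipeline_model!r}")
    return min(stronger, key=MODEL_TIERS.__getitem__)


if __name__ == "__main__":
    for model in ("gpt-3.5-turbo", "gpt-4"):
        print(model, "->", pick_evaluator_model(model))
```

Picking the lowest tier that is still stronger keeps the cost side of the trade-off in check: the evaluator is never weaker than or equal to the pipeline model, but it is also not more expensive than it needs to be.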
Reasoning
Self-evaluation bias: If the same model generates and evaluates responses, it tends to rate its own outputs more favorably. A stronger model can better identify subtle factual errors, hallucinations, and relevancy issues.
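Evaluator independence: The LlamaIndex evaluator classes accept an independent `llm` parameter, explicitly decoupling the evaluator from the pipeline's global LLM. This design choice reflects the best practice of using separate models. Because the parameter defaults to `None`, omitting it leaves the choice of judge to the framework's default, so pass the evaluator model explicitly.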
Low temperature: The framework-wide default temperature of 0.1 reduces run-to-run variance in evaluation scores, making comparisons across runs more reliable.
Code evidence from `constants.py:3`:
DEFAULT_TEMPERATURE = 0.1
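To see why a temperature of 0.1 reduces variance, the sketch below applies the standard temperature-scaled softmax to a set of made-up logits. The logit values and the `softmax_with_temperature` helper are assumptions for illustration; how a given LLM provider applies temperature internally may differ in detail.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.5, 0.5]

p_low = softmax_with_temperature(logits, 0.1)   # the default discussed above
p_high = softmax_with_temperature(logits, 1.0)  # a typical "creative" setting

if __name__ == "__main__":
    print("T=0.1:", [round(p, 4) for p in p_low])
    print("T=1.0:", [round(p, 4) for p in p_high])
```

With these example logits, the top candidate receives roughly 0.99 of the probability mass at a temperature of 0.1, compared with roughly 0.55 at 1.0. Sampling is therefore still random at 0.1, but repeated evaluation runs rarely diverge.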
Evaluator LLM independence from `evaluation/faithfulness.py` and `evaluation/relevancy.py`:
class FaithfulnessEvaluator(BaseEvaluator):
    def __init__(
        self,
        llm: Optional[LLM] = None,  # Independent LLM parameter
        ...
    )
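A minimal sketch of using that independent `llm` parameter follows. It assumes a recent LlamaIndex release where the evaluators and `BatchEvalRunner` are importable from `llama_index.core.evaluation`; the helper names `build_evaluators` and `build_runner` and the dictionary keys are hypothetical, and the imports are deferred so the snippet loads even when LlamaIndex is not installed.

```python
from typing import Any, Dict


def build_evaluators(judge_llm: Any) -> Dict[str, Any]:
    """Create the three evaluators, all sharing one explicitly passed judge LLM."""
    # Deferred import (assumed path for recent LlamaIndex versions).
    from llama_index.core.evaluation import (
        CorrectnessEvaluator,
        FaithfulnessEvaluator,
        RelevancyEvaluator,
    )

    return {
        "faithfulness": FaithfulnessEvaluator(llm=judge_llm),
        "relevancy": RelevancyEvaluator(llm=judge_llm),
        "correctness": CorrectnessEvaluator(llm=judge_llm),
    }


def build_runner(judge_llm: Any, workers: int = 4) -> Any:
    """Wrap the evaluators in a BatchEvalRunner; the runner takes evaluators, not an LLM."""
    from llama_index.core.evaluation import BatchEvalRunner

    return BatchEvalRunner(build_evaluators(judge_llm), workers=workers)
```

With the OpenAI integration installed, the judge could for example be created as `OpenAI(model="gpt-4", temperature=0.1)` from `llama_index.llms.openai` while the RAG pipeline stays on a weaker model. Passing the judge explicitly to every evaluator avoids silently evaluating with whatever model the pipeline is configured to use.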
Retry pattern for reliable evaluation, from `evaluation/batch_runner.py:11-14`:
@retry(
    reraise=True,
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
)
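The behavior configured by this decorator can be approximated with the standard library alone. The sketch below is a hand-rolled stand-in, not the tenacity implementation: `backoff_wait` and `call_with_retry` are hypothetical helpers, and the wait formula (multiplier times a power of two, clamped between `min` and `max`) is an assumption about how `wait_exponential` computes its delay.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_wait(attempt: int, multiplier: float = 1, min_wait: float = 4, max_wait: float = 10) -> float:
    """Exponential delay after the given 1-based attempt, clamped to [min_wait, max_wait]."""
    raw = multiplier * 2 ** (attempt - 1)  # assumed formula, see lead-in
    return max(min_wait, min(raw, max_wait))


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn up to `attempts` times, re-raising the last error (like reraise=True)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the original exception
            sleep(backoff_wait(attempt))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

Under this assumed formula, three attempts leave room for only two waits, and both fall below the 4-second floor, so each wait is clamped up to 4 seconds. The 10-second ceiling would only come into play if the attempt limit were raised.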