Implementation:Run llama Llama index Evaluator Init
Overview
Evaluator_Init documents the initialization and usage of the three core LLM-as-judge evaluators in LlamaIndex: FaithfulnessEvaluator, RelevancyEvaluator, and CorrectnessEvaluator. Each evaluator inherits from BaseEvaluator and provides an aevaluate method that returns a standardized EvaluationResult.
Principle:Run_llama_Llama_index_Evaluator_Configuration
RAG Evaluation LLM-as-Judge LlamaIndex API
Source Files
- FaithfulnessEvaluator: llama-index-core/llama_index/core/evaluation/faithfulness.py, Lines 98–201
- RelevancyEvaluator: llama-index-core/llama_index/core/evaluation/relevancy.py, Lines 42–141
- CorrectnessEvaluator: llama-index-core/llama_index/core/evaluation/correctness.py, Lines 69–153
Import Statement
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)
FaithfulnessEvaluator
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| llm | Optional[LLM] | None | Judge LLM for evaluation; falls back to Settings.llm if not provided |
| raise_error | bool | False | If True, raises an exception when the response fails the faithfulness check instead of returning a non-passing result |
| eval_template | Optional[BasePromptTemplate] | None | Custom prompt template for the faithfulness check |
| refine_template | Optional[BasePromptTemplate] | None | Template for iterative refinement when the contexts do not fit in a single LLM call |
Evaluation Method
| Method | Parameters | Return Type |
|---|---|---|
| aevaluate | query (Optional[str]), response (Optional[str]), contexts (Optional[Sequence[str]]), **kwargs | EvaluationResult |
Example
from llama_index.core.evaluation import FaithfulnessEvaluator
from llama_index.llms.openai import OpenAI

# Initialize with a strong judge model
judge_llm = OpenAI(model="gpt-4", temperature=0.0)
faithfulness_evaluator = FaithfulnessEvaluator(llm=judge_llm)

# Evaluate a response against its source contexts
result = await faithfulness_evaluator.aevaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris, founded in 250 BC.",
    contexts=["Paris is the capital and largest city of France."],
)
print(f"Passing: {result.passing}")    # True or False
print(f"Feedback: {result.feedback}")  # Explanation from judge
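The example above uses top-level `await`, which works in a notebook or an async REPL but not in a plain script. A minimal sketch of driving an `aevaluate` coroutine from a script with `asyncio.run` follows. `StubEvaluator` and `StubResult` are hypothetical stand-ins (not LlamaIndex classes) so the snippet runs without an API key, and the substring-based pass rule is invented for the demo; with the real library you would pass a `FaithfulnessEvaluator` to `main` instead.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StubResult:
    """Hypothetical stand-in exposing only the fields used on this page."""

    passing: bool
    feedback: str


class StubEvaluator:
    """Hypothetical evaluator: passes only if the response appears verbatim in a context."""

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs,
    ) -> StubResult:
        supported = bool(response) and any(response in ctx for ctx in (contexts or []))
        return StubResult(passing=supported, feedback="YES" if supported else "NO")


async def main(evaluator) -> StubResult:
    # Any object with a compatible async `aevaluate` method can be used here.
    return await evaluator.aevaluate(
        query="What is the capital of France?",
        response="Paris is the capital",
        contexts=["Paris is the capital and largest city of France."],
    )


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
result = asyncio.run(main(StubEvaluator()))
print(f"Passing: {result.passing}")
print(f"Feedback: {result.feedback}")
```

Keeping the `await` calls inside one `async def main` means the same function body can be pasted into a notebook cell unchanged.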
RelevancyEvaluator
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| llm | Optional[LLM] | None | Judge LLM for evaluation; falls back to Settings.llm if not provided |
| raise_error | bool | False | If True, raises an exception when the response fails the relevancy check instead of returning a non-passing result |
| eval_template | Optional[BasePromptTemplate] | None | Custom prompt template for the relevancy check |
| refine_template | Optional[BasePromptTemplate] | None | Template for iterative refinement when the contexts do not fit in a single LLM call |
Evaluation Method
| Method | Parameters | Return Type |
|---|---|---|
| aevaluate | query (Optional[str]), response (Optional[str]), contexts (Optional[Sequence[str]]), **kwargs | EvaluationResult |
Example
from llama_index.core.evaluation import RelevancyEvaluator
from llama_index.llms.openai import OpenAI

judge_llm = OpenAI(model="gpt-4", temperature=0.0)
relevancy_evaluator = RelevancyEvaluator(llm=judge_llm)

# Check if context and response are relevant to the query
result = await relevancy_evaluator.aevaluate(
    query="How does photosynthesis work?",
    response="Photosynthesis converts sunlight into chemical energy.",
    contexts=[
        "Photosynthesis is the process by which plants convert "
        "light energy into chemical energy stored in glucose."
    ],
)
print(f"Passing: {result.passing}")
print(f"Feedback: {result.feedback}")
CorrectnessEvaluator
Constructor Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| llm | Optional[LLM] | None | Judge LLM for evaluation; falls back to Settings.llm if not provided |
| eval_template | Optional[BasePromptTemplate] | None | Custom prompt template for the correctness comparison |
| score_threshold | float | 4.0 | Minimum score (out of 5.0) for a response to be considered passing |
| parser_function | Optional[Callable] | None | Custom function to parse score from judge LLM output |
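As an illustration of the `parser_function` parameter, here is a minimal sketch of a parser that pulls a score and reasoning out of raw judge output. It assumes the judge puts the numeric score on the first line with the reasoning after it, and that the parser returns a `(score, reasoning)` tuple. Both the output format and the name `first_line_score_parser` are assumptions made for this example, so check them against the prompt template and library version in use.

```python
from typing import Optional, Tuple


def first_line_score_parser(raw_output: str) -> Tuple[Optional[float], Optional[str]]:
    """Hypothetical parser: first non-empty line is the score, the rest is reasoning."""
    lines = [line.strip() for line in raw_output.strip().splitlines()]
    lines = [line for line in lines if line]
    if not lines:
        # Nothing to parse at all.
        return None, None
    try:
        score = float(lines[0])
    except ValueError:
        # The judge did not lead with a number; keep the text but report no score.
        return None, raw_output.strip()
    reasoning = "\n".join(lines[1:]) or None
    return score, reasoning


score, reasoning = first_line_score_parser("4.5\nThe answer matches the reference.")
print(score)
print(reasoning)
```

Under those assumptions, such a function would be supplied as `CorrectnessEvaluator(parser_function=first_line_score_parser)`.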
Evaluation Method
| Method | Parameters | Return Type |
|---|---|---|
| aevaluate | query (Optional[str]), response (Optional[str]), contexts (Optional[Sequence[str]]), reference (Optional[str]), **kwargs | EvaluationResult |
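Note: Unlike the other two evaluators, CorrectnessEvaluator takes a reference argument (the ground-truth answer) to compare the response against. The argument is typed Optional[str] in the signature above, but the correctness score is only meaningful as a comparison when a reference is supplied.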
Example
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI

judge_llm = OpenAI(model="gpt-4", temperature=0.0)
correctness_evaluator = CorrectnessEvaluator(
    llm=judge_llm,
    score_threshold=4.0,  # Require score >= 4.0 to pass
)

# Compare response against a reference answer
result = await correctness_evaluator.aevaluate(
    query="What causes rain?",
    response="Rain is caused by water vapor condensing in clouds.",
    reference="Rain occurs when water vapor in the atmosphere condenses "
    "into water droplets within clouds, which then fall to "
    "the ground when they become heavy enough.",
)
print(f"Passing: {result.passing}")    # True if score >= 4.0
print(f"Score: {result.score}")        # Numeric score (1.0-5.0)
print(f"Feedback: {result.feedback}")  # Detailed comparison
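The comments in the example describe the pass rule as score >= score_threshold. A small sketch of that rule in isolation shows how the choice of threshold changes the pass rate over a batch of scores. `is_passing` and `pass_rate` are hypothetical helpers, not library functions; the scores are made up; and treating a missing score as not passing is an assumption of this sketch.

```python
from typing import Optional, Sequence


def is_passing(score: Optional[float], score_threshold: float = 4.0) -> bool:
    """Apply the threshold rule; a missing score counts as not passing (assumption)."""
    if score is None:
        return False
    return score >= score_threshold


def pass_rate(scores: Sequence[Optional[float]], score_threshold: float) -> float:
    """Fraction of scores that meet or exceed the threshold."""
    if not scores:
        return 0.0
    return sum(is_passing(s, score_threshold) for s in scores) / len(scores)


# Made-up judge scores for illustration; None stands for an unparseable judge reply.
scores = [3.5, 4.0, 4.5, 5.0, None]
for threshold in (3.0, 4.0, 4.5):
    print(f"threshold={threshold}: pass rate={pass_rate(scores, threshold):.2f}")
```

A stricter threshold trades recall of acceptable answers for fewer false passes, so it is worth checking a few values against hand-labeled examples before fixing one.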
Configuring All Three Evaluators Together
from llama_index.core.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
)
from llama_index.llms.openai import OpenAI

# Use a strong model as judge
judge_llm = OpenAI(model="gpt-4", temperature=0.0)

# Configure all three evaluators
evaluators = {
    "faithfulness": FaithfulnessEvaluator(llm=judge_llm),
    "relevancy": RelevancyEvaluator(llm=judge_llm),
    "correctness": CorrectnessEvaluator(
        llm=judge_llm,
        score_threshold=4.0,
    ),
}
# These can now be passed to BatchEvalRunner for parallel evaluation
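To make the idea of running a dictionary of evaluators in parallel concrete, here is a plain `asyncio` sketch. It does not use `BatchEvalRunner` and is not how that class is implemented; `FakeEvaluator` is a hypothetical stand-in with the same `aevaluate` shape so the snippet runs offline, and its fixed verdicts are invented for the demo.

```python
import asyncio
from typing import Dict


class FakeEvaluator:
    """Hypothetical stand-in: returns a fixed verdict after a short delay."""

    def __init__(self, verdict: bool) -> None:
        self._verdict = verdict

    async def aevaluate(self, query=None, response=None, contexts=None, **kwargs) -> bool:
        await asyncio.sleep(0.01)  # simulate the latency of a judge LLM call
        return self._verdict


async def run_all(evaluators: Dict[str, FakeEvaluator], **eval_kwargs) -> Dict[str, bool]:
    names = list(evaluators)
    # Schedule every evaluator's coroutine at once and wait for all of them.
    # gather returns results in the same order as the awaitables it was given.
    results = await asyncio.gather(
        *(evaluators[name].aevaluate(**eval_kwargs) for name in names)
    )
    return dict(zip(names, results))


fake_evaluators = {
    "faithfulness": FakeEvaluator(True),
    "relevancy": FakeEvaluator(True),
    "correctness": FakeEvaluator(False),
}
verdicts = asyncio.run(
    run_all(
        fake_evaluators,
        query="What causes rain?",
        response="Rain is caused by water vapor condensing in clouds.",
        contexts=[],
    )
)
print(verdicts)
```

Because the judge calls are I/O-bound, running them concurrently keeps total wall time close to the slowest single call rather than the sum of all calls.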
Using Evaluators with a Query Engine
from llama_index.core.evaluation import FaithfulnessEvaluator
from llama_index.llms.openai import OpenAI

judge_llm = OpenAI(model="gpt-4", temperature=0.0)
faithfulness_evaluator = FaithfulnessEvaluator(llm=judge_llm)

# Query the engine. query_engine is assumed to exist already,
# for example one created earlier with index.as_query_engine().
query = "What are the benefits of exercise?"
response = query_engine.query(query)

# Evaluate using the response object directly
# The evaluator extracts source_nodes as contexts
result = await faithfulness_evaluator.aevaluate_response(
    query=query,
    response=response,
)
print(f"Faithful: {result.passing}")
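The comment above says the evaluator takes its contexts from the response's source_nodes. A rough sketch of that extraction step follows, using hypothetical `FakeNode` and `FakeResponse` classes in place of the real response types. The attribute and method names (`source_nodes`, `get_content`, `response`) are assumptions modeled on the example and may differ between library versions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeNode:
    """Hypothetical stand-in for a retrieved source node."""

    text: str

    def get_content(self) -> str:
        return self.text


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a query engine response."""

    response: str
    source_nodes: List[FakeNode] = field(default_factory=list)


def unpack_response(resp: FakeResponse) -> Tuple[str, List[str]]:
    """Split a response object into the answer string and its context strings."""
    contexts = [node.get_content() for node in resp.source_nodes]
    return resp.response, contexts


resp = FakeResponse(
    response="Exercise improves cardiovascular health.",
    source_nodes=[
        FakeNode("Regular exercise strengthens the heart."),
        FakeNode("Physical activity lowers blood pressure."),
    ],
)
answer, contexts = unpack_response(resp)
print(answer)
print(contexts)
```

The answer and contexts produced this way correspond to the `response` and `contexts` arguments that `aevaluate` takes directly.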
Knowledge Sources
LlamaIndex Evaluation LlamaIndex Evaluator Modules
Environment:Run_llama_Llama_index_Python_LlamaIndex_Core Environment:Run_llama_Llama_index_OpenAI_API_Configuration Heuristic:Run_llama_Llama_index_Evaluator_LLM_Selection