Implementation:Run llama Llama index EvaluatorEvaluationDataset
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Datasets, Benchmarking |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
This module implements the evaluator evaluation dataset system for LlamaIndex, providing concrete classes for labelled evaluation data examples, evaluation predictions, and both standard and pairwise evaluator datasets used to benchmark evaluator quality.
Description
The evaluator_evaluation.py module builds on the base dataset abstractions to provide a complete framework for evaluating the quality of LlamaIndex evaluators themselves. It defines two evaluation paradigms: standard evaluation and pairwise evaluation.
Standard Evaluation Classes:
EvaluatorExamplePrediction extends BaseLlamaExamplePrediction and stores the output of running an evaluator on a single example. It includes feedback (the evaluator's textual feedback), score (a numeric score), invalid_prediction (a boolean flag), and invalid_reason (explanation if the prediction failed).
LabelledEvaluatorDataExample extends BaseLlamaDataExample and represents a single labelled evaluation example with rich metadata. It contains query, query_by (a CreatedBy instance), contexts (optional list of context strings), answer (the response to be evaluated), answer_by, ground_truth_answer, ground_truth_answer_by, reference_feedback, reference_score, and reference_evaluation_by. This structure allows comparing an evaluator's output against human or AI reference evaluations.
EvaluatorPredictionDataset extends BaseLlamaPredictionDataset and provides to_pandas for converting predictions into a DataFrame with feedback and score columns.
LabelledEvaluatorDataset extends BaseLlamaDataset[BaseEvaluator] and is the primary dataset class for standard evaluator benchmarking. It implements _predict_example and _apredict_example, which call the evaluator's evaluate and aevaluate methods respectively, passing the query, answer, contexts, and ground truth answer. Neither method lets an evaluator error propagate: if the call raises, the method returns an EvaluatorExamplePrediction with invalid_prediction=True, so one failing example does not abort the run. The to_pandas method converts all examples to a DataFrame with one column per example field.
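The error handling described above can be sketched with the standard library alone. This is a minimal illustration, not the module's code: SketchPrediction and predict_example are hypothetical stand-ins, the evaluator is modelled as a plain callable returning a dict, and the keyword names response and reference are assumptions about how the answer and ground truth answer are forwarded.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchPrediction:
    """Hypothetical stand-in mirroring the four EvaluatorExamplePrediction fields."""

    feedback: str = ""
    score: Optional[float] = None
    invalid_prediction: bool = False
    invalid_reason: Optional[str] = None


def predict_example(
    evaluate: Callable[..., dict],
    query: str,
    answer: str,
    contexts: Optional[List[str]] = None,
    ground_truth_answer: Optional[str] = None,
) -> SketchPrediction:
    """Run one evaluation; turn any exception into an invalid prediction."""
    try:
        # ASSUMPTION: keyword names are illustrative, not confirmed by this page.
        result = evaluate(
            query=query,
            response=answer,
            contexts=contexts,
            reference=ground_truth_answer,
        )
    except Exception as err:  # broad on purpose: one bad example must not stop the run
        return SketchPrediction(invalid_prediction=True, invalid_reason=str(err))
    return SketchPrediction(feedback=result["feedback"], score=result["score"])


def ok_evaluator(**kwargs) -> dict:
    """Toy evaluator that always succeeds."""
    return {"feedback": "matches the reference", "score": 1.0}


def failing_evaluator(**kwargs) -> dict:
    """Toy evaluator that always raises."""
    raise RuntimeError("rate limit exceeded")


good = predict_example(ok_evaluator, "What is LlamaIndex?", "A data framework.")
bad = predict_example(failing_evaluator, "What is LlamaIndex?", "A data framework.")
```

In this sketch a failed call still yields a prediction object, so the prediction list stays aligned one-to-one with the examples and failures can be filtered afterwards on invalid_prediction.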
Pairwise Evaluation Classes:
PairwiseEvaluatorExamplePrediction adds an evaluation_source field of type EvaluationSource to track whether the evaluation result came from the original or flipped ordering of responses.
LabelledPairwiseEvaluatorDataExample extends LabelledEvaluatorDataExample with second_answer and second_answer_by fields for the comparison response.
PairwiseEvaluatorPredictionDataset provides to_pandas with columns for feedback, score, and ordering.
LabelledPairwiseEvaluatorDataset implements prediction methods that pass both response and second_response to the evaluator's evaluate method, and captures the pairwise_source from the evaluation result.
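As a rough illustration of the feedback, score, and ordering columns mentioned above, the sketch below builds the same three columns as a plain dict. SketchSource, SketchPairwisePrediction, and to_columns are hypothetical names, and the enum member values are assumptions rather than a statement of what EvaluationSource defines.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class SketchSource(str, Enum):
    """Hypothetical stand-in for EvaluationSource; member values are assumed."""

    ORIGINAL = "original"
    FLIPPED = "flipped"
    NEITHER = "neither"


@dataclass
class SketchPairwisePrediction:
    """Hypothetical stand-in for a pairwise prediction."""

    feedback: str = ""
    score: Optional[float] = None
    evaluation_source: Optional[SketchSource] = None


def to_columns(predictions: List[SketchPairwisePrediction]) -> Dict[str, list]:
    """Build a column mapping that a DataFrame constructor could consume."""
    return {
        "feedback": [p.feedback for p in predictions],
        "score": [p.score for p in predictions],
        # Store the plain string so the column is easy to group or filter on.
        "ordering": [
            p.evaluation_source.value if p.evaluation_source else None
            for p in predictions
        ],
    }


columns = to_columns(
    [
        SketchPairwisePrediction("A is better", 1.0, SketchSource.ORIGINAL),
        SketchPairwisePrediction("B is better", 0.0, SketchSource.FLIPPED),
        SketchPairwisePrediction(),  # e.g. a prediction with no recorded source
    ]
)
```

Keeping the ordering next to each score makes it possible to check whether an evaluator's verdicts depend on the order in which the two responses were presented.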
The module also provides American English aliases: LabeledEvaluatorDataExample, LabeledEvaluatorDataset, LabeledPairwiseEvaluatorDataExample, and LabeledPairwiseEvaluatorDataset.
Usage
Use this module to benchmark evaluator quality by comparing evaluator outputs against reference evaluations. Use LabelledEvaluatorDataset for standard single-response evaluation and LabelledPairwiseEvaluatorDataset for comparing two responses. Load datasets from JSON, run evaluators against them with make_predictions_with or amake_predictions_with, and analyze results with to_pandas.
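One way to compare evaluator outputs with reference evaluations is to compute simple agreement statistics over the two score lists. The following is a sketch under stated assumptions: summarize_agreement is a hypothetical helper, not part of the module, and it expects the reference scores and predicted scores to have already been extracted into plain lists, with None marking an invalid prediction.

```python
import math
from typing import Dict, List, Optional


def summarize_agreement(
    reference_scores: List[float],
    predicted_scores: List[Optional[float]],
    tolerance: float = 1e-9,
) -> Dict[str, float]:
    """Compare predicted scores with reference scores, skipping invalid (None) ones."""
    if len(reference_scores) != len(predicted_scores):
        raise ValueError("score lists must have the same length")

    total = len(reference_scores)
    pairs = [
        (ref, pred)
        for ref, pred in zip(reference_scores, predicted_scores)
        if pred is not None
    ]
    invalid_rate = (total - len(pairs)) / total if total else 0.0

    if not pairs:
        # Nothing to compare: report NaN rather than a misleading zero.
        return {
            "invalid_rate": invalid_rate,
            "mean_abs_error": math.nan,
            "exact_match_rate": math.nan,
        }

    mean_abs_error = sum(abs(ref - pred) for ref, pred in pairs) / len(pairs)
    exact_match_rate = sum(
        1 for ref, pred in pairs if abs(ref - pred) <= tolerance
    ) / len(pairs)
    return {
        "invalid_rate": invalid_rate,
        "mean_abs_error": mean_abs_error,
        "exact_match_rate": exact_match_rate,
    }


summary = summarize_agreement(
    reference_scores=[1.0, 0.0, 0.5, 1.0],
    predicted_scores=[1.0, 0.5, None, 0.0],
)
```

Exact-match rate suits discrete pairwise scores, while mean absolute error is more informative for graded scores; which statistic matters depends on the evaluator being benchmarked.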
Code Reference
Source Location
- Repository: Run_llama_Llama_index
- File: llama-index-core/llama_index/core/llama_dataset/evaluator_evaluation.py
- Lines: 1-499
Signature
class EvaluatorExamplePrediction(BaseLlamaExamplePrediction):
    feedback: str = Field(default_factory=str)
    score: Optional[float] = Field(default=None)
    invalid_prediction: bool = Field(default=False)
    invalid_reason: Optional[str] = Field(default=None)

class LabelledEvaluatorDataExample(BaseLlamaDataExample):
    query: str = Field(default_factory=str)
    query_by: Optional[CreatedBy] = Field(default=None)
    contexts: Optional[List[str]] = Field(default=None)
    answer: str = Field(default_factory=str)
    answer_by: Optional[CreatedBy] = Field(default=None)
    ground_truth_answer: Optional[str] = Field(default=None)
    ground_truth_answer_by: Optional[CreatedBy] = Field(default=None)
    reference_feedback: Optional[str] = Field(default=None)
    reference_score: float = Field(default_factory=float)
    reference_evaluation_by: Optional[CreatedBy] = Field(default=None)

class EvaluatorPredictionDataset(BaseLlamaPredictionDataset):
    _prediction_type = EvaluatorExamplePrediction

class LabelledEvaluatorDataset(BaseLlamaDataset[BaseEvaluator]):
    _example_type = LabelledEvaluatorDataExample

class PairwiseEvaluatorExamplePrediction(BaseLlamaExamplePrediction):
    feedback: str = Field(default_factory=str)
    score: Optional[float] = Field(default=None)
    evaluation_source: Optional[EvaluationSource] = Field(default=None)
    invalid_prediction: bool = Field(default=False)
    invalid_reason: Optional[str] = Field(default=None)

class LabelledPairwiseEvaluatorDataExample(LabelledEvaluatorDataExample):
    second_answer: str = Field(default_factory=str)
    second_answer_by: Optional[CreatedBy] = Field(default=None)

class PairwiseEvaluatorPredictionDataset(BaseLlamaPredictionDataset):
    _prediction_type = PairwiseEvaluatorExamplePrediction

class LabelledPairwiseEvaluatorDataset(BaseLlamaDataset[BaseEvaluator]):
    _example_type = LabelledPairwiseEvaluatorDataExample
Import
from llama_index.core.llama_dataset.evaluator_evaluation import (
    EvaluatorExamplePrediction,
    LabelledEvaluatorDataExample,
    EvaluatorPredictionDataset,
    LabelledEvaluatorDataset,
    PairwiseEvaluatorExamplePrediction,
    LabelledPairwiseEvaluatorDataExample,
    PairwiseEvaluatorPredictionDataset,
    LabelledPairwiseEvaluatorDataset,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| query | str | Expected (defaults to an empty string) | The user query for the evaluation example |
| answer | str | Expected (defaults to an empty string) | The response to be evaluated |
| contexts | Optional[List[str]] | No | Context strings used to generate the answer |
| ground_truth_answer | Optional[str] | No | The reference ground truth answer for comparison |
| reference_feedback | Optional[str] | No | The reference evaluator feedback (ground truth) |
| reference_score | float | Expected (defaults to 0.0) | The reference evaluator score (ground truth) |
| second_answer | str | Expected for pairwise (defaults to an empty string) | The second response for pairwise comparison |
| predictor | BaseEvaluator | Yes (for make_predictions_with) | The evaluator to benchmark |
| sleep_time_in_seconds | int | No | Delay between predictions to avoid rate limits |
Outputs
| Name | Type | Description |
|---|---|---|
| return (make_predictions_with) | EvaluatorPredictionDataset or PairwiseEvaluatorPredictionDataset | Dataset of evaluator predictions |
| feedback | str | The evaluator's textual feedback for an example |
| score | Optional[float] | The evaluator's numeric score for an example |
| invalid_prediction | bool | Whether the prediction encountered an error |
| invalid_reason | Optional[str] | Explanation of why a prediction is invalid |
| evaluation_source | Optional[EvaluationSource] | Whether the pairwise result came from original or flipped order |
| return (to_pandas) | pandas.DataFrame | DataFrame with evaluation results |
Usage Examples
Basic Usage
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.llama_dataset.base import CreatedBy, CreatedByType
from llama_index.core.llama_dataset.evaluator_evaluation import (
    LabelledEvaluatorDataExample,
    LabelledEvaluatorDataset,
)

# Create a labelled evaluation example
example = LabelledEvaluatorDataExample(
    query="What is LlamaIndex?",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    answer="LlamaIndex is a data framework for LLMs.",
    answer_by=CreatedBy(type=CreatedByType.AI, model_name="gpt-4"),
    ground_truth_answer="LlamaIndex is a data framework for building LLM applications.",
    reference_feedback="The answer is mostly correct but incomplete.",
    reference_score=0.8,
    reference_evaluation_by=CreatedBy(type=CreatedByType.HUMAN),
)

# Wrap the example in a dataset
dataset = LabelledEvaluatorDataset(examples=[example])

# Run an evaluator over the dataset. CorrectnessEvaluator calls an LLM,
# so one must be configured (for example through Settings.llm).
evaluator = CorrectnessEvaluator()
predictions = dataset.make_predictions_with(evaluator, show_progress=True)

# predictions is an EvaluatorPredictionDataset; inspect it as a DataFrame
df = predictions.to_pandas()
print(df[["feedback", "score"]])
Pairwise Evaluation
from llama_index.core.llama_dataset.evaluator_evaluation import (
    LabelledPairwiseEvaluatorDataExample,
    LabelledPairwiseEvaluatorDataset,
)

# Create a pairwise evaluation example
pairwise_example = LabelledPairwiseEvaluatorDataExample(
    query="What is LlamaIndex?",
    answer="LlamaIndex is a data framework.",
    second_answer="LlamaIndex is a Python library for building LLM applications.",
    ground_truth_answer="LlamaIndex is a data framework for building LLM apps.",
    reference_score=0.0,
    reference_feedback="The second answer is more complete.",
)
pairwise_dataset = LabelledPairwiseEvaluatorDataset(examples=[pairwise_example])

# Save the dataset to disk and load it back
pairwise_dataset.save_json("pairwise_eval.json")
loaded = LabelledPairwiseEvaluatorDataset.from_json("pairwise_eval.json")
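Before loading a saved file it can be useful to sanity-check it with the standard library. The sketch below assumes the file is a JSON object with a top-level "examples" list holding one object per example; that layout is an assumption, not something this page confirms, and load_examples is a hypothetical helper. The sample data is written to a temporary directory so the snippet does not depend on an existing file.

```python
import json
import tempfile
from pathlib import Path

# ASSUMPTION: on-disk layout is {"examples": [{...}, ...]}; adjust if the real file differs.
sample = {
    "examples": [
        {
            "query": "What is LlamaIndex?",
            "answer": "LlamaIndex is a data framework.",
            "second_answer": "LlamaIndex is a Python library for building LLM applications.",
            "reference_score": 0.0,
            "reference_feedback": "The second answer is more complete.",
        }
    ]
}


def load_examples(path: Path) -> list:
    """Return the list stored under the assumed top-level "examples" key."""
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    return data["examples"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pairwise_eval.json"
    path.write_text(json.dumps(sample, indent=4), encoding="utf-8")
    examples = load_examples(path)

# Flag pairwise examples that lack a second response to compare against.
missing_second = [i for i, ex in enumerate(examples) if not ex.get("second_answer")]
```

A check like this catches incomplete pairwise records early, before any evaluator calls are made.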