Implementation:Run llama Llama index EvaluatorEvaluationDataset

Knowledge Sources	Run_llama_Llama_index
Domains	Evaluation, Datasets, Benchmarking
Last Updated	2026-02-11 19:00 GMT

Overview

This module implements the evaluator evaluation dataset system for LlamaIndex, providing concrete classes for labelled evaluation data examples, evaluation predictions, and both standard and pairwise evaluator datasets used to benchmark evaluator quality.

Description

The evaluator_evaluation.py module builds on the base dataset abstractions to provide a complete framework for evaluating the quality of LlamaIndex evaluators themselves. It defines two evaluation paradigms: standard evaluation and pairwise evaluation.

Standard Evaluation Classes:

EvaluatorExamplePrediction extends BaseLlamaExamplePrediction and stores the output of running an evaluator on a single example. It includes feedback (the evaluator's textual feedback), score (a numeric score), invalid_prediction (a boolean flag), and invalid_reason (explanation if the prediction failed).

LabelledEvaluatorDataExample extends BaseLlamaDataExample and represents a single labelled evaluation example with rich metadata. It contains query, query_by (a CreatedBy instance), contexts (optional list of context strings), answer (the response to be evaluated), answer_by, ground_truth_answer, ground_truth_answer_by, reference_feedback, reference_score, and reference_evaluation_by. This structure allows comparing an evaluator's output against human or AI reference evaluations.

EvaluatorPredictionDataset extends BaseLlamaPredictionDataset and provides to_pandas for converting predictions into a DataFrame with feedback and score columns.

LabelledEvaluatorDataset extends BaseLlamaDataset[BaseEvaluator] and is the primary dataset class for standard evaluator benchmarking. It implements _predict_example and _apredict_example, which call the evaluator's evaluate and aevaluate methods respectively, passing the query, answer, contexts, and ground truth answer. Both methods catch exceptions raised by the evaluator and return an EvaluatorExamplePrediction with invalid_prediction=True instead of aborting the run, so one failed example does not stop the remaining predictions. The to_pandas method converts all examples to a DataFrame with one column per metadata field.
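The failure handling described above can be sketched without the library. This is a minimal illustration, not the module's actual code: SketchPrediction is a stand-in dataclass mirroring the four fields of EvaluatorExamplePrediction, flaky_evaluate is a hypothetical evaluator, and the keyword name reference and the wording of invalid_reason are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchPrediction:
    """Stand-in for EvaluatorExamplePrediction (same four fields)."""

    feedback: str = ""
    score: Optional[float] = None
    invalid_prediction: bool = False
    invalid_reason: Optional[str] = None


def predict_example(
    evaluate: Callable[..., dict],
    query: str,
    answer: str,
    contexts: Optional[List[str]] = None,
    ground_truth_answer: Optional[str] = None,
) -> SketchPrediction:
    """Run one evaluation; turn any exception into an invalid prediction."""
    try:
        result = evaluate(
            query=query,
            response=answer,
            contexts=contexts,
            reference=ground_truth_answer,  # assumed keyword name
        )
    except Exception as err:
        # Record the failure instead of propagating it to the caller.
        return SketchPrediction(invalid_prediction=True, invalid_reason=str(err))
    return SketchPrediction(feedback=result["feedback"], score=result["score"])


def flaky_evaluate(**kwargs) -> dict:
    """Hypothetical evaluator that fails when no reference is supplied."""
    if kwargs.get("reference") is None:
        raise ValueError("reference answer is required")
    return {"feedback": "Matches the reference.", "score": 1.0}


ok = predict_example(flaky_evaluate, "q", "a", ground_truth_answer="a")
bad = predict_example(flaky_evaluate, "q", "a")
print(ok)
print(bad)
```

Because failures are stored as ordinary predictions, downstream analysis can filter on invalid_prediction rather than wrapping the whole run in its own error handling.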

Pairwise Evaluation Classes:

PairwiseEvaluatorExamplePrediction extends BaseLlamaExamplePrediction and carries the same fields as EvaluatorExamplePrediction plus an evaluation_source field of type EvaluationSource, which tracks whether the evaluation result came from the original or flipped ordering of responses.

LabelledPairwiseEvaluatorDataExample extends LabelledEvaluatorDataExample with second_answer and second_answer_by fields for the comparison response.

PairwiseEvaluatorPredictionDataset provides to_pandas with columns for feedback, score, and ordering.

LabelledPairwiseEvaluatorDataset implements prediction methods that pass both response and second_response to the evaluator's evaluate method, and captures the pairwise_source from the evaluation result.
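To illustrate why the ordering is recorded alongside the score, here is a small library-free sketch. Everything in it is a stand-in: SketchSource imitates EvaluationSource (its member names are assumptions), prefer_longer is a hypothetical judge, and the convention that 1.0 means the first answer wins is assumed for the example. The final dictionary mirrors the feedback, score, and ordering columns mentioned above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class SketchSource(str, Enum):
    """Stand-in for EvaluationSource; member names are assumptions."""

    ORIGINAL = "original"
    FLIPPED = "flipped"


@dataclass
class SketchPairwisePrediction:
    feedback: str = ""
    score: Optional[float] = None
    evaluation_source: Optional[SketchSource] = None


def prefer_longer(response: str, second_response: str) -> float:
    """Hypothetical judge: 1.0 if the first response is longer, else 0.0."""
    return 1.0 if len(response) > len(second_response) else 0.0


def predict_pairwise(
    judge: Callable[..., float],
    answer: str,
    second_answer: str,
    flipped: bool = False,
) -> SketchPairwisePrediction:
    """Judge two answers and record which presentation order was used."""
    if flipped:
        raw = judge(response=second_answer, second_response=answer)
        return SketchPairwisePrediction(
            feedback="judged in flipped order",
            score=1.0 - raw,  # re-express relative to the original order
            evaluation_source=SketchSource.FLIPPED,
        )
    raw = judge(response=answer, second_response=second_answer)
    return SketchPairwisePrediction(
        feedback="judged in original order",
        score=raw,
        evaluation_source=SketchSource.ORIGINAL,
    )


answer = "LlamaIndex is a data framework."
second_answer = "LlamaIndex is a Python library for building LLM applications."
predictions = [
    predict_pairwise(prefer_longer, answer, second_answer),
    predict_pairwise(prefer_longer, answer, second_answer, flipped=True),
]

# Column layout analogous to the pairwise prediction table.
columns = {
    "feedback": [p.feedback for p in predictions],
    "score": [p.score for p in predictions],
    "ordering": [p.evaluation_source.value for p in predictions],
}
print(columns)
```

Keeping the ordering next to each score makes it possible to check afterwards whether a judge's verdicts depend on the position of the answers.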

The module also provides American English aliases: LabeledEvaluatorDataExample, LabeledEvaluatorDataset, LabeledPairwiseEvaluatorDataExample, and LabeledPairwiseEvaluatorDataset.

Usage

Use this module to benchmark evaluator quality by comparing evaluator outputs against reference evaluations. Use LabelledEvaluatorDataset for standard single-response evaluation and LabelledPairwiseEvaluatorDataset for comparing two responses. Load datasets from JSON, run evaluators against them with make_predictions_with or amake_predictions_with, and analyze results with to_pandas.
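One possible way to compare evaluator outputs against reference evaluations is sketched below using only the standard library. The rows are made-up illustrative values, not results from any real evaluator, and agreement_summary is a hypothetical helper. In practice the same three values per example would come from reference_score on the dataset and from score and invalid_prediction on the predictions.

```python
from typing import Dict, List, Optional, Tuple

# (reference_score, predicted_score, invalid_prediction) per example.
# Illustrative values only.
Row = Tuple[float, Optional[float], bool]
rows: List[Row] = [
    (1.0, 1.0, False),
    (0.5, 1.0, False),
    (0.0, None, True),  # the evaluator failed on this example
    (0.0, 0.0, False),
]


def agreement_summary(rows: List[Row]) -> Dict[str, Optional[float]]:
    """Summarise how closely predicted scores track reference scores."""
    # Invalid predictions carry no usable score, so exclude them first.
    valid = [
        (ref, pred)
        for ref, pred, invalid in rows
        if not invalid and pred is not None
    ]
    summary: Dict[str, Optional[float]] = {
        "total": float(len(rows)),
        "invalid": float(len(rows) - len(valid)),
        "mean_abs_error": None,
        "exact_match_rate": None,
    }
    if valid:
        summary["mean_abs_error"] = sum(abs(r - p) for r, p in valid) / len(valid)
        summary["exact_match_rate"] = sum(1 for r, p in valid if r == p) / len(valid)
    return summary


summary = agreement_summary(rows)
print(summary)
```

Reporting the invalid count separately keeps failed predictions from silently inflating or deflating the agreement metrics.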

Code Reference

Source Location

Repository: Run_llama_Llama_index
File: llama-index-core/llama_index/core/llama_dataset/evaluator_evaluation.py
Lines: 1-499

Signature

class EvaluatorExamplePrediction(BaseLlamaExamplePrediction):
    feedback: str = Field(default_factory=str)
    score: Optional[float] = Field(default=None)
    invalid_prediction: bool = Field(default=False)
    invalid_reason: Optional[str] = Field(default=None)

class LabelledEvaluatorDataExample(BaseLlamaDataExample):
    query: str = Field(default_factory=str)
    query_by: Optional[CreatedBy] = Field(default=None)
    contexts: Optional[List[str]] = Field(default=None)
    answer: str = Field(default_factory=str)
    answer_by: Optional[CreatedBy] = Field(default=None)
    ground_truth_answer: Optional[str] = Field(default=None)
    ground_truth_answer_by: Optional[CreatedBy] = Field(default=None)
    reference_feedback: Optional[str] = Field(default=None)
    reference_score: float = Field(default_factory=float)
    reference_evaluation_by: Optional[CreatedBy] = Field(default=None)

class EvaluatorPredictionDataset(BaseLlamaPredictionDataset):
    _prediction_type = EvaluatorExamplePrediction

class LabelledEvaluatorDataset(BaseLlamaDataset[BaseEvaluator]):
    _example_type = LabelledEvaluatorDataExample

class PairwiseEvaluatorExamplePrediction(BaseLlamaExamplePrediction):
    feedback: str = Field(default_factory=str)
    score: Optional[float] = Field(default=None)
    evaluation_source: Optional[EvaluationSource] = Field(default=None)
    invalid_prediction: bool = Field(default=False)
    invalid_reason: Optional[str] = Field(default=None)

class LabelledPairwiseEvaluatorDataExample(LabelledEvaluatorDataExample):
    second_answer: str = Field(default_factory=str)
    second_answer_by: Optional[CreatedBy] = Field(default=None)

class PairwiseEvaluatorPredictionDataset(BaseLlamaPredictionDataset):
    _prediction_type = PairwiseEvaluatorExamplePrediction

class LabelledPairwiseEvaluatorDataset(BaseLlamaDataset[BaseEvaluator]):
    _example_type = LabelledPairwiseEvaluatorDataExample

Import

from llama_index.core.llama_dataset.evaluator_evaluation import (
    EvaluatorExamplePrediction,
    LabelledEvaluatorDataExample,
    EvaluatorPredictionDataset,
    LabelledEvaluatorDataset,
    PairwiseEvaluatorExamplePrediction,
    LabelledPairwiseEvaluatorDataExample,
    PairwiseEvaluatorPredictionDataset,
    LabelledPairwiseEvaluatorDataset,
)

I/O Contract

Inputs

Name	Type	Required	Description
query	str	Expected (defaults to "")	The user query for the evaluation example
answer	str	Expected (defaults to "")	The response to be evaluated
contexts	Optional[List[str]]	No	Context strings used to generate the answer
ground_truth_answer	Optional[str]	No	The reference ground truth answer for comparison
reference_feedback	Optional[str]	No	The reference evaluator feedback (ground truth)
reference_score	float	Expected (defaults to 0.0)	The reference evaluator score (ground truth)
second_answer	str	Expected for pairwise (defaults to "")	The second response for pairwise comparison
predictor	BaseEvaluator	Yes (for make_predictions_with)	The evaluator to benchmark
sleep_time_in_seconds	int	No	Delay between predictions to avoid rate limits

Outputs

Name	Type	Description
return (make_predictions_with)	EvaluatorPredictionDataset or PairwiseEvaluatorPredictionDataset	Dataset of evaluator predictions
feedback	str	The evaluator's textual feedback for an example
score	Optional[float]	The evaluator's numeric score for an example
invalid_prediction	bool	Whether the prediction encountered an error
invalid_reason	Optional[str]	Explanation of why a prediction is invalid
evaluation_source	Optional[EvaluationSource]	Whether the pairwise result came from original or flipped order
return (to_pandas)	pandas.DataFrame	DataFrame with evaluation results

Usage Examples

Basic Usage

from llama_index.core.llama_dataset.evaluator_evaluation import (
    LabelledEvaluatorDataExample,
    LabelledEvaluatorDataset,
    EvaluatorPredictionDataset,
)
from llama_index.core.llama_dataset.base import CreatedBy, CreatedByType

# Create a labelled evaluation example
example = LabelledEvaluatorDataExample(
    query="What is LlamaIndex?",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    answer="LlamaIndex is a data framework for LLMs.",
    answer_by=CreatedBy(type=CreatedByType.AI, model_name="gpt-4"),
    ground_truth_answer="LlamaIndex is a data framework for building LLM applications.",
    reference_feedback="The answer is mostly correct but incomplete.",
    reference_score=0.8,
    reference_evaluation_by=CreatedBy(type=CreatedByType.HUMAN),
)

# Create a dataset and generate predictions
dataset = LabelledEvaluatorDataset(examples=[example])

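# Use with an evaluator. CorrectnessEvaluator calls an LLM; with no llm argument
# it falls back to the globally configured default (Settings.llm).
from llama_index.core.evaluation import CorrectnessEvaluator
evaluator = CorrectnessEvaluator()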

predictions = dataset.make_predictions_with(evaluator, show_progress=True)
df = predictions.to_pandas()
print(df[["feedback", "score"]])

Pairwise Evaluation

from llama_index.core.llama_dataset.evaluator_evaluation import (
    LabelledPairwiseEvaluatorDataExample,
    LabelledPairwiseEvaluatorDataset,
)
from llama_index.core.llama_dataset.base import CreatedBy, CreatedByType

# Create a pairwise evaluation example
pairwise_example = LabelledPairwiseEvaluatorDataExample(
    query="What is LlamaIndex?",
    answer="LlamaIndex is a data framework.",
    second_answer="LlamaIndex is a Python library for building LLM applications.",
    ground_truth_answer="LlamaIndex is a data framework for building LLM apps.",
    reference_score=0.0,
    reference_feedback="The second answer is more complete.",
)

pairwise_dataset = LabelledPairwiseEvaluatorDataset(examples=[pairwise_example])

# Save and load
pairwise_dataset.save_json("pairwise_eval.json")
loaded = LabelledPairwiseEvaluatorDataset.from_json("pairwise_eval.json")

Related Pages

Environment:Run_llama_Llama_index_Python_LlamaIndex_Core
