Implementation:Run llama Llama index EvaluatorEvaluationDataset

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Datasets, Benchmarking
Last Updated: 2026-02-11 19:00 GMT

Overview

This module implements the evaluator evaluation dataset system for LlamaIndex, providing concrete classes for labelled evaluation data examples, evaluation predictions, and both standard and pairwise evaluator datasets used to benchmark evaluator quality.

Description

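The evaluator_evaluation.py module builds on the base dataset abstractions (BaseLlamaDataExample, BaseLlamaExamplePrediction, BaseLlamaPredictionDataset, and BaseLlamaDataset) to benchmark LlamaIndex evaluators themselves, that is, to measure how closely an evaluator's feedback and scores match reference evaluations. It defines two evaluation paradigms: standard (single-response) evaluation and pairwise evaluation.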

Standard Evaluation Classes:

EvaluatorExamplePrediction extends BaseLlamaExamplePrediction and stores the output of running an evaluator on a single example. It includes feedback (the evaluator's textual feedback), score (a numeric score), invalid_prediction (a boolean flag), and invalid_reason (explanation if the prediction failed).

LabelledEvaluatorDataExample extends BaseLlamaDataExample and represents a single labelled evaluation example with rich metadata. It contains query, query_by (a CreatedBy instance), contexts (optional list of context strings), answer (the response to be evaluated), answer_by, ground_truth_answer, ground_truth_answer_by, reference_feedback, reference_score, and reference_evaluation_by. This structure allows comparing an evaluator's output against human or AI reference evaluations.

EvaluatorPredictionDataset extends BaseLlamaPredictionDataset and provides to_pandas for converting predictions into a DataFrame with feedback and score columns.

LabelledEvaluatorDataset extends BaseLlamaDataset[BaseEvaluator] and is the primary dataset class for standard evaluator benchmarking. It implements _predict_example and _apredict_example, which call the evaluator's evaluate and aevaluate methods respectively, passing the query, answer, contexts, and ground truth answer. If the evaluator call raises, both methods catch the exception and return an EvaluatorExamplePrediction with invalid_prediction=True instead of propagating the error. The to_pandas method converts all examples to a DataFrame with columns for the example fields and their metadata.
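The error-handling pattern described for _predict_example can be sketched with standard-library stand-ins. This is an illustrative sketch under stated assumptions, not the library's code: StubResult, StubPrediction, StubExample, predict_example, and the two toy evaluators are hypothetical names, and the invalid_result flag on the result object and the keyword names passed to evaluate are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StubResult:
    """Hypothetical stand-in for an evaluator's result object."""
    feedback: str = ""
    score: Optional[float] = None
    invalid_result: bool = False  # assumed flag, not taken from the text above
    invalid_reason: Optional[str] = None


@dataclass
class StubPrediction:
    """Hypothetical stand-in mirroring EvaluatorExamplePrediction's fields."""
    feedback: str = ""
    score: Optional[float] = None
    invalid_prediction: bool = False
    invalid_reason: Optional[str] = None


@dataclass
class StubExample:
    """Hypothetical stand-in holding the fields passed to the evaluator."""
    query: str
    answer: str
    contexts: Optional[List[str]] = None
    ground_truth_answer: Optional[str] = None


def predict_example(evaluator, example: StubExample) -> StubPrediction:
    """Run one example; turn any failure into an invalid prediction."""
    try:
        result = evaluator.evaluate(
            query=example.query,
            response=example.answer,
            contexts=example.contexts,
            reference=example.ground_truth_answer,
        )
    except Exception as err:
        # The error is recorded on the prediction instead of being re-raised.
        return StubPrediction(invalid_prediction=True, invalid_reason=str(err))
    if result.invalid_result:
        return StubPrediction(
            invalid_prediction=True, invalid_reason=result.invalid_reason
        )
    return StubPrediction(feedback=result.feedback, score=result.score)


class GoodEvaluator:
    """Toy evaluator that always succeeds."""
    def evaluate(self, **kwargs) -> StubResult:
        return StubResult(feedback="Looks correct.", score=4.0)


class BrokenEvaluator:
    """Toy evaluator that always fails."""
    def evaluate(self, **kwargs) -> StubResult:
        raise RuntimeError("rate limit exceeded")


example = StubExample(query="What is LlamaIndex?", answer="A data framework for LLMs.")
ok = predict_example(GoodEvaluator(), example)
failed = predict_example(BrokenEvaluator(), example)
print(ok.score, ok.invalid_prediction)
print(failed.invalid_prediction, failed.invalid_reason)
```

Recording the failure on the prediction means a single failing example shows up as one flagged row in the results rather than stopping the whole benchmark run.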

Pairwise Evaluation Classes:

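PairwiseEvaluatorExamplePrediction extends BaseLlamaExamplePrediction with the same fields as EvaluatorExamplePrediction (feedback, score, invalid_prediction, invalid_reason) plus an evaluation_source field of type EvaluationSource, which tracks whether the evaluation result came from the original or the flipped ordering of responses.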

LabelledPairwiseEvaluatorDataExample extends LabelledEvaluatorDataExample with second_answer and second_answer_by fields for the comparison response.

PairwiseEvaluatorPredictionDataset provides to_pandas with columns for feedback, score, and ordering.

LabelledPairwiseEvaluatorDataset implements prediction methods that pass both response and second_response to the evaluator's evaluate method, and stores the pairwise_source from the evaluation result in the prediction's evaluation_source field.
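The pairwise flow described above can be sketched with standard-library stand-ins. This is an illustration under stated assumptions, not the library's code: Source, PairwiseResult, PairwisePrediction, LongerAnswerJudge, and predict_pairwise are hypothetical names, and the toy judge's scoring convention is an assumption chosen only for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(str, Enum):
    """Hypothetical stand-in for EvaluationSource."""
    ORIGINAL = "original"
    FLIPPED = "flipped"


@dataclass
class PairwiseResult:
    """Hypothetical stand-in for the evaluator's result object."""
    feedback: str
    score: float
    pairwise_source: Source


@dataclass
class PairwisePrediction:
    """Mirrors the fields of PairwiseEvaluatorExamplePrediction."""
    feedback: str = ""
    score: Optional[float] = None
    evaluation_source: Optional[Source] = None
    invalid_prediction: bool = False
    invalid_reason: Optional[str] = None


class LongerAnswerJudge:
    """Toy evaluator: the longer answer wins.

    Toy convention (an assumption): 1.0 means the first answer wins,
    0.0 means the second answer wins, 0.5 means a tie.
    """

    def evaluate(self, query: str, response: str, second_response: str) -> PairwiseResult:
        if len(response) > len(second_response):
            score = 1.0
        elif len(response) < len(second_response):
            score = 0.0
        else:
            score = 0.5
        return PairwiseResult(
            feedback=f"Compared lengths {len(response)} vs {len(second_response)}.",
            score=score,
            pairwise_source=Source.ORIGINAL,
        )


def predict_pairwise(evaluator, query: str, answer: str, second_answer: str) -> PairwisePrediction:
    """Pass both answers to the evaluator and copy pairwise_source across."""
    try:
        result = evaluator.evaluate(
            query=query, response=answer, second_response=second_answer
        )
    except Exception as err:
        return PairwisePrediction(invalid_prediction=True, invalid_reason=str(err))
    return PairwisePrediction(
        feedback=result.feedback,
        score=result.score,
        evaluation_source=result.pairwise_source,
    )


prediction = predict_pairwise(
    LongerAnswerJudge(),
    query="What is LlamaIndex?",
    answer="LlamaIndex is a data framework.",
    second_answer="LlamaIndex is a Python library for building LLM applications.",
)
print(prediction.score, prediction.evaluation_source.value)
```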

The module also provides American English aliases: LabeledEvaluatorDataExample, LabeledEvaluatorDataset, LabeledPairwiseEvaluatorDataExample, and LabeledPairwiseEvaluatorDataset.

Usage

Use this module to benchmark evaluator quality by comparing evaluator outputs against reference evaluations. Use LabelledEvaluatorDataset for standard single-response evaluation and LabelledPairwiseEvaluatorDataset for comparing two responses. Load datasets from JSON, run evaluators against them with make_predictions_with or amake_predictions_with, and analyze results with to_pandas.
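Once predictions exist, comparing them against the reference scores is left to the user. One possible comparison is sketched below using only the standard library. The function name summarize_agreement, the row layout, and the choice of metrics (mean absolute error and exact-match rate) are assumptions for illustration; the module itself does not define them.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Each row: (reference_score, predicted_score, invalid_prediction)
Row = Tuple[float, Optional[float], bool]


def summarize_agreement(rows: List[Row]) -> Dict[str, float]:
    """Compare predicted scores with reference scores, skipping unusable rows."""
    valid = [
        (ref, pred)
        for ref, pred, invalid in rows
        if not invalid and pred is not None
    ]
    if not valid:
        raise ValueError("no valid predictions to compare")
    return {
        "n_valid": len(valid),
        # Rows flagged invalid or missing a score are counted as skipped.
        "n_skipped": len(rows) - len(valid),
        "mean_abs_error": mean(abs(ref - pred) for ref, pred in valid),
        "exact_match_rate": mean(1.0 if ref == pred else 0.0 for ref, pred in valid),
    }


rows = [
    (4.0, 4.0, False),   # evaluator agrees with the reference
    (2.0, 3.0, False),   # evaluator is off by one point
    (5.0, None, True),   # prediction failed and is skipped
]
summary = summarize_agreement(rows)
print(summary)
```

In practice the rows could be assembled from the reference_score column of the dataset's to_pandas output and the score column of the prediction dataset's to_pandas output, assuming both keep the examples in the same order.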

Code Reference

Source Location

  • Repository: Run_llama_Llama_index
  • File: llama-index-core/llama_index/core/llama_dataset/evaluator_evaluation.py
  • Lines: 1-499

Signature

class EvaluatorExamplePrediction(BaseLlamaExamplePrediction):
    feedback: str = Field(default_factory=str)
    score: Optional[float] = Field(default=None)
    invalid_prediction: bool = Field(default=False)
    invalid_reason: Optional[str] = Field(default=None)

class LabelledEvaluatorDataExample(BaseLlamaDataExample):
    query: str = Field(default_factory=str)
    query_by: Optional[CreatedBy] = Field(default=None)
    contexts: Optional[List[str]] = Field(default=None)
    answer: str = Field(default_factory=str)
    answer_by: Optional[CreatedBy] = Field(default=None)
    ground_truth_answer: Optional[str] = Field(default=None)
    ground_truth_answer_by: Optional[CreatedBy] = Field(default=None)
    reference_feedback: Optional[str] = Field(default=None)
    reference_score: float = Field(default_factory=float)
    reference_evaluation_by: Optional[CreatedBy] = Field(default=None)

class EvaluatorPredictionDataset(BaseLlamaPredictionDataset):
    _prediction_type = EvaluatorExamplePrediction

class LabelledEvaluatorDataset(BaseLlamaDataset[BaseEvaluator]):
    _example_type = LabelledEvaluatorDataExample

class PairwiseEvaluatorExamplePrediction(BaseLlamaExamplePrediction):
    feedback: str = Field(default_factory=str)
    score: Optional[float] = Field(default=None)
    evaluation_source: Optional[EvaluationSource] = Field(default=None)
    invalid_prediction: bool = Field(default=False)
    invalid_reason: Optional[str] = Field(default=None)

class LabelledPairwiseEvaluatorDataExample(LabelledEvaluatorDataExample):
    second_answer: str = Field(default_factory=str)
    second_answer_by: Optional[CreatedBy] = Field(default=None)

class PairwiseEvaluatorPredictionDataset(BaseLlamaPredictionDataset):
    _prediction_type = PairwiseEvaluatorExamplePrediction

class LabelledPairwiseEvaluatorDataset(BaseLlamaDataset[BaseEvaluator]):
    _example_type = LabelledPairwiseEvaluatorDataExample

Import

from llama_index.core.llama_dataset.evaluator_evaluation import (
    EvaluatorExamplePrediction,
    LabelledEvaluatorDataExample,
    EvaluatorPredictionDataset,
    LabelledEvaluatorDataset,
    PairwiseEvaluatorExamplePrediction,
    LabelledPairwiseEvaluatorDataExample,
    PairwiseEvaluatorPredictionDataset,
    LabelledPairwiseEvaluatorDataset,
)

I/O Contract

Inputs

Name | Type | Required | Description
query | str | Yes (LabelledEvaluatorDataExample) | The user query for the evaluation example
answer | str | Yes | The response to be evaluated
contexts | Optional[List[str]] | No | Context strings used to generate the answer
ground_truth_answer | Optional[str] | No | The reference ground truth answer for comparison
reference_feedback | Optional[str] | No | The reference evaluator feedback (ground truth)
reference_score | float | Yes | The reference evaluator score (ground truth)
second_answer | str | Yes (Pairwise) | The second response for pairwise comparison
predictor | BaseEvaluator | Yes (for make_predictions_with) | The evaluator to benchmark
sleep_time_in_seconds | int | No | Delay between predictions to avoid rate limits

Note: per the signatures above, query, answer, second_answer, and reference_score are declared with default factories (empty string or 0.0), so "Yes" here means the field is needed for a meaningful example, not that model validation rejects its absence.

Outputs

Name | Type | Description
return (make_predictions_with) | EvaluatorPredictionDataset or PairwiseEvaluatorPredictionDataset | Dataset of evaluator predictions
feedback | str | The evaluator's textual feedback for an example
score | Optional[float] | The evaluator's numeric score for an example
invalid_prediction | bool | Whether the prediction encountered an error
invalid_reason | Optional[str] | Explanation of why a prediction is invalid
evaluation_source | Optional[EvaluationSource] | Whether the pairwise result came from original or flipped order
return (to_pandas) | pandas.DataFrame | DataFrame with evaluation results

Usage Examples

Basic Usage

from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.llama_dataset.base import CreatedBy, CreatedByType
from llama_index.core.llama_dataset.evaluator_evaluation import (
    LabelledEvaluatorDataExample,
    LabelledEvaluatorDataset,
)

# Create a labelled evaluation example
example = LabelledEvaluatorDataExample(
    query="What is LlamaIndex?",
    query_by=CreatedBy(type=CreatedByType.HUMAN),
    answer="LlamaIndex is a data framework for LLMs.",
    answer_by=CreatedBy(type=CreatedByType.AI, model_name="gpt-4"),
    ground_truth_answer="LlamaIndex is a data framework for building LLM applications.",
    reference_feedback="The answer is mostly correct but incomplete.",
    # Keep the reference score on the evaluator's own scale;
    # CorrectnessEvaluator scores from 1 to 5
    reference_score=4.0,
    reference_evaluation_by=CreatedBy(type=CreatedByType.HUMAN),
)

# Wrap the example in a dataset
dataset = LabelledEvaluatorDataset(examples=[example])

# CorrectnessEvaluator needs an LLM; with no llm argument it falls back to
# the globally configured default, which must be set up (e.g. API key) first
evaluator = CorrectnessEvaluator()

# Run the evaluator over every example; returns an EvaluatorPredictionDataset
predictions = dataset.make_predictions_with(evaluator, show_progress=True)
df = predictions.to_pandas()
print(df[["feedback", "score"]])

Pairwise Evaluation

from llama_index.core.llama_dataset.evaluator_evaluation import (
    LabelledPairwiseEvaluatorDataExample,
    LabelledPairwiseEvaluatorDataset,
)

# Create a pairwise evaluation example: two candidate answers to one query
pairwise_example = LabelledPairwiseEvaluatorDataExample(
    query="What is LlamaIndex?",
    answer="LlamaIndex is a data framework.",
    second_answer="LlamaIndex is a Python library for building LLM applications.",
    ground_truth_answer="LlamaIndex is a data framework for building LLM apps.",
    # Here 0.0 encodes that the reference evaluation prefers second_answer
    reference_score=0.0,
    reference_feedback="The second answer is more complete.",
)

pairwise_dataset = LabelledPairwiseEvaluatorDataset(examples=[pairwise_example])

# Save the dataset to disk and load it back
pairwise_dataset.save_json("pairwise_eval.json")
loaded = LabelledPairwiseEvaluatorDataset.from_json("pairwise_eval.json")
