Implementation:Deepset ai Haystack EvaluationRunResult

Overview

EvaluationRunResult is a Haystack class that contains the inputs and outputs of an evaluation pipeline run and provides methods to generate aggregated, detailed, and comparative reports in multiple formats.

Implements Principle

Principle:Deepset_ai_Haystack_Evaluation_Result_Reporting

Source Location

haystack/evaluation/eval_run_result.py (Lines 18-229)

Import

from haystack.evaluation import EvaluationRunResult

Dependencies

pandas (optional) -- Required only for DataFrame output format. Install via: pip install pandas
csv (standard library) -- Used for CSV output.

API

Constructor

def __init__(
    self,
    run_name: str,
    inputs: dict[str, list[Any]],
    results: dict[str, dict[str, Any]]
):

Parameters:

run_name (str) -- Name of the evaluation run (used as identifier in comparative reports).
inputs (dict[str, list[Any]]) -- Dictionary of inputs used for the run. Each key is an input name and its value is a list of input values. All lists must have the same length.
results (dict[str, dict[str, Any]]) -- Dictionary of evaluator results. Each key is a metric name and its value is a dictionary with:
- score (float) -- The aggregated score for the metric.
- individual_scores (list) -- A list of scores for each input sample. Must match the length of input lists.

Raises:

ValueError -- If no inputs are provided, the input lists differ in length, a metric's result lacks its score or individual_scores entry, or a metric's individual_scores length does not match the input length.
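
The checks listed above can be sketched in plain Python as follows. This is a minimal illustration rather than the library's code: validate_run is a hypothetical helper, and the error messages are assumptions.

```python
from typing import Any


def validate_run(inputs: dict[str, list[Any]], results: dict[str, dict[str, Any]]) -> None:
    """Hypothetical helper: raise ValueError for the conditions the constructor rejects."""
    if not inputs:
        raise ValueError("No inputs provided.")

    # All input lists must share one length.
    lengths = {len(values) for values in inputs.values()}
    if len(lengths) != 1:
        raise ValueError("Input lists must all have the same length.")
    expected = lengths.pop()

    for metric, output in results.items():
        if "score" not in output:
            raise ValueError(f"Aggregate score missing for '{metric}'.")
        if "individual_scores" not in output:
            raise ValueError(f"Individual scores missing for '{metric}'.")
        if len(output["individual_scores"]) != expected:
            raise ValueError(f"Individual scores for '{metric}' do not match the input length.")


# Well-formed data passes silently.
validate_run({"questions": ["Q1", "Q2"]}, {"mrr": {"score": 0.75, "individual_scores": [1.0, 0.5]}})

# One individual score is missing, so the check fails.
try:
    validate_run({"questions": ["Q1", "Q2"]}, {"mrr": {"score": 0.75, "individual_scores": [1.0]}})
except ValueError as exc:
    error_message = str(exc)
```

In this sketch the length mismatch is reported per metric, so the message names the offending metric.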

Attributes

run_name (str) -- Name of the evaluation run.
inputs (dict) -- Deep copy of the provided inputs.
results (dict) -- Deep copy of the provided results.

aggregated_report()

def aggregated_report(
    self,
    output_format: Literal["json", "csv", "df"] = "json",
    csv_file: str | None = None
) -> dict[str, list[Any]] | DataFrame | str:

Generates a report with aggregated scores for each metric.

Parameters:

output_format (str, default: "json") -- Output format: "json", "csv", or "df".
csv_file (str | None) -- File path for CSV output. Required when output_format="csv".

Returns:

JSON format: {"metrics": [...], "score": [...]}
DataFrame format: A pandas DataFrame with metrics and scores.
CSV format: A string message confirming successful write.
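
As a rough sketch of the JSON shape described above, the aggregated dictionary can be derived from the results mapping like this. aggregate_scores is a hypothetical stand-in used for illustration, not a Haystack function.

```python
from typing import Any


def aggregate_scores(results: dict[str, dict[str, Any]]) -> dict[str, list[Any]]:
    """Hypothetical helper: collect metric names and their aggregated scores as two columns."""
    return {
        "metrics": list(results.keys()),
        "score": [output["score"] for output in results.values()],
    }


results = {
    "mrr": {"score": 0.833, "individual_scores": [1.0, 0.5, 1.0]},
    "recall": {"score": 0.667, "individual_scores": [1.0, 0.0, 1.0]},
}
aggregated = aggregate_scores(results)
print(aggregated)
```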

detailed_report()

def detailed_report(
    self,
    output_format: Literal["json", "csv", "df"] = "json",
    csv_file: str | None = None
) -> dict[str, list[Any]] | DataFrame | str:

Generates a report with per-query scores for each metric, alongside the input data.

Parameters: Same as aggregated_report().

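Returns: In "json" and "df" formats, a dictionary or DataFrame with one column per input and one column of individual scores per metric. In "csv" format, the same columns are written to csv_file and a confirmation string is returned.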
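
The column layout of the detailed report can be illustrated with a small sketch. build_detailed is a hypothetical helper that assumes the inputs and results structures described for the constructor; it is not the library's implementation.

```python
from typing import Any


def build_detailed(inputs: dict[str, list[Any]], results: dict[str, dict[str, Any]]) -> dict[str, list[Any]]:
    """Hypothetical helper: input columns first, then one individual-score column per metric."""
    columns = {name: list(values) for name, values in inputs.items()}
    for metric, output in results.items():
        columns[metric] = list(output["individual_scores"])
    return columns


detailed = build_detailed(
    {"questions": ["Q1", "Q2"]},
    {"mrr": {"score": 0.75, "individual_scores": [1.0, 0.5]}},
)
print(detailed)
```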

comparative_detailed_report()

def comparative_detailed_report(
    self,
    other: EvaluationRunResult,
    keep_columns: list[str] | None = None,
    output_format: Literal["json", "csv", "df"] = "json",
    csv_file: str | None = None
) -> dict | DataFrame | str:

Generates a side-by-side comparison of two evaluation runs.

Parameters:

other (EvaluationRunResult) -- Results of another evaluation run to compare with.
keep_columns (list[str] | None) -- Input column names, common to both runs, to keep from both runs. If None, every input column of the first run is included once and the second run's input columns are omitted.
output_format (str, default: "json") -- Output format: "json", "csv", or "df".
csv_file (str | None) -- File path for CSV output. Required when output_format="csv".

Returns: Combined data with columns prefixed by run names (e.g., run_a_mrr, run_b_mrr).

Raises:

ValueError -- If other is not an EvaluationRunResult or is missing required attributes.

Warnings:

Logs a warning if the two run names are identical.
Logs a warning if the input columns differ between runs.
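
The default case (keep_columns=None) can be sketched as a merge of two detailed reports. compare_detailed and its parameters are hypothetical names chosen for this illustration; the sketch leaves out the warnings and the keep_columns filtering.

```python
from typing import Any


def compare_detailed(
    name_a: str,
    detailed_a: dict[str, list[Any]],
    name_b: str,
    detailed_b: dict[str, list[Any]],
    input_columns: list[str],
) -> dict[str, list[Any]]:
    """Hypothetical helper: keep the first run's input columns once, prefix metric columns."""
    combined: dict[str, list[Any]] = {}
    for key, values in detailed_a.items():
        # Input columns keep their name; metric columns get the run-name prefix.
        combined[key if key in input_columns else f"{name_a}_{key}"] = values
    for key, values in detailed_b.items():
        # Input columns of the second run are not repeated.
        if key not in input_columns:
            combined[f"{name_b}_{key}"] = values
    return combined


comparison = compare_detailed(
    "run_a", {"questions": ["Q1", "Q2"], "mrr": [1.0, 0.5]},
    "run_b", {"questions": ["Q1", "Q2"], "mrr": [1.0, 0.8]},
    input_columns=["questions"],
)
print(comparison)
```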

Usage Example

from haystack.evaluation import EvaluationRunResult

# Create from evaluation pipeline outputs
run_result = EvaluationRunResult(
    run_name="rag_pipeline_v1",
    inputs={
        "questions": ["What is Python?", "Who created Java?", "What is C++?"],
    },
    results={
        "mrr": {"score": 0.833, "individual_scores": [1.0, 0.5, 1.0]},
        "recall": {"score": 0.667, "individual_scores": [1.0, 0.0, 1.0]},
    },
)

# Aggregated view
agg = run_result.aggregated_report()
print(agg)
# {"metrics": ["mrr", "recall"], "score": [0.833, 0.667]}

# Detailed view
detail = run_result.detailed_report()
print(detail)
# {"questions": [...], "mrr": [1.0, 0.5, 1.0], "recall": [1.0, 0.0, 1.0]}

# Export to CSV
run_result.detailed_report(output_format="csv", csv_file="results.csv")

# Export to DataFrame
df = run_result.detailed_report(output_format="df")

Comparative Example

from haystack.evaluation import EvaluationRunResult

run_a = EvaluationRunResult(
    run_name="bm25_retriever",
    inputs={"questions": ["Q1", "Q2"]},
    results={"mrr": {"score": 0.75, "individual_scores": [1.0, 0.5]}},
)
run_b = EvaluationRunResult(
    run_name="embedding_retriever",
    inputs={"questions": ["Q1", "Q2"]},
    results={"mrr": {"score": 0.90, "individual_scores": [1.0, 0.8]}},
)

comparison = run_a.comparative_detailed_report(run_b)
print(comparison)
# {"questions": ["Q1", "Q2"],
#  "bm25_retriever_mrr": [1.0, 0.5],
#  "embedding_retriever_mrr": [1.0, 0.8]}

Internal Design

Data Immutability

Inputs and results are deep-copied at initialization, so later changes to the caller's original dictionaries do not affect the stored data.
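
The effect of a deep copy, compared with a shallow one, can be seen in a short standalone example that uses only the standard library copy module.

```python
import copy

original = {"questions": ["Q1", "Q2"]}

shallow = dict(original)        # copies the outer dict only; the inner list is shared
deep = copy.deepcopy(original)  # copies the nested list as well

# Mutate the caller's data after the copies were taken.
original["questions"].append("Q3")

print(shallow["questions"])  # the shared list reflects the change
print(deep["questions"])     # the deep copy is unaffected
```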

Type Consistency

When generating detailed reports, if any value in an individual scores column is a float, all values in that column are cast to float for consistency.
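
A minimal sketch of this casting rule, assuming it is applied column by column. normalize_scores is a hypothetical name, not part of the Haystack API.

```python
from typing import Any


def normalize_scores(scores: list[Any]) -> list[Any]:
    """Hypothetical helper: cast the whole column to float if any value is a float."""
    if any(isinstance(value, float) for value in scores):
        return [float(value) for value in scores]
    return scores


mixed = normalize_scores([1, 0.5, 0])    # contains a float, so every value is cast
integers = normalize_scores([1, 0, 1])   # no float present, so the column is left as-is
print(mixed, integers)
```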

Output Handling

The internal _handle_output() method routes formatted data to the appropriate output:

json: Returns the dictionary directly.
df: Converts to a pandas DataFrame (requires pandas).
csv: Writes to a file using the _write_to_csv() static method.
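
The routing above can be sketched with the standard library csv module. handle_output and write_to_csv here are standalone approximations written for this page; their signatures, error messages, and return string are assumptions and may differ from the internal _handle_output() and _write_to_csv() methods.

```python
import csv
import os
import tempfile
from typing import Any, Optional


def write_to_csv(csv_file: str, data: dict[str, list[Any]]) -> str:
    """Write a column-oriented dict as CSV: header row, then one row per sample."""
    with open(csv_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(data.keys())
        writer.writerows(zip(*data.values()))
    return f"Data successfully written to {csv_file}"


def handle_output(data: dict[str, list[Any]], output_format: str = "json", csv_file: Optional[str] = None):
    """Route column data to the requested output format."""
    if output_format == "json":
        return data
    if output_format == "df":
        import pandas as pd  # imported lazily, only needed for this branch

        return pd.DataFrame(data)
    if output_format == "csv":
        if not csv_file:
            raise ValueError("csv_file is required when output_format is 'csv'.")
        return write_to_csv(csv_file, data)
    raise ValueError(f"Unsupported output format: {output_format}")


detail = {"questions": ["Q1", "Q2"], "mrr": [1.0, 0.5]}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.csv")
    message = handle_output(detail, output_format="csv", csv_file=path)
    # Read the file back to inspect what was written.
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))
```

Because pandas is imported inside the "df" branch, the "json" and "csv" paths in this sketch run without pandas installed.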

Important Notes

Validation at init: The checks on inputs and results (matching lengths, required keys) run in the constructor, so malformed data fails immediately rather than later during report generation.
pandas is optional: The pandas library is only required when using output_format="df". It is lazy-imported.
Column naming in comparisons: In comparative reports, metric columns are prefixed with the run name (e.g., run_name_metric).
No pipeline coupling: EvaluationRunResult is a standalone data container. It does not depend on Pipeline or any evaluator component.

