Principle:Deepset ai Haystack Evaluation Result Reporting

Overview

Evaluation result reporting aggregates, compares, and exports the metric scores produced by evaluation pipeline runs. It provides structured views of evaluation data at both the aggregated and per-query levels, and supports A/B comparison between pipeline configurations.

Domains

Evaluation
Analytics

Theoretical Foundation

Evaluation pipelines produce raw metric outputs from multiple evaluators. To be actionable, these outputs must be organized into structured reports. Three views are needed (aggregated, detailed, and comparative), along with a choice of output formats.

Aggregated Reporting

Summarizes overall performance across all queries for each metric:

| Metric     | Score |
|------------|-------|
| MRR        | 0.85  |
| MAP        | 0.72  |
| Recall     | 0.90  |
| SAS        | 0.88  |

This view answers: "How well does the pipeline perform overall?"
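
As a minimal sketch of this view, the snippet below reduces per-query scores to one number per metric. It assumes the aggregate is the arithmetic mean of the per-query scores, which may not hold for every evaluator. The function name aggregate and the sample scores are illustrative and are not part of Haystack.

```python
from statistics import fmean

# Hypothetical per-query scores, keyed by metric name.
individual_scores = {
    "MRR": [1.0, 0.5, 0.0],
    "MAP": [0.95, 0.60, 0.00],
    "Recall": [1.0, 0.5, 0.0],
}


def aggregate(scores_by_metric):
    """Return one summary score per metric (assumed here to be the mean)."""
    return {metric: fmean(values) for metric, values in scores_by_metric.items()}


aggregated = aggregate(individual_scores)
for metric, score in aggregated.items():
    print(f"{metric}: {score:.2f}")
```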

Detailed Reporting

Provides per-query scores alongside the inputs, enabling error analysis:

| Question            | MRR  | MAP  | Recall |
|---------------------|------|------|--------|
| "What is Python?"   | 1.0  | 0.95 | 1.0    |
| "Who wrote Java?"   | 0.5  | 0.60 | 0.5    |
| "What is C++?"      | 0.0  | 0.00 | 0.0    |

This view answers: "Which queries perform well and which fail?"
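
The detailed view can be sketched as a join of input columns and per-query scores into one row per query. The helper detailed_rows and the 0.5 threshold are hypothetical choices for illustration only. The sketch assumes every column has the same length.

```python
def detailed_rows(inputs, individual_scores):
    """Combine input columns and per-query metric scores into one row per query."""
    columns = {**inputs, **individual_scores}
    n = len(next(iter(columns.values())))
    return [{name: values[i] for name, values in columns.items()} for i in range(n)]


inputs = {"question": ["What is Python?", "Who wrote Java?", "What is C++?"]}
scores = {"MRR": [1.0, 0.5, 0.0], "Recall": [1.0, 0.5, 0.0]}

rows = detailed_rows(inputs, scores)

# Error analysis: list the queries whose MRR falls below a chosen threshold.
failing = [row["question"] for row in rows if row["MRR"] < 0.5]
```

Filtering the rows this way is what makes the per-query view useful: the failing queries can be inspected individually instead of being hidden inside an average.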

Comparative Reporting

Enables side-by-side comparison of two evaluation runs:

| Question          | Run_A_MRR | Run_B_MRR | Run_A_Recall | Run_B_Recall |
|-------------------|-----------|-----------|--------------|--------------|
| "What is Python?" | 1.0       | 0.5       | 1.0          | 1.0          |
| "Who wrote Java?" | 0.5       | 1.0       | 0.5          | 1.0          |

This view answers: "How do two pipeline configurations compare on the same data?"
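
A minimal sketch of the comparison follows, assuming both runs were evaluated on the same queries in the same order and report the same metrics. The function comparative_rows and its parameters are hypothetical. Metric columns are prefixed with the run name, as in the table above.

```python
def comparative_rows(name_a, rows_a, name_b, rows_b, shared=("question",)):
    """Merge two per-query reports, prefixing metric columns with the run name."""
    if len(rows_a) != len(rows_b):
        raise ValueError("Both runs must cover the same number of queries.")
    merged = []
    for a, b in zip(rows_a, rows_b):
        if any(a[col] != b[col] for col in shared):
            raise ValueError("Both runs must use the same queries in the same order.")
        row = {col: a[col] for col in shared}
        for key in a:
            if key not in shared:
                # Place the two runs side by side for each metric.
                row[f"{name_a}_{key}"] = a[key]
                row[f"{name_b}_{key}"] = b[key]
        merged.append(row)
    return merged


run_a = [
    {"question": "What is Python?", "MRR": 1.0, "Recall": 1.0},
    {"question": "Who wrote Java?", "MRR": 0.5, "Recall": 0.5},
]
run_b = [
    {"question": "What is Python?", "MRR": 0.5, "Recall": 1.0},
    {"question": "Who wrote Java?", "MRR": 1.0, "Recall": 1.0},
]

comparison = comparative_rows("Run_A", run_a, "Run_B", run_b)
```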

Output Formats

Evaluation results must be exportable in multiple formats:

JSON: For programmatic consumption and further processing.
DataFrame: For interactive data analysis (pandas integration).
CSV: For sharing with non-technical stakeholders or importing into spreadsheets.
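
The JSON and CSV exports can be sketched with the Python standard library alone. The helpers to_json and to_csv and the sample rows are hypothetical and do not reflect how any particular library implements export.

```python
import csv
import io
import json

# Hypothetical per-query report rows.
rows = [
    {"question": "What is Python?", "MRR": 1.0},
    {"question": "Who wrote Java?", "MRR": 0.5},
]


def to_json(rows):
    """Serialize report rows for programmatic consumption."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize report rows as CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_json(rows))
print(to_csv(rows))
```

With pandas installed, passing the same list of dictionaries to pandas.DataFrame(rows) yields the tabular view for interactive analysis.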

Design Principles

Immutability: Input data and results are deep-copied to prevent mutation.
Consistency: All input lists must have the same length, and each metric's list of individual scores must match that length.
Validation: Missing scores or mismatched lengths are caught at initialization time with clear error messages.
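
The three principles above can be sketched together in a small container class. RunReport is a hypothetical name used for illustration; it is not the Haystack class, and the error messages are invented for this example.

```python
from copy import deepcopy


class RunReport:
    """Hypothetical result container illustrating the design principles."""

    def __init__(self, run_name, inputs, results):
        self.run_name = run_name
        # Immutability: keep private copies so later caller-side changes have no effect.
        self.inputs = deepcopy(inputs)
        self.results = deepcopy(results)

        # Consistency: every input column must have the same length.
        lengths = {len(values) for values in self.inputs.values()}
        if not lengths:
            raise ValueError("No inputs provided.")
        if len(lengths) != 1:
            raise ValueError("All input lists must have the same length.")
        expected = lengths.pop()

        # Validation: each metric needs an aggregate and matching per-query scores.
        for metric, outputs in self.results.items():
            if "score" not in outputs:
                raise ValueError(f"Aggregate score missing for {metric}.")
            if "individual_scores" not in outputs:
                raise ValueError(f"Individual scores missing for {metric}.")
            if len(outputs["individual_scores"]) != expected:
                raise ValueError(
                    f"{metric} has {len(outputs['individual_scores'])} individual "
                    f"scores, expected {expected}."
                )


questions = ["What is Python?", "Who wrote Java?"]
report = RunReport(
    "run_a",
    {"question": questions},
    {"MRR": {"score": 0.75, "individual_scores": [1.0, 0.5]}},
)
questions.append("What is C++?")  # the stored copy still holds two questions
```

Failing at construction time means a malformed run is rejected before any report is generated from it.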

When to Use Evaluation Result Reporting

Post-evaluation analysis: After running an evaluation pipeline, to understand results.
A/B testing: To compare different retriever, prompt, or generator configurations.
Reporting: To export results for documentation or stakeholder communication.
Regression testing: To compare current results against a baseline.
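
For the regression-testing case, one possible check compares aggregated scores against a stored baseline and flags drops beyond a tolerance. The function regressions, the tolerance of 0.01, and the sample scores are all assumptions made for this sketch.

```python
def regressions(baseline, current, tolerance=0.01):
    """Return metrics whose current score dropped below baseline by more than tolerance."""
    return {
        metric: (baseline[metric], current[metric])
        for metric in baseline
        if metric in current and baseline[metric] - current[metric] > tolerance
    }


# Hypothetical aggregated scores from a baseline run and the current run.
baseline = {"MRR": 0.85, "Recall": 0.90}
current = {"MRR": 0.80, "Recall": 0.91}

dropped = regressions(baseline, current)
```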

Limitations

Every evaluator must include both score and individual_scores in its output.
Comparative reports assume the same queries and same-length inputs across runs.
DataFrame output requires pandas as an additional dependency.

Relationship to Implementation

In the Haystack framework, this principle is realized by the EvaluationRunResult class, which:

Stores run name, inputs, and results from an evaluation pipeline.
Provides aggregated_report(), detailed_report(), and comparative_detailed_report() methods.
Supports JSON, DataFrame, and CSV output formats.

Related Principles

Evaluation Pipeline Execution -- the pipeline execution that produces the results consumed by this reporting mechanism.

Related Pages

Implemented By

Implementation:Deepset_ai_Haystack_EvaluationRunResult
