Principle:Deepset ai Haystack Evaluation Result Reporting

From Leeroopedia
Revision as of 18:15, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Deepset_ai_Haystack_Evaluation_Result_Reporting.md)

Overview

Evaluation result reporting aggregates metric scores from evaluation pipeline runs, compares them across runs, and exports them for analysis. It provides structured views of evaluation data at both the aggregated and per-query levels, and supports A/B comparison between pipeline configurations.

Domains

  • Evaluation
  • Analytics

Theoretical Foundation

Evaluation pipelines produce raw metric outputs from multiple evaluators. To be actionable, these outputs must be organized into structured reports. Three views cover the common needs: aggregated, detailed, and comparative.

Aggregated Reporting

Summarizes overall performance across all queries for each metric:

| Metric     | Score |
|------------|-------|
| MRR        | 0.85  |
| MAP        | 0.72  |
| Recall     | 0.90  |
| SAS        | 0.88  |

This view answers: "How well does the pipeline perform overall?"
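As a rough illustration of the aggregated view, the sketch below collapses per-query scores into one number per metric. It assumes the aggregate is the arithmetic mean of the per-query scores, which the text above does not state. The `aggregate` helper and the sample values are hypothetical and are not part of Haystack.

```python
from statistics import mean

# Hypothetical per-query scores for each metric (illustrative values only).
individual_scores = {
    "MRR": [1.0, 0.5, 0.0],
    "Recall": [1.0, 0.5, 0.0],
}


def aggregate(scores_by_metric):
    """Return one score per metric, assuming a plain arithmetic mean."""
    return {metric: mean(scores) for metric, scores in scores_by_metric.items()}


aggregated = aggregate(individual_scores)

# Print a two-column summary similar to the table above.
for metric, score in aggregated.items():
    print(f"{metric:<10} {score:.2f}")
```

Some evaluators may aggregate differently (for example, by weighting queries), so the mean here is only a stand-in.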

Detailed Reporting

Provides per-query scores alongside the inputs, enabling error analysis:

| Question            | MRR  | MAP  | Recall |
|---------------------|------|------|--------|
| "What is Python?"   | 1.0  | 0.95 | 1.0    |
| "Who wrote Java?"   | 0.5  | 0.60 | 0.5    |
| "What is C++?"      | 0.0  | 0.00 | 0.0    |

This view answers: "Which queries perform well and which fail?"
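The detailed view can be sketched as a join between the input columns and the per-query score columns. This is a minimal sketch under the assumption that inputs and scores are stored as equal-length lists keyed by column name. The `detailed_rows` helper is hypothetical, and the data mirrors the table above.

```python
# Inputs and per-query scores as equal-length columns (values from the table above).
inputs = {"question": ["What is Python?", "Who wrote Java?", "What is C++?"]}
individual_scores = {
    "MRR": [1.0, 0.5, 0.0],
    "MAP": [0.95, 0.60, 0.00],
    "Recall": [1.0, 0.5, 0.0],
}


def detailed_rows(inputs, scores):
    """Turn column-oriented inputs and scores into one dict per query."""
    columns = {**inputs, **scores}
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = detailed_rows(inputs, individual_scores)

# Error analysis: list the queries where nothing relevant was retrieved.
failing = [row["question"] for row in rows if row["Recall"] == 0.0]
print(failing)
```

Keeping the inputs next to the scores is what makes it possible to go from a low number back to the query that caused it.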

Comparative Reporting

Enables side-by-side comparison of two evaluation runs:

| Question          | Run_A_MRR | Run_B_MRR | Run_A_Recall | Run_B_Recall |
|-------------------|-----------|-----------|--------------|--------------|
| "What is Python?" | 1.0       | 0.5       | 1.0          | 1.0          |
| "Who wrote Java?" | 0.5       | 1.0       | 0.5          | 1.0          |

This view answers: "How do two pipeline configurations compare on the same data?"
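A side-by-side comparison can be sketched as a row-wise merge in which each metric column is prefixed with its run name. The `compare_runs` helper below is hypothetical. It assumes both runs contain the same queries in the same order and the same metrics, and it raises an error when the queries differ.

```python
def compare_runs(name_a, rows_a, name_b, rows_b, key="question"):
    """Merge two per-query reports into one table with run-prefixed metric columns."""
    if [row[key] for row in rows_a] != [row[key] for row in rows_b]:
        raise ValueError("Both runs must cover the same queries in the same order.")

    merged = []
    for a, b in zip(rows_a, rows_b):
        row = {key: a[key]}
        for metric in a:
            if metric == key:
                continue
            row[f"{name_a}_{metric}"] = a[metric]
            row[f"{name_b}_{metric}"] = b[metric]  # assumes run B has the same metrics
        merged.append(row)
    return merged


# Values taken from the comparison table above.
run_a = [
    {"question": "What is Python?", "MRR": 1.0, "Recall": 1.0},
    {"question": "Who wrote Java?", "MRR": 0.5, "Recall": 0.5},
]
run_b = [
    {"question": "What is Python?", "MRR": 0.5, "Recall": 1.0},
    {"question": "Who wrote Java?", "MRR": 1.0, "Recall": 1.0},
]

comparison = compare_runs("Run_A", run_a, "Run_B", run_b)
for row in comparison:
    print(row)
```

The same merge serves regression testing when one of the two runs is a stored baseline.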

Output Formats

Evaluation results must be exportable in multiple formats:

  • JSON: For programmatic consumption and further processing.
  • DataFrame: For interactive data analysis (pandas integration).
  • CSV: For sharing with non-technical stakeholders or importing into spreadsheets.
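To make the JSON and CSV formats concrete, the sketch below serializes a small per-query report using only the Python standard library. The `to_json` and `to_csv` helpers are hypothetical names used for illustration. The DataFrame format is left out because it requires pandas.

```python
import csv
import io
import json

# A small per-query report (illustrative values).
rows = [
    {"question": "What is Python?", "MRR": 1.0},
    {"question": "Who wrote Java?", "MRR": 0.5},
]


def to_json(rows):
    """Serialize the report for programmatic consumption."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the report as CSV text, with a header row taken from the first record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(rows)
csv_text = to_csv(rows)
print(csv_text)
```

With pandas installed, the same list of dictionaries can be passed to `pandas.DataFrame(rows)` for interactive analysis.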

Design Principles

  • Immutability: Input data and results are deep-copied when the report object is created, so later changes to the caller's data cannot alter the report.
  • Consistency: All input lists must have the same length. Individual score lists must match input length.
  • Validation: Missing scores or mismatched lengths are caught at initialization time with clear error messages.
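The three principles above can be sketched as a small container class that copies its arguments and checks them at construction time. `RunResultSketch` is a hypothetical stand-in and not Haystack's implementation. The `score` and `individual_scores` keys follow the output shape named under Limitations, and the error messages are illustrative.

```python
from copy import deepcopy


class RunResultSketch:
    """Hypothetical container illustrating deep-copying and up-front validation."""

    def __init__(self, run_name, inputs, results):
        self.run_name = run_name
        # Immutability: keep private copies so callers cannot mutate the report.
        self.inputs = deepcopy(inputs)
        self.results = deepcopy(results)

        if not self.inputs:
            raise ValueError("No inputs provided.")

        # Consistency: every input column must have the same length.
        lengths = {len(values) for values in self.inputs.values()}
        if len(lengths) != 1:
            raise ValueError("All input lists must have the same length.")
        expected = lengths.pop()

        # Validation: each metric needs an aggregate and matching per-query scores.
        for metric, outputs in self.results.items():
            if "score" not in outputs:
                raise ValueError(f"Aggregate score missing for '{metric}'.")
            if "individual_scores" not in outputs:
                raise ValueError(f"Individual scores missing for '{metric}'.")
            if len(outputs["individual_scores"]) != expected:
                raise ValueError(
                    f"'{metric}' has {len(outputs['individual_scores'])} scores, "
                    f"expected {expected}."
                )


questions = {"question": ["What is Python?", "Who wrote Java?"]}
scores = {"MRR": {"score": 0.75, "individual_scores": [1.0, 0.5]}}
result = RunResultSketch("baseline", questions, scores)

# Mutating the caller's data afterwards does not affect the stored copy.
questions["question"].append("What is C++?")
print(len(result.inputs["question"]))
```

Failing at construction time means a malformed run is reported once, at the point where it is created, instead of surfacing later as a misaligned table.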

When to Use Evaluation Result Reporting

  • Post-evaluation analysis: After running an evaluation pipeline, to understand results.
  • A/B testing: To compare different retriever, prompt, or generator configurations.
  • Reporting: To export results for documentation or stakeholder communication.
  • Regression testing: To compare current results against a baseline.

Limitations

  • Requires all evaluators to produce score and individual_scores in their output.
  • Comparative reports assume the same queries and same-length inputs across runs.
  • DataFrame output requires pandas as an additional dependency.

Relationship to Implementation

In the Haystack framework, this principle is realized by the EvaluationRunResult class, which:

  • Stores run name, inputs, and results from an evaluation pipeline.
  • Provides aggregated_report(), detailed_report(), and comparative_detailed_report() methods.
  • Supports JSON, DataFrame, and CSV output formats.

Related Principles

  • Evaluation Pipeline Execution -- the pipeline execution that produces the results consumed by this reporting mechanism.
