Principle:Deepset ai Haystack Evaluation Result Reporting
Overview
Evaluation result reporting aggregates, compares, and exports metric scores from evaluation pipeline runs. It provides structured views of evaluation data at both the aggregated and the per-query level, and enables A/B comparison between pipeline configurations.
Domains
- Evaluation
- Analytics
Theoretical Foundation
Evaluation pipelines produce raw metric outputs from multiple evaluators. To be actionable, these outputs must be organized into structured reports. Three complementary views serve this purpose: aggregated, detailed, and comparative.
Aggregated Reporting
Summarizes overall performance across all queries for each metric:
| Metric | Score |
|------------|-------|
| MRR | 0.85 |
| MAP | 0.72 |
| Recall | 0.90 |
| SAS | 0.88 |
This view answers: "How well does the pipeline perform overall?"
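A minimal sketch of the aggregated view, assuming each evaluator output is a dictionary with an overall score and a list of individual_scores. The helper name and the sample values are illustrative assumptions, not the Haystack implementation.

```python
# Hypothetical evaluator outputs: one overall "score" per metric plus
# the per-query "individual_scores" it was derived from.
results = {
    "mrr": {"score": 0.5, "individual_scores": [1.0, 0.5, 0.0]},
    "recall": {"score": 0.5, "individual_scores": [1.0, 0.5, 0.0]},
}


def aggregated_report(results):
    """Return one overall score per metric in column-oriented form."""
    return {
        "metrics": list(results),
        "score": [entry["score"] for entry in results.values()],
    }


report = aggregated_report(results)
for metric, score in zip(report["metrics"], report["score"]):
    print(f"{metric}: {score:.2f}")
```

The column-oriented layout keeps metric names and scores in parallel lists, so it maps directly onto a two-column table like the one above.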
Detailed Reporting
Provides per-query scores alongside the inputs, enabling error analysis:
| Question | MRR | MAP | Recall |
|---------------------|------|------|--------|
| "What is Python?" | 1.0 | 0.95 | 1.0 |
| "Who wrote Java?" | 0.5 | 0.60 | 0.5 |
| "What is C++?" | 0.0 | 0.00 | 0.0 |
This view answers: "Which queries perform well and which fail?"
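The detailed view can be sketched as a column-oriented table in which each metric's per-query scores sit next to the input columns. The helper below is a hypothetical illustration that reuses the sample rows from the table above. The 0.5 threshold is an arbitrary choice for the example.

```python
# Inputs and evaluator outputs for three queries (illustrative values).
inputs = {"question": ["What is Python?", "Who wrote Java?", "What is C++?"]}
results = {
    "mrr": {"score": 0.5, "individual_scores": [1.0, 0.5, 0.0]},
    "recall": {"score": 0.5, "individual_scores": [1.0, 0.5, 0.0]},
}


def detailed_report(inputs, results):
    """Place each metric's per-query scores next to the input columns."""
    report = {name: list(values) for name, values in inputs.items()}
    for metric, entry in results.items():
        report[metric] = list(entry["individual_scores"])
    return report


report = detailed_report(inputs, results)

# Error analysis: collect the queries whose MRR falls below the threshold.
failing = [q for q, s in zip(report["question"], report["mrr"]) if s < 0.5]
print(failing)
```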
Comparative Reporting
Enables side-by-side comparison of two evaluation runs:
| Question | Run_A_MRR | Run_B_MRR | Run_A_Recall | Run_B_Recall |
|-------------------|-----------|-----------|--------------|--------------|
| "What is Python?" | 1.0 | 0.5 | 1.0 | 1.0 |
| "Who wrote Java?" | 0.5 | 1.0 | 0.5 | 1.0 |
This view answers: "How do two pipeline configurations compare on the same data?"
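One way to build such a comparison is to merge two detailed reports and prefix each metric column with its run name. The sketch below assumes both runs used identical inputs and checks that explicitly. The function and column names are hypothetical, and the columns come out grouped by run rather than interleaved as in the table.

```python
def comparative_report(name_a, report_a, name_b, report_b, input_columns):
    """Merge two detailed reports, prefixing metric columns with the run name."""
    # Both runs must have been evaluated on the same inputs.
    for column in input_columns:
        if report_a[column] != report_b[column]:
            raise ValueError(f"Runs differ on input column '{column}'")

    merged = {column: list(report_a[column]) for column in input_columns}
    for name, report in ((name_a, report_a), (name_b, report_b)):
        for column, values in report.items():
            if column not in input_columns:
                merged[f"{name}_{column}"] = list(values)
    return merged


questions = ["What is Python?", "Who wrote Java?"]
run_a = {"question": questions, "mrr": [1.0, 0.5], "recall": [1.0, 0.5]}
run_b = {"question": questions, "mrr": [0.5, 1.0], "recall": [1.0, 1.0]}

merged = comparative_report("Run_A", run_a, "Run_B", run_b, ["question"])

# Per-query MRR change from run A to run B (positive means B did better).
mrr_delta = [b - a for a, b in zip(merged["Run_A_mrr"], merged["Run_B_mrr"])]
print(mrr_delta)
```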
Output Formats
Evaluation results must be exportable in multiple formats:
- JSON: For programmatic consumption and further processing.
- DataFrame: For interactive data analysis (pandas integration).
- CSV: For sharing with non-technical stakeholders or importing into spreadsheets.
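As an illustration, a column-oriented report can be serialized to JSON and CSV with the standard library alone. The helper names below are hypothetical. The same dictionary of equal-length lists could also be passed to pandas.DataFrame to obtain the DataFrame form, which is left out here to avoid the extra dependency.

```python
import csv
import io
import json

# A small column-oriented report (illustrative values).
report = {
    "question": ["What is Python?", "Who wrote Java?"],
    "mrr": [1.0, 0.5],
    "recall": [1.0, 0.5],
}


def to_json(report):
    """Serialize the report for programmatic consumption."""
    return json.dumps(report, indent=2)


def to_csv(report):
    """Write a header row followed by one row per query."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(report.keys())
    writer.writerows(zip(*report.values()))
    return buffer.getvalue()


print(to_json(report))
print(to_csv(report))
```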
Design Principles
- Immutability: Input data and results are deep-copied to prevent mutation.
- Consistency: All input lists must have the same length. Individual score lists must match input length.
- Validation: Missing scores or mismatched lengths are caught at initialization time with clear error messages.
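These three principles can be sketched in a small container class. RunResult below is a hypothetical stand-in written for illustration, not the Haystack class, and the error messages are assumptions.

```python
from copy import deepcopy


class RunResult:
    """Hypothetical result container that copies and validates its data."""

    def __init__(self, run_name, inputs, results):
        self.run_name = run_name
        # Immutability: later changes to the caller's objects do not leak in.
        self.inputs = deepcopy(inputs)
        self.results = deepcopy(results)

        # Consistency: every input column must have the same length.
        lengths = {len(values) for values in self.inputs.values()}
        if len(lengths) != 1:
            raise ValueError("All input lists must have the same length.")
        expected = lengths.pop()

        # Validation: each metric needs both score fields, correctly sized.
        for metric, entry in self.results.items():
            for key in ("score", "individual_scores"):
                if key not in entry:
                    raise ValueError(f"'{key}' is missing for metric '{metric}'.")
            if len(entry["individual_scores"]) != expected:
                raise ValueError(
                    f"Metric '{metric}' has {len(entry['individual_scores'])} "
                    f"individual scores, expected {expected}."
                )


inputs = {"question": ["What is Python?", "Who wrote Java?"]}
results = {"mrr": {"score": 0.75, "individual_scores": [1.0, 0.5]}}
run = RunResult("baseline", inputs, results)

# Mutating the caller's list afterwards leaves the stored copy untouched.
inputs["question"].append("What is C++?")
print(run.inputs["question"])
```

Failing at construction time means a malformed run is rejected before any report is built from it.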
When to Use Evaluation Result Reporting
- Post-evaluation analysis: After running an evaluation pipeline, to understand results.
- A/B testing: To compare different retriever, prompt, or generator configurations.
- Reporting: To export results for documentation or stakeholder communication.
- Regression testing: To compare current results against a baseline.
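For the regression-testing case, a simple check compares aggregated scores from the current run against a stored baseline. This is a sketch under assumptions: the function name, the tolerance value, and the scores are all illustrative.

```python
def find_regressions(baseline, current, tolerance=0.01):
    """Return metrics whose score dropped more than `tolerance` below baseline."""
    return {
        metric: (baseline[metric], current[metric])
        for metric in baseline
        if metric in current and baseline[metric] - current[metric] > tolerance
    }


# Aggregated scores per metric for two runs (illustrative values).
baseline = {"mrr": 0.85, "recall": 0.90}
current = {"mrr": 0.80, "recall": 0.91}

regressions = find_regressions(baseline, current)
for metric, (old, new) in regressions.items():
    print(f"{metric} regressed: {old:.2f} -> {new:.2f}")
```

A tolerance avoids flagging small fluctuations as regressions. The right value depends on the metric and the size of the evaluation set.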
Limitations
- Requires all evaluators to produce score and individual_scores in their output.
- Comparative reports assume the same queries and same-length inputs across runs.
- DataFrame output requires pandas as an additional dependency.
Relationship to Implementation
In the Haystack framework, this principle is realized by the EvaluationRunResult class, which:
- Stores run name, inputs, and results from an evaluation pipeline.
- Provides aggregated_report(), detailed_report(), and comparative_detailed_report() methods.
- Supports JSON, DataFrame, and CSV output formats.
Related Principles
- Evaluation Pipeline Execution -- the pipeline execution that produces the results consumed by this reporting mechanism.