Principle: deepset-ai Haystack Evaluation Pipeline Execution
Overview
Evaluation pipeline execution runs multiple evaluator components in a single pipeline pass, feeding ground truth data and predictions to each metric simultaneously. This pattern enables efficient, consistent, and reproducible evaluation of retrieval and generation systems.
Domains
- Evaluation
- Workflow_Orchestration
Theoretical Foundation
Evaluating a RAG pipeline requires computing multiple metrics (MRR, MAP, Recall, Faithfulness, etc.) on the same set of inputs. Rather than running each evaluator independently, the evaluation pipeline pattern orchestrates all evaluators in a single execution pass.
Parallel Evaluation Pattern
The evaluation pipeline distributes shared data to multiple independent evaluator components:
                   +---> MRR Evaluator ----------> MRR scores
                   |
Input Data --------+---> MAP Evaluator ----------> MAP scores
(ground truths,    |
 predictions,      +---> Recall Evaluator -------> Recall scores
 contexts)         |
                   +---> Faithfulness Evaluator -> Faithfulness scores
Each evaluator is an independent component with no connections to other evaluators. The pipeline orchestrator ensures:
- All evaluators receive the correct input data.
- Evaluators run in an efficient order (independent components can conceptually run in parallel).
- All results are collected and returned in a unified output structure.
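The fan-out-and-collect behavior can be sketched in plain Python, with simple metric functions standing in for evaluator components. The function and variable names below are illustrative assumptions, not Haystack API:

```python
# Minimal sketch of the parallel evaluation pattern: shared input data is
# fanned out to independent evaluators, and results are collected by name.

def mrr(ground_truths, retrieved):
    # Mean Reciprocal Rank: 1/rank of the first relevant document per query.
    scores = []
    for truth, docs in zip(ground_truths, retrieved):
        rank = next((i + 1 for i, d in enumerate(docs) if d in truth), None)
        scores.append(1.0 / rank if rank else 0.0)
    return sum(scores) / len(scores)

def recall(ground_truths, retrieved):
    # Fraction of queries where at least one relevant document was retrieved.
    hits = [any(d in truth for d in docs)
            for truth, docs in zip(ground_truths, retrieved)]
    return sum(hits) / len(hits)

evaluators = {"mrr_eval": mrr, "recall_eval": recall}
ground_truths = [{"doc_a"}, {"doc_b"}]
retrieved = [["doc_x", "doc_a"], ["doc_b", "doc_y"]]

# Fan the same shared data out to every independent evaluator and
# key each result by the evaluator's name.
results = {name: {"score": fn(ground_truths, retrieved)}
           for name, fn in evaluators.items()}
print(results)  # mrr_eval -> 0.75, recall_eval -> 1.0
```

Because the evaluators never call each other, the orchestrator is free to run them in any order, or concurrently.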
Input Structure
Inputs to the evaluation pipeline are structured as per-evaluator dictionaries:
{
    "evaluator_1": {"ground_truth_documents": [...], "retrieved_documents": [...]},
    "evaluator_2": {"ground_truth_answers": [...], "predicted_answers": [...]},
    "evaluator_3": {"questions": [...], "contexts": [...], "predicted_answers": [...]},
}
This explicit mapping ensures each evaluator receives exactly the inputs it expects, even when different evaluators require different input formats.
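In practice this means building one input dictionary entry per evaluator, repeating shared data where needed. A hedged sketch, where the evaluator names and field names are examples only:

```python
# Shared evaluation data, gathered once during data preparation.
ground_truth_documents = [["doc_a"], ["doc_b"]]
retrieved_documents = [["doc_x", "doc_a"], ["doc_b"]]
ground_truth_answers = ["Paris", "Berlin"]
predicted_answers = ["Paris", "Bonn"]

# Per-evaluator input mapping: each evaluator gets exactly the fields
# it expects, even though some data is duplicated across entries.
pipeline_inputs = {
    "mrr_evaluator": {
        "ground_truth_documents": ground_truth_documents,
        "retrieved_documents": retrieved_documents,
    },
    "recall_evaluator": {
        # The same retrieval data is passed explicitly; there is
        # no automatic broadcasting between evaluators.
        "ground_truth_documents": ground_truth_documents,
        "retrieved_documents": retrieved_documents,
    },
    "answer_evaluator": {
        "ground_truth_answers": ground_truth_answers,
        "predicted_answers": predicted_answers,
    },
}
```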
Output Structure
The pipeline returns outputs keyed by evaluator component name:
{
    "evaluator_1": {"score": 0.85, "individual_scores": [...]},
    "evaluator_2": {"score": 0.72, "individual_scores": [...]},
    "evaluator_3": {"score": 0.90, "individual_scores": [...], "results": [...]},
}
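Because outputs share this keyed shape, downstream reporting can collapse them generically. A small sketch with made-up scores:

```python
# Outputs keyed by evaluator name, in the shape shown above.
pipeline_outputs = {
    "evaluator_1": {"score": 0.85, "individual_scores": [0.8, 0.9]},
    "evaluator_2": {"score": 0.72, "individual_scores": [0.7, 0.74]},
}

# Collapse to a flat {evaluator: aggregate score} summary for reporting.
summary = {name: out["score"] for name, out in pipeline_outputs.items()}
print(summary)  # {'evaluator_1': 0.85, 'evaluator_2': 0.72}
```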
Pipeline as Evaluation Framework
Using the standard Pipeline for evaluation provides several advantages:
- Consistency: The same orchestration engine used for inference handles evaluation.
- Composability: Evaluators can be mixed and matched freely.
- Serialization: Evaluation pipelines can be saved, loaded, and shared.
- Tracing: Built-in tracing captures per-component timing and inputs/outputs.
Separation of Concerns
The evaluation pipeline pattern separates:
- Data preparation: Collecting ground truth, predictions, and contexts.
- Metric computation: Each evaluator focuses on one specific metric.
- Result aggregation: The EvaluationRunResult class handles reporting.
This separation makes it easy to add new metrics, change evaluation data, or adjust pipeline structure without affecting other components.
When to Use Evaluation Pipeline Execution
- Multi-metric evaluation: When computing several metrics on the same data.
- Standardized evaluation workflows: When evaluation should be reproducible and shareable.
- CI/CD integration: When evaluation is part of an automated pipeline validation process.
- Evaluation experiments: When comparing different evaluation metric combinations.
Limitations
- Each evaluator must be independently addressed in the input dictionary -- there is no automatic input broadcasting.
- The same data may need to be provided to multiple evaluators explicitly (no implicit sharing).
- LLM-based evaluators (Faithfulness, Context Relevance) incur API costs and latency.
Relationship to Implementation
In the Haystack framework, this principle is realized by the standard Pipeline.run() method, the same API used for inference pipelines:
- Evaluator components are added to the pipeline with add_component().
- Since evaluators have no inter-component connections, they run as independent leaf nodes.
- The run() method accepts a dictionary keyed by component names.
- Results from all evaluators are collected in the pipeline output.
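The orchestration contract can be modeled with a small self-contained stand-in; TinyPipeline and ExactMatchEvaluator below are illustrative toys, not Haystack classes, and the real Pipeline engine does considerably more (validation, tracing, serialization):

```python
class TinyPipeline:
    """Simplified model of a pipeline dispatching per-component inputs."""

    def __init__(self):
        self._components = {}

    def add_component(self, name, component):
        self._components[name] = component

    def run(self, data):
        # Each evaluator is a leaf node: call it with its own input dict
        # and key the result by component name.
        return {name: comp.run(**data.get(name, {}))
                for name, comp in self._components.items()}


class ExactMatchEvaluator:
    """Toy evaluator: fraction of predicted answers matching ground truth."""

    def run(self, ground_truth_answers, predicted_answers):
        individual = [float(p == g)
                      for g, p in zip(ground_truth_answers, predicted_answers)]
        return {"score": sum(individual) / len(individual),
                "individual_scores": individual}


pipeline = TinyPipeline()
pipeline.add_component("em_evaluator", ExactMatchEvaluator())
out = pipeline.run({"em_evaluator": {
    "ground_truth_answers": ["Paris", "Berlin"],
    "predicted_answers": ["Paris", "Bonn"],
}})
print(out["em_evaluator"]["score"])  # 0.5
```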
Related Principles
- Evaluation Result Reporting -- consumes the outputs of evaluation pipeline execution for structured reporting.
- Retrieval MRR Evaluation, Retrieval MAP Evaluation, Retrieval Recall Evaluation -- individual metric principles computed within the pipeline.
- Faithfulness Evaluation, Context Relevance Evaluation -- LLM-based metrics computed within the pipeline.