Implementation: Deepset-ai Haystack Pipeline Run for Evaluation
Overview
Pipeline.run() is the standard Haystack pipeline execution method, used here from the evaluation perspective. When a pipeline is composed exclusively of evaluator components (with no inter-component connections), Pipeline.run() distributes input data to each evaluator and collects all metric outputs in a single pass.
Implements Principle
Principle:Deepset_ai_Haystack_Evaluation_Pipeline_Execution
Source Location
haystack/core/pipeline/pipeline.py (Lines 109-447)
This is the same Pipeline.run() API used for inference pipelines, documented here from the evaluation context.
Import
from haystack import Pipeline
API
Pipeline Construction for Evaluation
from haystack import Pipeline
from haystack.components.evaluators import (
DocumentMRREvaluator,
DocumentMAPEvaluator,
DocumentRecallEvaluator,
FaithfulnessEvaluator,
ContextRelevanceEvaluator,
SASEvaluator,
)
eval_pipeline = Pipeline()
eval_pipeline.add_component("mrr", DocumentMRREvaluator())
eval_pipeline.add_component("map", DocumentMAPEvaluator())
eval_pipeline.add_component("recall", DocumentRecallEvaluator(mode="multi_hit"))
eval_pipeline.add_component("faithfulness", FaithfulnessEvaluator())
eval_pipeline.add_component("context_relevance", ContextRelevanceEvaluator())
eval_pipeline.add_component("sas", SASEvaluator())
Key observation: No connect() calls are needed for evaluation pipelines. Each evaluator is an independent leaf component.
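Because each evaluator is self-contained, it can also be invoked directly, outside any pipeline. A minimal sketch for sanity-checking inputs before a full run (the document contents are illustrative):
from haystack import Document
from haystack.components.evaluators import DocumentMRREvaluator

# Invoke a single evaluator directly -- the same inputs it would receive
# from Pipeline.run(), just without the pipeline wrapper.
mrr = DocumentMRREvaluator()
out = mrr.run(
    ground_truth_documents=[[Document(content="Paris is the capital of France.")]],
    retrieved_documents=[[Document(content="Paris is the capital of France.")]],
)
print(out["score"])  # 1.0 -- the ground-truth document is ranked first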
run()
def run(
self,
data: dict[str, Any],
include_outputs_from: set[str] | None = None,
) -> dict[str, Any]:
Parameters:
- data (dict[str, Any]) -- A dictionary of inputs for the pipeline's components. Each key is a component name and its value is a dictionary of that component's input parameters.
- include_outputs_from (set[str] | None) -- Set of component names whose individual outputs should be included in the pipeline output. For evaluation pipelines this is typically unnecessary, since all evaluators are leaf components whose outputs are included automatically.
Returns: A dictionary where each entry corresponds to a component name and its output dictionary.
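A minimal call, assuming the eval_pipeline built above and a data dictionary in the format described next:
results = eval_pipeline.run(data)
print(results["mrr"]["score"])  # aggregate MRR across all queries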
Input Data Format
For evaluation pipelines, input data is keyed by evaluator component name:
data = {
"mrr": {
"ground_truth_documents": ground_truth_docs,
"retrieved_documents": retrieved_docs,
},
"map": {
"ground_truth_documents": ground_truth_docs,
"retrieved_documents": retrieved_docs,
},
"recall": {
"ground_truth_documents": ground_truth_docs,
"retrieved_documents": retrieved_docs,
},
"faithfulness": {
"questions": questions,
"contexts": contexts,
"predicted_answers": predicted_answers,
},
"context_relevance": {
"questions": questions,
"contexts": contexts,
},
"sas": {
"ground_truth_answers": ground_truth_answers,
"predicted_answers": predicted_answers,
},
}
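The three retrieval evaluators above take identical inputs, so this dictionary can be built programmatically. A small helper (illustrative, not part of Haystack) to avoid the repetition:
def retrieval_inputs(names, ground_truth_docs, retrieved_docs):
    # Fan the same document lists out to every named retrieval evaluator.
    return {
        name: {
            "ground_truth_documents": ground_truth_docs,
            "retrieved_documents": retrieved_docs,
        }
        for name in names
    }

data = retrieval_inputs(["mrr", "map", "recall"], ground_truth_docs, retrieved_docs)
# Merge in the remaining evaluator inputs as shown above, e.g. data["sas"] = {...}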
Output Format
{
"mrr": {"score": 0.85, "individual_scores": [1.0, 0.5, 1.0]},
"map": {"score": 0.72, "individual_scores": [0.9, 0.5, 0.75]},
"recall": {"score": 0.67, "individual_scores": [1.0, 0.0, 1.0]},
"faithfulness": {"score": 0.80, "individual_scores": [1.0, 0.6], "results": [...]},
"context_relevance": {"score": 0.67, "individual_scores": [1, 1, 0], "results": [...]},
"sas": {"score": 0.92, "individual_scores": [0.95, 0.89, 0.92]},
}
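Since every entry carries a score key, a flat summary of the aggregates is one comprehension away (a sketch over the run results shown above):
summary = {name: output["score"] for name, output in results.items()}
# {'mrr': 0.85, 'map': 0.72, 'recall': 0.67, 'faithfulness': 0.8, ...}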
Complete Evaluation Workflow Example
from haystack import Document, Pipeline
from haystack.components.evaluators import (
DocumentMRREvaluator,
DocumentMAPEvaluator,
DocumentRecallEvaluator,
)
from haystack.evaluation import EvaluationRunResult
# 1. Build the evaluation pipeline
eval_pipeline = Pipeline()
eval_pipeline.add_component("mrr", DocumentMRREvaluator())
eval_pipeline.add_component("map", DocumentMAPEvaluator())
eval_pipeline.add_component("recall", DocumentRecallEvaluator(mode="multi_hit"))
# 2. Prepare evaluation data
ground_truths = [
[Document(content="Paris is the capital of France.")],
[Document(content="Berlin was founded in 1244."), Document(content="Berlin is the capital of Germany.")],
]
retrieved = [
[Document(content="Paris is the capital of France."), Document(content="Lyon is a city in France.")],
[Document(content="Berlin is the capital of Germany."), Document(content="Munich is in Bavaria.")],
]
# 3. Run the evaluation pipeline
results = eval_pipeline.run({
"mrr": {"ground_truth_documents": ground_truths, "retrieved_documents": retrieved},
"map": {"ground_truth_documents": ground_truths, "retrieved_documents": retrieved},
"recall": {"ground_truth_documents": ground_truths, "retrieved_documents": retrieved},
})
# 4. Create an EvaluationRunResult for reporting
eval_result = EvaluationRunResult(
run_name="retrieval_eval_v1",
inputs={"questions": ["What is the capital of France?", "Tell me about Berlin."]},
results={
"mrr": results["mrr"],
"map": results["map"],
"recall": results["recall"],
},
)
# 5. Generate reports
print(eval_result.aggregated_report())
print(eval_result.detailed_report())
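One possible follow-up is gating a CI job on the aggregate scores. A sketch with illustrative thresholds (not part of the workflow above):
thresholds = {"mrr": 0.8, "map": 0.7, "recall": 0.6}  # illustrative values
below = {
    metric: results[metric]["score"]
    for metric, minimum in thresholds.items()
    if results[metric]["score"] < minimum
}
if below:
    raise SystemExit(f"Metrics below threshold: {below}")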
How Pipeline.run() Handles Evaluation
Component Execution Order
Since evaluator components have no inter-component connections, the pipeline treats them as independent leaf nodes. The execution order is determined alphabetically by component name for determinism.
Input Preparation
The _prepare_component_input_data() method normalizes the input dictionary. When inputs are keyed by component name (the standard evaluation format), each component receives only its designated inputs.
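Conceptually, the normalization behaves roughly as follows (this is a sketch, not the actual Haystack implementation): component-keyed input passes through unchanged, while flat input names are routed to every component that declares a matching input socket.
def prepare_inputs(data, component_sockets):
    # component_sockets: {component_name: set of its input socket names}
    if all(key in component_sockets for key in data):
        return data  # already keyed by component name -- the evaluation format
    routed = {}
    for input_name, value in data.items():
        for component, sockets in component_sockets.items():
            if input_name in sockets:
                routed.setdefault(component, {})[input_name] = value
    return routed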
Input Validation
The validate_input() method verifies that:
- All provided component names exist in the pipeline.
- All required inputs for each component are present.
- Input types match the expected signatures.
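These checks fail fast, before any evaluator executes. For example (the exact exception type is an assumption; recent versions raise ValueError for unknown component names):
try:
    eval_pipeline.run({"not_a_component": {"ground_truth_documents": [], "retrieved_documents": []}})
except ValueError as err:
    print(f"Rejected before execution: {err}")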
Output Collection
For evaluation pipelines (no inter-component connections), all components are leaf nodes. Their outputs are automatically collected into the pipeline_outputs dictionary, keyed by component name.
Warm-up
Pipeline.run() calls warm_up() before execution, which initializes components that need it (e.g., SASEvaluator loading its model).
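To keep model loading out of a timed or profiled run, a component can also be warmed up manually beforehand; the warm-up inside Pipeline.run() then has nothing left to load (sketch):
# Warm the pipeline's own SASEvaluator instance before the measured run.
eval_pipeline.get_component("sas").warm_up()
results = eval_pipeline.run(data)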
Important Notes
- Same API as inference: Evaluation uses the exact same Pipeline.run() API as inference. The only difference is the types of components in the pipeline.
- No connections needed: Evaluators are independent components. Do not call connect() between them.
- Explicit data routing: Each evaluator must receive its inputs explicitly in the data dictionary. There is no automatic data sharing between evaluators.
- Model-based evaluator warm-up: Components like SASEvaluator require warm_up() to load models. This is called automatically by Pipeline.run().
- Error handling: If any evaluator fails, a PipelineRuntimeError is raised with the component name and error details (see the sketch after this list).
- Tracing support: Pipeline execution is traced, providing visibility into per-component timing and inputs/outputs.
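A hedged error-handling sketch for the failure case noted above (the import path matches recent Haystack 2.x releases; verify it against your version):
from haystack.core.errors import PipelineRuntimeError

try:
    results = eval_pipeline.run(data)
except PipelineRuntimeError as err:
    # The exception message names the failing component and the underlying error.
    print(f"Evaluation failed: {err}")
    raise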
Dependencies
- haystack core library (Pipeline, Component, tracing)
- Individual evaluator dependencies (see each evaluator's documentation)