Implementation: Deepset-ai Haystack Pipeline Run for Evaluation
Overview
Pipeline.run() is the standard Haystack pipeline execution method, used here from the evaluation perspective. When a pipeline is composed exclusively of evaluator components (with no inter-component connections), Pipeline.run() distributes input data to each evaluator and collects all metric outputs in a single pass.
Implements Principle
Principle:Deepset_ai_Haystack_Evaluation_Pipeline_Execution
Source Location
haystack/core/pipeline/pipeline.py (Lines 109-447)
This is the same Pipeline.run() API used for inference pipelines, documented here from the evaluation context.
Import
from haystack import Pipeline
API
Pipeline Construction for Evaluation
from haystack import Pipeline
from haystack.components.evaluators import (
DocumentMRREvaluator,
DocumentMAPEvaluator,
DocumentRecallEvaluator,
FaithfulnessEvaluator,
ContextRelevanceEvaluator,
SASEvaluator,
)
eval_pipeline = Pipeline()
eval_pipeline.add_component("mrr", DocumentMRREvaluator())
eval_pipeline.add_component("map", DocumentMAPEvaluator())
eval_pipeline.add_component("recall", DocumentRecallEvaluator(mode="multi_hit"))
eval_pipeline.add_component("faithfulness", FaithfulnessEvaluator())
eval_pipeline.add_component("context_relevance", ContextRelevanceEvaluator())
eval_pipeline.add_component("sas", SASEvaluator())
Key observation: No connect() calls are needed for evaluation pipelines. Each evaluator is an independent leaf component.
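Because each evaluator is self-contained, it can also be invoked directly, outside any pipeline. A minimal sketch for sanity-checking inputs before a full run (the document contents are illustrative):
from haystack import Document
from haystack.components.evaluators import DocumentMRREvaluator

# Invoke a single evaluator directly -- the same inputs it would receive
# from Pipeline.run(), just without the pipeline wrapper.
mrr = DocumentMRREvaluator()
out = mrr.run(
    ground_truth_documents=[[Document(content="Paris is the capital of France.")]],
    retrieved_documents=[[Document(content="Paris is the capital of France.")]],
)
print(out["score"])  # 1.0 -- the ground-truth document is ranked first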
run()
def run(
self,
data: dict[str, Any],
include_outputs_from: set[str] | None = None,
) -> dict[str, Any]:
Parameters:
- data (dict[str, Any]) -- A dictionary of inputs for the pipeline's components. Each key is a component name and its value is a dictionary of that component's input parameters.
- include_outputs_from (set[str] | None) -- Set of component names whose individual outputs should be included in the pipeline output. For evaluation pipelines this is typically unnecessary, since all evaluators are leaf components whose outputs are included automatically.
Returns: A dictionary where each entry corresponds to a component name and its output dictionary.
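A minimal call, assuming the eval_pipeline built above and a data dictionary in the format described next:
results = eval_pipeline.run(data)
print(results["mrr"]["score"])  # aggregate MRR across all queries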
Input Data Format
For evaluation pipelines, input data is keyed by evaluator component name:
data = {
"mrr": {
"ground_truth_documents": ground_truth_docs,
"retrieved_documents": retrieved_docs,
},
"map": {
"ground_truth_documents": ground_truth_docs,
"retrieved_documents": retrieved_docs,
},
"recall": {
"ground_truth_documents": ground_truth_docs,
"retrieved_documents": retrieved_docs,
},
"faithfulness": {
"questions": questions,
"contexts": contexts,
"predicted_answers": predicted_answers,
},
"context_relevance": {
"questions": questions,
"contexts": contexts,
},
"sas": {
"ground_truth_answers": ground_truth_answers,
"predicted_answers": predicted_answers,
},
}
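The three retrieval evaluators above take identical inputs, so this dictionary can be built programmatically. A small helper (illustrative, not part of Haystack) to avoid the repetition:
def retrieval_inputs(names, ground_truth_docs, retrieved_docs):
    # Fan the same document lists out to every named retrieval evaluator.
    return {
        name: {
            "ground_truth_documents": ground_truth_docs,
            "retrieved_documents": retrieved_docs,
        }
        for name in names
    }

data = retrieval_inputs(["mrr", "map", "recall"], ground_truth_docs, retrieved_docs)
# Merge in the remaining evaluator inputs as shown above, e.g. data["sas"] = {...}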
Output Format
{
"mrr": {"score": 0.85, "individual_scores": [1.0, 0.5, 1.0]},
"map": {"score": 0.72, "individual_scores": [0.9, 0.5, 0.75]},
"recall": {"score": 0.67, "individual_scores": [1.0, 0.0, 1.0]},
"faithfulness": {"score": 0.80, "individual_scores": [1.0, 0.6], "results": [...]},
"context_relevance": {"score": 0.67, "individual_scores": [1, 1, 0], "results": [...]},
"sas": {"score": 0.92, "individual_scores": [0.95, 0.89, 0.92]},
}
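Since every entry carries a score key, a flat summary of the aggregates is one comprehension away (a sketch over the run results shown above):
summary = {name: output["score"] for name, output in results.items()}
# {'mrr': 0.85, 'map': 0.72, 'recall': 0.67, 'faithfulness': 0.8, ...}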
Complete Evaluation Workflow Example
from haystack import Document, Pipeline
from haystack.components.evaluators import (
DocumentMRREvaluator,
DocumentMAPEvaluator,
DocumentRecallEvaluator,
)
from haystack.evaluation import EvaluationRunResult
# 1. Build the evaluation pipeline
eval_pipeline = Pipeline()
eval_pipeline.add_component("mrr", DocumentMRREvaluator())
eval_pipeline.add_component("map", DocumentMAPEvaluator())
eval_pipeline.add_component("recall", DocumentRecallEvaluator(mode="multi_hit"))
# 2. Prepare evaluation data
ground_truths = [
[Document(content="Paris is the capital of France.")],
[Document(content="Berlin was founded in 1244."), Document(content="Berlin is the capital of Germany.")],
]
retrieved = [
[Document(content="Paris is the capital of France."), Document(content="Lyon is a city in France.")],
[Document(content="Berlin is the capital of Germany."), Document(content="Munich is in Bavaria.")],
]
# 3. Run the evaluation pipeline
results = eval_pipeline.run({
"mrr": {"ground_truth_documents": ground_truths, "retrieved_documents": retrieved},
"map": {"ground_truth_documents": ground_truths, "retrieved_documents": retrieved},
"recall": {"ground_truth_documents": ground_truths, "retrieved_documents": retrieved},
})
# 4. Create an EvaluationRunResult for reporting
eval_result = EvaluationRunResult(
run_name="retrieval_eval_v1",
inputs={"questions": ["What is the capital of France?", "Tell me about Berlin."]},
results={
"mrr": results["mrr"],
"map": results["map"],
"recall": results["recall"],
},
)
# 5. Generate reports
print(eval_result.aggregated_report())
print(eval_result.detailed_report())
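One possible follow-up is gating a CI job on the aggregate scores. A sketch with illustrative thresholds (not part of the workflow above):
thresholds = {"mrr": 0.8, "map": 0.7, "recall": 0.6}  # illustrative values
below = {
    metric: results[metric]["score"]
    for metric, minimum in thresholds.items()
    if results[metric]["score"] < minimum
}
if below:
    raise SystemExit(f"Metrics below threshold: {below}")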
How Pipeline.run() Handles Evaluation
Component Execution Order
Since evaluator components have no inter-component connections, the pipeline treats them as independent leaf nodes. The execution order is determined alphabetically by component name for determinism.
Input Preparation
The _prepare_component_input_data() method normalizes the input dictionary. When inputs are keyed by component name (the standard evaluation format), each component receives only its designated inputs.
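Conceptually, the normalization behaves roughly as follows (this is a sketch, not the actual Haystack implementation): component-keyed input passes through unchanged, while flat input names are routed to every component that declares a matching input socket.
def prepare_inputs(data, component_sockets):
    # component_sockets: {component_name: set of its input socket names}
    if all(key in component_sockets for key in data):
        return data  # already keyed by component name -- the evaluation format
    routed = {}
    for input_name, value in data.items():
        for component, sockets in component_sockets.items():
            if input_name in sockets:
                routed.setdefault(component, {})[input_name] = value
    return routed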
Input Validation
The validate_input() method verifies that:
- All provided component names exist in the pipeline.
- All required inputs for each component are present.
- Input types match the expected signatures.
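These checks fail fast, before any evaluator executes. For example (the exact exception type is an assumption; recent versions raise ValueError for unknown component names):
try:
    eval_pipeline.run({"not_a_component": {"ground_truth_documents": [], "retrieved_documents": []}})
except ValueError as err:
    print(f"Rejected before execution: {err}")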
Output Collection
For evaluation pipelines (no inter-component connections), all components are leaf nodes. Their outputs are automatically collected into the pipeline_outputs dictionary, keyed by component name.
Warm-up
Pipeline.run() calls warm_up() before execution, which initializes components that need it (e.g., SASEvaluator loading its model).
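To keep model loading out of a timed or profiled run, a component can also be warmed up manually beforehand; the warm-up inside Pipeline.run() then has nothing left to load (sketch):
# Warm the pipeline's own SASEvaluator instance before the measured run.
eval_pipeline.get_component("sas").warm_up()
results = eval_pipeline.run(data)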
Important Notes
- Same API as inference: Evaluation uses the exact same Pipeline.run() API as inference. The only difference is the types of components in the pipeline.
- No connections needed: Evaluators are independent components. Do not call connect() between them.
- Explicit data routing: Each evaluator must receive its inputs explicitly in the data dictionary. There is no automatic data sharing between evaluators.
- Model-based evaluator warm-up: Components like SASEvaluator require warm_up() to load models. This is called automatically by Pipeline.run().
- Error handling: If any evaluator fails, a PipelineRuntimeError is raised with the component name and error details (see the sketch after this list).
- Tracing support: Pipeline execution is traced, providing visibility into per-component timing and inputs/outputs.
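A hedged error-handling sketch for the failure case noted above (the import path matches recent Haystack 2.x releases; verify it against your version):
from haystack.core.errors import PipelineRuntimeError

try:
    results = eval_pipeline.run(data)
except PipelineRuntimeError as err:
    # The exception message names the failing component and the underlying error.
    print(f"Evaluation failed: {err}")
    raise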
Dependencies
- haystack core library (Pipeline, Component, tracing)
- Individual evaluator dependencies (see each evaluator's documentation)