Principle: deepset-ai Haystack Evaluation Pipeline Execution
Overview
Evaluation pipeline execution runs multiple evaluator components in a single pipeline pass, feeding ground truth data and predictions to each metric simultaneously. This pattern enables efficient, consistent, and reproducible evaluation of retrieval and generation systems.
Domains
- Evaluation
- Workflow_Orchestration
Theoretical Foundation
Evaluating a RAG pipeline requires computing multiple metrics (MRR, MAP, Recall, Faithfulness, etc.) on the same set of inputs. Rather than running each evaluator independently, the evaluation pipeline pattern orchestrates all evaluators in a single execution pass.
Parallel Evaluation Pattern
The evaluation pipeline distributes shared data to multiple independent evaluator components:
                   +---> MRR Evaluator ----------> MRR scores
                   |
Input Data --------+---> MAP Evaluator ----------> MAP scores
(ground truths,    |
 predictions,      +---> Recall Evaluator -------> Recall scores
 contexts)         |
                   +---> Faithfulness Evaluator -> Faithfulness scores
Each evaluator is an independent component with no connections to other evaluators. The pipeline orchestrator ensures:
- All evaluators receive the correct input data.
- Evaluators run in an efficient order (independent components can conceptually run in parallel).
- All results are collected and returned in a unified output structure.
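The fan-out-and-collect behavior can be sketched in plain Python, with simple metric functions standing in for evaluator components. The function and variable names below are illustrative assumptions, not Haystack API:

```python
# Minimal sketch of the parallel evaluation pattern: shared input data is
# fanned out to independent evaluators, and results are collected by name.

def mrr(ground_truths, retrieved):
    # Mean Reciprocal Rank: 1/rank of the first relevant document per query.
    scores = []
    for truth, docs in zip(ground_truths, retrieved):
        rank = next((i + 1 for i, d in enumerate(docs) if d in truth), None)
        scores.append(1.0 / rank if rank else 0.0)
    return sum(scores) / len(scores)

def recall(ground_truths, retrieved):
    # Fraction of queries where at least one relevant document was retrieved.
    hits = [any(d in truth for d in docs)
            for truth, docs in zip(ground_truths, retrieved)]
    return sum(hits) / len(hits)

evaluators = {"mrr_eval": mrr, "recall_eval": recall}
ground_truths = [{"doc_a"}, {"doc_b"}]
retrieved = [["doc_x", "doc_a"], ["doc_b", "doc_y"]]

# Fan the same shared data out to every independent evaluator and
# key each result by the evaluator's name.
results = {name: {"score": fn(ground_truths, retrieved)}
           for name, fn in evaluators.items()}
print(results)  # mrr_eval -> 0.75, recall_eval -> 1.0
```

Because the evaluators never call each other, the orchestrator is free to run them in any order, or concurrently.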
Input Structure
Inputs to the evaluation pipeline are structured as per-evaluator dictionaries:
{
    "evaluator_1": {"ground_truth_documents": [...], "retrieved_documents": [...]},
    "evaluator_2": {"ground_truth_answers": [...], "predicted_answers": [...]},
    "evaluator_3": {"questions": [...], "contexts": [...], "predicted_answers": [...]},
}
This explicit mapping ensures each evaluator receives exactly the inputs it expects, even when different evaluators require different input formats.
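In practice this means building one input dictionary entry per evaluator, repeating shared data where needed. A hedged sketch, where the evaluator names and field names are examples only:

```python
# Shared evaluation data, gathered once during data preparation.
ground_truth_documents = [["doc_a"], ["doc_b"]]
retrieved_documents = [["doc_x", "doc_a"], ["doc_b"]]
ground_truth_answers = ["Paris", "Berlin"]
predicted_answers = ["Paris", "Bonn"]

# Per-evaluator input mapping: each evaluator gets exactly the fields
# it expects, even though some data is duplicated across entries.
pipeline_inputs = {
    "mrr_evaluator": {
        "ground_truth_documents": ground_truth_documents,
        "retrieved_documents": retrieved_documents,
    },
    "recall_evaluator": {
        # The same retrieval data is passed explicitly; there is
        # no automatic broadcasting between evaluators.
        "ground_truth_documents": ground_truth_documents,
        "retrieved_documents": retrieved_documents,
    },
    "answer_evaluator": {
        "ground_truth_answers": ground_truth_answers,
        "predicted_answers": predicted_answers,
    },
}
```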
Output Structure
The pipeline returns outputs keyed by evaluator component name:
{
    "evaluator_1": {"score": 0.85, "individual_scores": [...]},
    "evaluator_2": {"score": 0.72, "individual_scores": [...]},
    "evaluator_3": {"score": 0.90, "individual_scores": [...], "results": [...]},
}
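Because outputs share this keyed shape, downstream reporting can collapse them generically. A small sketch with made-up scores:

```python
# Outputs keyed by evaluator name, in the shape shown above.
pipeline_outputs = {
    "evaluator_1": {"score": 0.85, "individual_scores": [0.8, 0.9]},
    "evaluator_2": {"score": 0.72, "individual_scores": [0.7, 0.74]},
}

# Collapse to a flat {evaluator: aggregate score} summary for reporting.
summary = {name: out["score"] for name, out in pipeline_outputs.items()}
print(summary)  # {'evaluator_1': 0.85, 'evaluator_2': 0.72}
```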
Pipeline as Evaluation Framework
Using the standard Pipeline for evaluation provides several advantages:
- Consistency: The same orchestration engine used for inference handles evaluation.
- Composability: Evaluators can be mixed and matched freely.
- Serialization: Evaluation pipelines can be saved, loaded, and shared.
- Tracing: Built-in tracing captures per-component timing and inputs/outputs.
Separation of Concerns
The evaluation pipeline pattern separates:
- Data preparation: Collecting ground truth, predictions, and contexts.
- Metric computation: Each evaluator focuses on one specific metric.
- Result aggregation: The EvaluationRunResult class handles reporting.
This separation makes it easy to add new metrics, change evaluation data, or adjust pipeline structure without affecting other components.
When to Use Evaluation Pipeline Execution
- Multi-metric evaluation: When computing several metrics on the same data.
- Standardized evaluation workflows: When evaluation should be reproducible and shareable.
- CI/CD integration: When evaluation is part of an automated pipeline validation process.
- Evaluation experiments: When comparing different evaluation metric combinations.
Limitations
- Each evaluator must be independently addressed in the input dictionary -- there is no automatic input broadcasting.
- The same data may need to be provided to multiple evaluators explicitly (no implicit sharing).
- LLM-based evaluators (Faithfulness, Context Relevance) incur API costs and latency.
Relationship to Implementation
In the Haystack framework, this principle is realized by the standard Pipeline.run() method, the same API used for inference pipelines:
- Evaluator components are added to the pipeline with add_component().
- Since evaluators have no inter-component connections, they run as independent leaf nodes.
- The run() method accepts a dictionary keyed by component names.
- Results from all evaluators are collected in the pipeline output.
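The orchestration contract can be modeled with a small self-contained stand-in; TinyPipeline and ExactMatchEvaluator below are illustrative toys, not Haystack classes, and the real Pipeline engine does considerably more (validation, tracing, serialization):

```python
class TinyPipeline:
    """Simplified model of a pipeline dispatching per-component inputs."""

    def __init__(self):
        self._components = {}

    def add_component(self, name, component):
        self._components[name] = component

    def run(self, data):
        # Each evaluator is a leaf node: call it with its own input dict
        # and key the result by component name.
        return {name: comp.run(**data.get(name, {}))
                for name, comp in self._components.items()}


class ExactMatchEvaluator:
    """Toy evaluator: fraction of predicted answers matching ground truth."""

    def run(self, ground_truth_answers, predicted_answers):
        individual = [float(p == g)
                      for g, p in zip(ground_truth_answers, predicted_answers)]
        return {"score": sum(individual) / len(individual),
                "individual_scores": individual}


pipeline = TinyPipeline()
pipeline.add_component("em_evaluator", ExactMatchEvaluator())
out = pipeline.run({"em_evaluator": {
    "ground_truth_answers": ["Paris", "Berlin"],
    "predicted_answers": ["Paris", "Bonn"],
}})
print(out["em_evaluator"]["score"])  # 0.5
```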
Related Principles
- Evaluation Result Reporting -- consumes the outputs of evaluation pipeline execution for structured reporting.
- Retrieval MRR Evaluation, Retrieval MAP Evaluation, Retrieval Recall Evaluation -- individual metric principles computed within the pipeline.
- Faithfulness Evaluation, Context Relevance Evaluation -- LLM-based metrics computed within the pipeline.