Workflow:Deepset ai Haystack RAG Evaluation Pipeline

Knowledge Sources	Haystack, Haystack Docs
Domains	LLMs, RAG, Evaluation, MLOps
Last Updated	2026-02-11 20:00 GMT

Overview

End-to-end process for evaluating RAG pipeline quality using multiple retrieval and generation metrics, with support for comparative analysis across pipeline configurations.

Description

This workflow implements a comprehensive evaluation framework for RAG pipelines in Haystack. It runs a RAG pipeline against a set of evaluation questions with known ground truth answers and source documents, then feeds the results through an evaluation pipeline containing multiple evaluators. Metrics cover retrieval quality (Document MRR, MAP, Recall), answer quality (Semantic Answer Similarity, Faithfulness), and context quality (Context Relevance). The EvaluationRunResult class provides aggregated and detailed reports, with support for comparing two pipeline configurations side by side.

Usage

Execute this workflow when you need to measure and compare RAG pipeline quality. Typical triggers include selecting between retrieval strategies (e.g., different top_k values, embedding models), validating pipeline changes before deployment, establishing quality baselines, or performing A/B testing of pipeline configurations.

Execution Steps

Step 1: Prepare Evaluation Dataset

Assemble a set of evaluation questions, each with a ground truth answer and a list of ground truth source document identifiers. Documents are loaded and indexed into a document store using the standard indexing pipeline.

Key considerations:

Each evaluation question needs: question text, expected answer, ground truth document names
Documents are indexed with embeddings using SentenceTransformersDocumentEmbedder
Use DuplicatePolicy.SKIP to avoid re-indexing existing documents
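The record shape described above can be sketched in plain Python. This is a minimal illustration, not part of Haystack: the `EvalExample` class, its field names, and the `validate` helper are hypothetical names chosen for this example, and the sample questions are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalExample:
    """One evaluation record (hypothetical structure, not a Haystack class)."""
    question: str
    expected_answer: str
    ground_truth_docs: tuple[str, ...]  # names of the source documents


def validate(examples: list[EvalExample]) -> list[str]:
    """Return human-readable problems; an empty list means the set is usable."""
    problems = []
    for i, ex in enumerate(examples):
        if not ex.question.strip():
            problems.append(f"example {i}: empty question")
        if not ex.expected_answer.strip():
            problems.append(f"example {i}: empty expected answer")
        if not ex.ground_truth_docs:
            problems.append(f"example {i}: no ground truth documents")
    return problems


# Toy data: the second record is deliberately incomplete.
examples = [
    EvalExample("What is a retriever?", "A component that fetches documents.", ("intro.txt",)),
    EvalExample("What is a reader?", "", ()),
]
problems = validate(examples)
```

Checking the dataset before indexing keeps incomplete records from silently lowering the retrieval and answer metrics later on.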

Step 2: Run RAG Pipeline on Evaluation Questions

Execute the RAG pipeline for each evaluation question and collect the predicted answers, retrieved documents, and context passages. This produces the raw data needed for metric computation.

Key considerations:

Capture retrieved documents (for retrieval metrics)
Capture context content (for faithfulness and relevance metrics)
Capture predicted answer text (for answer similarity metrics)
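The collection loop might look like the sketch below. It assumes the RAG pipeline is wrapped in a callable that returns retrieved document names, their text, and an answer. `fake_rag`, `collect_outputs`, and the dictionary keys are hypothetical stand-ins, not Haystack APIs.

```python
from typing import Callable


def fake_rag(question: str) -> dict:
    """Hypothetical stand-in for one call to a real RAG pipeline."""
    return {
        "documents": ["doc_a", "doc_b"],
        "contexts": ["text of doc_a", "text of doc_b"],
        "answer": f"answer to: {question}",
    }


def collect_outputs(questions: list[str], run_rag: Callable[[str], dict]) -> dict:
    """Run the pipeline once per question and gather three parallel lists."""
    collected = {"retrieved_documents": [], "contexts": [], "predicted_answers": []}
    for question in questions:
        out = run_rag(question)
        collected["retrieved_documents"].append(out["documents"])  # retrieval metrics
        collected["contexts"].append(out["contexts"])              # faithfulness, relevance
        collected["predicted_answers"].append(out["answer"])       # answer similarity
    return collected


collected = collect_outputs(["q1", "q2"], fake_rag)
```

Keeping the three lists index-aligned with the question list is what lets the evaluators pair each prediction with its ground truth.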

Step 3: Configure Evaluation Pipeline

Build an evaluation pipeline with multiple independent evaluator components. Each evaluator computes a specific metric and operates on different input combinations.

Evaluators included:

DocumentMRREvaluator: Mean Reciprocal Rank of retrieved documents
DocumentMAPEvaluator: Mean Average Precision of retrieved documents
DocumentRecallEvaluator: Recall in both single-hit and multi-hit modes
SASEvaluator: Semantic Answer Similarity between predicted and ground truth answers
FaithfulnessEvaluator: Whether the answer is faithful to the provided context (LLM-based)
ContextRelevanceEvaluator: Whether retrieved contexts are relevant to the question (LLM-based)
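To make the retrieval metrics concrete, here is a plain-Python sketch of reciprocal rank, average precision, and the two recall modes for a single question. It is an illustration under stated assumptions, not Haystack's implementation: documents are compared by name, and average precision is normalized by the number of relevant documents actually retrieved (other definitions divide by the total number of relevant documents). The function names are hypothetical.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Mean of precision@k over the ranks k where a relevant document appears."""
    hits, precisions = 0, []
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def recall(retrieved: list[str], relevant: set[str], multi_hit: bool) -> float:
    """Single-hit: 1.0 if any relevant doc was retrieved. Multi-hit: fraction found."""
    found = relevant & set(retrieved)
    if multi_hit:
        return len(found) / len(relevant) if relevant else 0.0
    return 1.0 if found else 0.0


retrieved = ["d3", "d1", "d4", "d2"]   # ranked retriever output
relevant = {"d1", "d2", "d5"}          # ground truth for this question

rr = reciprocal_rank(retrieved, relevant)              # first hit at rank 2
ap = average_precision(retrieved, relevant)            # hits at ranks 2 and 4
r_single = recall(retrieved, relevant, multi_hit=False)
r_multi = recall(retrieved, relevant, multi_hit=True)  # 2 of 3 relevant docs found
```

MRR and MAP are the means of these per-question values over the whole evaluation set.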

Step 4: Run Evaluation Pipeline

Execute the evaluation pipeline with the collected inputs mapped to each evaluator's expected inputs: ground truth documents, retrieved documents, questions, contexts, predicted answers, and ground truth answers.

Key considerations:

Each evaluator receives only the inputs it needs
LLM-based evaluators (Faithfulness, ContextRelevance) require an OpenAI API key in their default configuration
SASEvaluator uses a sentence-transformers model for scoring
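The input routing can be sketched as a nested dictionary keyed by component name. The component names (`doc_mrr`, `sas`, and so on) and the `build_eval_inputs` helper are hypothetical. The inner keys reflect the evaluator input names as understood from Haystack 2.x and should be checked against the installed version. In a real run the document lists would hold Haystack `Document` objects rather than strings.

```python
def build_eval_inputs(questions, contexts, predicted_answers,
                      ground_truth_answers, ground_truth_documents,
                      retrieved_documents) -> dict:
    """Give each evaluator only the inputs it needs (names are assumptions)."""
    docs = {
        "ground_truth_documents": ground_truth_documents,
        "retrieved_documents": retrieved_documents,
    }
    return {
        # Retrieval metrics compare retrieved documents against ground truth.
        "doc_mrr": dict(docs),
        "doc_map": dict(docs),
        "doc_recall_single": dict(docs),
        "doc_recall_multi": dict(docs),
        # Answer similarity compares predicted and expected answers.
        "sas": {
            "ground_truth_answers": ground_truth_answers,
            "predicted_answers": predicted_answers,
        },
        # LLM-based metrics work from questions and contexts.
        "faithfulness": {
            "questions": questions,
            "contexts": contexts,
            "predicted_answers": predicted_answers,
        },
        "context_relevance": {"questions": questions, "contexts": contexts},
    }


eval_inputs = build_eval_inputs(
    questions=["q1"],
    contexts=[["text of doc_a"]],
    predicted_answers=["predicted"],
    ground_truth_answers=["expected"],
    ground_truth_documents=[["doc_a"]],
    retrieved_documents=[["doc_a", "doc_b"]],
)
# With Haystack installed, a dict of this shape would be passed to the
# evaluation pipeline's run() method.
```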

Step 5: Generate Evaluation Reports

Construct an EvaluationRunResult from the evaluation outputs and generate aggregated and detailed reports. The aggregated report provides per-metric scores; the detailed report shows per-question breakdowns.

Key considerations:

aggregated_report() returns overall metric scores
detailed_report() returns per-question metric values alongside inputs
Each report includes all seven configured metrics (the six evaluators listed in Step 3, with DocumentRecallEvaluator counted once per mode)
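The difference between the two report types can be shown with a small plain-Python sketch. It does not reproduce EvaluationRunResult. The `aggregated` and `detailed` helpers, the metric names, and the scores are hypothetical, and it assumes each metric yields one score per question.

```python
from statistics import mean

questions = ["q1", "q2", "q3"]

# Hypothetical per-question scores, one list per metric.
individual_scores = {
    "doc_mrr": [1.0, 0.5, 0.0],
    "faithfulness": [1.0, 1.0, 0.5],
}


def aggregated(scores: dict) -> dict:
    """One overall value per metric: the mean of its per-question scores."""
    return {metric: mean(values) for metric, values in scores.items()}


def detailed(questions: list[str], scores: dict) -> list[dict]:
    """One row per question, holding the input and every metric value."""
    rows = []
    for i, question in enumerate(questions):
        row = {"question": question}
        for metric, values in scores.items():
            row[metric] = values[i]
        rows.append(row)
    return rows


summary = aggregated(individual_scores)
rows = detailed(questions, individual_scores)
```

The aggregated view is useful for tracking a baseline over time, while the per-question rows show which questions drag a metric down.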

Step 6: Compare Pipeline Configurations

Optionally run a second RAG pipeline variant (e.g., different top_k) through the same evaluation process and use comparative_detailed_report() to generate a side-by-side comparison of both configurations.

Key considerations:

Both evaluation runs must use the same questions and ground truth
Comparative report prefixes metric names with run names for disambiguation
The side-by-side scores show, per question, which configuration performs better, so optimization decisions rest on measured differences
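A side-by-side comparison can be sketched as follows. This is an illustration of the idea, not the output format of comparative_detailed_report(): the `comparative` helper, the run names, the `<run>_<metric>` column naming, and the scores are all assumptions made for this example.

```python
def comparative(questions: list[str],
                run_a: str, scores_a: dict,
                run_b: str, scores_b: dict) -> list[dict]:
    """Merge two runs over the same questions, prefixing metrics with the run name."""
    for values in list(scores_a.values()) + list(scores_b.values()):
        if len(values) != len(questions):
            raise ValueError("both runs must score the same set of questions")
    rows = []
    for i, question in enumerate(questions):
        row = {"question": question}
        for metric, values in scores_a.items():
            row[f"{run_a}_{metric}"] = values[i]
        for metric, values in scores_b.items():
            row[f"{run_b}_{metric}"] = values[i]
        rows.append(row)
    return rows


questions = ["q1", "q2"]
rows = comparative(
    questions,
    "top_k_1", {"doc_mrr": [1.0, 0.0]},
    "top_k_3", {"doc_mrr": [1.0, 0.5]},
)

# Questions where the second configuration scores strictly higher.
improved = [r["question"] for r in rows if r["top_k_3_doc_mrr"] > r["top_k_1_doc_mrr"]]
```

The length check enforces the requirement that both runs use the same questions. Without it, rows would pair scores from different questions.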
