Workflow:Deepset ai Haystack RAG Evaluation Pipeline

From Leeroopedia
Revision as of 11:00, 16 February 2026 by Admin (Auto-imported from workflows/Deepset_ai_Haystack_RAG_Evaluation_Pipeline.md)
Domains: LLMs, RAG, Evaluation, MLOps
Last Updated: 2026-02-11 20:00 GMT

Overview

End-to-end process for evaluating RAG pipeline quality using multiple retrieval and generation metrics, with support for comparative analysis across pipeline configurations.

Description

This workflow implements a comprehensive evaluation framework for RAG pipelines in Haystack. It runs a RAG pipeline against a set of evaluation questions with known ground truth answers and source documents, then feeds the results through an evaluation pipeline containing multiple evaluators. Metrics cover retrieval quality (Document MRR, MAP, Recall), answer quality (Semantic Answer Similarity, Faithfulness), and context quality (Context Relevance). The EvaluationRunResult class provides aggregated and detailed reports, with support for comparing two pipeline configurations side by side.

Usage

Execute this workflow when you need to measure and compare RAG pipeline quality. Typical triggers include selecting between retrieval strategies (e.g., different top_k values, embedding models), validating pipeline changes before deployment, establishing quality baselines, or performing A/B testing of pipeline configurations.

Execution Steps

Step 1: Prepare Evaluation Dataset

Assemble a set of evaluation questions, each with a ground truth answer and a list of ground truth source document identifiers. Documents are loaded and indexed into a document store using the standard indexing pipeline.

Key considerations:

  • Each evaluation question needs: question text, expected answer, ground truth document names
  • Documents are indexed with embeddings using SentenceTransformersDocumentEmbedder
  • Use DuplicatePolicy.SKIP to avoid re-indexing existing documents
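
The per-question record described above can be sketched as a small data structure. This is a minimal sketch: the class name `EvalExample`, its field names, the helper `validate_examples`, and the sample values are illustrative assumptions, not names taken from the workflow.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalExample:
    """One evaluation question with its ground truth (hypothetical structure)."""

    question: str
    ground_truth_answer: str
    ground_truth_doc_names: list[str] = field(default_factory=list)


def validate_examples(examples: list[EvalExample]) -> None:
    """Raise ValueError if any example lacks one of the three required parts."""
    for i, ex in enumerate(examples):
        if not ex.question.strip():
            raise ValueError(f"example {i}: empty question")
        if not ex.ground_truth_answer.strip():
            raise ValueError(f"example {i}: empty ground truth answer")
        if not ex.ground_truth_doc_names:
            raise ValueError(f"example {i}: no ground truth documents")


# Made-up sample data, for illustration only.
examples = [
    EvalExample(
        question="What is the capital of France?",
        ground_truth_answer="Paris",
        ground_truth_doc_names=["france.txt"],
    ),
]
validate_examples(examples)
```

Validating the dataset up front keeps a missing answer or an empty document list from surfacing later as a confusing metric value.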

Step 2: Run RAG Pipeline on Evaluation Questions

Execute the RAG pipeline for each evaluation question and collect the predicted answers, retrieved documents, and context passages. This produces the raw data needed for metric computation.

Key considerations:

  • Capture retrieved documents (for retrieval metrics)
  • Capture context content (for faithfulness and relevance metrics)
  • Capture predicted answer text (for answer similarity metrics)
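
The collection loop can be sketched independently of any particular pipeline. The example below assumes a hypothetical callable `run_rag` that wraps the real RAG pipeline and returns the three captured items; `RagOutput`, `CollectedRun`, `collect`, and `fake_rag` are illustrative names, and `fake_rag` returns canned data in place of a real pipeline call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RagOutput:
    """What one RAG run returns for a single question (hypothetical shape)."""

    answer: str
    retrieved_doc_names: list[str]
    contexts: list[str]


@dataclass
class CollectedRun:
    """Column-wise storage: index i in every list refers to question i."""

    questions: list[str] = field(default_factory=list)
    predicted_answers: list[str] = field(default_factory=list)
    retrieved_doc_names: list[list[str]] = field(default_factory=list)
    contexts: list[list[str]] = field(default_factory=list)


def collect(questions: list[str], run_rag: Callable[[str], RagOutput]) -> CollectedRun:
    """Run the pipeline once per question and keep all outputs aligned by index."""
    run = CollectedRun()
    for question in questions:
        out = run_rag(question)
        run.questions.append(question)
        run.predicted_answers.append(out.answer)
        run.retrieved_doc_names.append(out.retrieved_doc_names)
        run.contexts.append(out.contexts)
    return run


def fake_rag(question: str) -> RagOutput:
    """Stand-in for a real pipeline call; returns canned data."""
    return RagOutput(
        answer="Paris",
        retrieved_doc_names=["france.txt"],
        contexts=["Paris is the capital of France."],
    )


collected = collect(["What is the capital of France?"], fake_rag)
```

Keeping the outputs as parallel lists matches the list-per-field inputs that the evaluators consume in the later steps.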

Step 3: Configure Evaluation Pipeline

Build an evaluation pipeline with multiple independent evaluator components. Each evaluator computes a specific metric and operates on different input combinations.

Evaluators included:

  • DocumentMRREvaluator: Mean Reciprocal Rank of retrieved documents
  • DocumentMAPEvaluator: Mean Average Precision of retrieved documents
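  • DocumentRecallEvaluator: Recall, reported in both single-hit mode (at least one ground truth document retrieved) and multi-hit mode (fraction of ground truth documents retrieved)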
  • SASEvaluator: Semantic Answer Similarity between predicted and ground truth answers
  • FaithfulnessEvaluator: Whether the answer is faithful to the provided context (LLM-based)
  • ContextRelevanceEvaluator: Whether retrieved contexts are relevant to the question (LLM-based)
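
A possible way to assemble these evaluators is sketched below. It assumes Haystack 2.x, where the evaluators live under `haystack.components.evaluators` and `RecallMode` under `haystack.components.evaluators.document_recall`; verify both against the installed version. The function name `build_eval_pipeline`, the component names, and the SAS model name are assumptions chosen for illustration. Constructing the LLM-based evaluators with their defaults is expected to need an OpenAI API key in the environment.

```python
def build_eval_pipeline(sas_model: str = "sentence-transformers/all-MiniLM-L6-v2"):
    """Return a Haystack Pipeline holding one component per metric."""
    # Imported inside the function so this module loads even where
    # Haystack is not installed.
    from haystack import Pipeline
    from haystack.components.evaluators import (
        ContextRelevanceEvaluator,
        DocumentMAPEvaluator,
        DocumentMRREvaluator,
        DocumentRecallEvaluator,
        FaithfulnessEvaluator,
        SASEvaluator,
    )
    from haystack.components.evaluators.document_recall import RecallMode

    pipeline = Pipeline()
    # Retrieval metrics.
    pipeline.add_component("doc_mrr", DocumentMRREvaluator())
    pipeline.add_component("doc_map", DocumentMAPEvaluator())
    # Recall appears twice, once per mode.
    pipeline.add_component(
        "doc_recall_single_hit", DocumentRecallEvaluator(mode=RecallMode.SINGLE_HIT)
    )
    pipeline.add_component(
        "doc_recall_multi_hit", DocumentRecallEvaluator(mode=RecallMode.MULTI_HIT)
    )
    # Answer and context metrics.
    pipeline.add_component("sas", SASEvaluator(model=sas_model))
    pipeline.add_component("faithfulness", FaithfulnessEvaluator())
    pipeline.add_component("context_relevance", ContextRelevanceEvaluator())
    # No connections are added: the evaluators are independent of each other.
    return pipeline
```

Because the evaluators do not feed into one another, the pipeline here serves only as a container that runs them together.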

Step 4: Run Evaluation Pipeline

Execute the evaluation pipeline with the collected inputs mapped to each evaluator's expected inputs: ground truth documents, retrieved documents, questions, contexts, predicted answers, and ground truth answers.

Key considerations:

  • Each evaluator receives only the inputs it needs
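  • LLM-based evaluators (Faithfulness, ContextRelevance) call an LLM; with the default OpenAI backend they require an OpenAI API key (read from the OPENAI_API_KEY environment variable)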
  • SASEvaluator uses a sentence-transformers model for scoring
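
The mapping from collected data to per-evaluator inputs can be sketched as a plain function. The component names used as dictionary keys (`doc_mrr`, `sas`, and so on) and the helper name `build_eval_inputs` are assumptions. The input names (`ground_truth_documents`, `retrieved_documents`, `questions`, `contexts`, `predicted_answers`, `ground_truth_answers`) follow the Haystack 2.x evaluator signatures as understood here and should be checked against the installed version.

```python
from typing import Any


def build_eval_inputs(
    questions: list[str],
    contexts: list[list[str]],
    predicted_answers: list[str],
    ground_truth_answers: list[str],
    retrieved_documents: list[list[Any]],
    ground_truth_documents: list[list[Any]],
) -> dict[str, dict[str, Any]]:
    """Route each collected column to the evaluators that need it."""
    columns = {
        "contexts": contexts,
        "predicted_answers": predicted_answers,
        "ground_truth_answers": ground_truth_answers,
        "retrieved_documents": retrieved_documents,
        "ground_truth_documents": ground_truth_documents,
    }
    # Every column must have one entry per question.
    for name, column in columns.items():
        if len(column) != len(questions):
            raise ValueError(f"{name} has {len(column)} entries, expected {len(questions)}")

    def retrieval() -> dict[str, Any]:
        return {
            "ground_truth_documents": ground_truth_documents,
            "retrieved_documents": retrieved_documents,
        }

    return {
        "doc_mrr": retrieval(),
        "doc_map": retrieval(),
        "doc_recall_single_hit": retrieval(),
        "doc_recall_multi_hit": retrieval(),
        "sas": {
            "predicted_answers": predicted_answers,
            "ground_truth_answers": ground_truth_answers,
        },
        "faithfulness": {
            "questions": questions,
            "contexts": contexts,
            "predicted_answers": predicted_answers,
        },
        "context_relevance": {"questions": questions, "contexts": contexts},
    }
```

With an evaluation pipeline whose components carry these names, the call would look like `results = eval_pipeline.run(build_eval_inputs(...))`, where `eval_pipeline` is a hypothetical variable holding that pipeline.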

Step 5: Generate Evaluation Reports

Construct an EvaluationRunResult from the evaluation outputs and generate aggregated and detailed reports. The aggregated report provides per-metric scores; the detailed report shows per-question breakdowns.

Key considerations:

  • aggregated_report() returns overall metric scores
  • detailed_report() returns per-question metric values alongside inputs
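  • Each report includes all seven configured metrics (six evaluator types, with DocumentRecallEvaluator contributing both a single-hit and a multi-hit score)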
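
Report generation can be sketched as follows, assuming Haystack 2.x exposes `EvaluationRunResult` under `haystack.evaluation` with `run_name`, `inputs`, and `results` arguments, and that each evaluator output carries `score` and `individual_scores` keys. The helper names `check_report_data` and `build_reports` are assumptions for illustration, and the numbers in the sample data are made up.

```python
from typing import Any


def check_report_data(inputs: dict[str, list], results: dict[str, dict[str, Any]]) -> int:
    """Check that every input column and every metric covers the same questions.

    Returns the number of questions, or raises ValueError on a mismatch.
    """
    lengths = {name: len(column) for name, column in inputs.items()}
    for metric, output in results.items():
        lengths[metric] = len(output["individual_scores"])
    if len(set(lengths.values())) > 1:
        raise ValueError(f"inconsistent lengths: {lengths}")
    return next(iter(lengths.values()), 0)


def build_reports(run_name: str, inputs: dict[str, list], results: dict[str, dict[str, Any]]):
    """Wrap evaluation outputs and return (run, aggregated report, detailed report)."""
    # Imported inside the function so this module loads without Haystack installed.
    from haystack.evaluation import EvaluationRunResult

    check_report_data(inputs, results)
    run = EvaluationRunResult(run_name=run_name, inputs=inputs, results=results)
    return run, run.aggregated_report(), run.detailed_report()


# Made-up data showing the expected shapes.
sample_inputs = {
    "question": ["q1", "q2"],
    "predicted_answer": ["a1", "a2"],
}
sample_results = {
    "doc_mrr": {"score": 0.75, "individual_scores": [1.0, 0.5]},
}
num_questions = check_report_data(sample_inputs, sample_results)
```

The length check is a cheap guard: a detailed report lines up inputs and per-question scores row by row, so a missing entry is easier to diagnose before the report is built.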

Step 6: Compare Pipeline Configurations

Optionally run a second RAG pipeline variant (e.g., different top_k) through the same evaluation process and use comparative_detailed_report() to generate a side-by-side comparison of both configurations.

Key considerations:

  • Both evaluation runs must use the same questions and ground truth
  • Comparative report prefixes metric names with run names for disambiguation
  • Enables data-driven pipeline optimization decisions
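
A comparison between two runs can be sketched as below. `compare_runs` assumes both arguments are Haystack `EvaluationRunResult` objects and only forwards to `comparative_detailed_report()`. The helpers `assert_same_questions` and `score_deltas` are hypothetical additions that operate on plain dictionaries, and the `question` key and the scores in the example are made up.

```python
def assert_same_questions(inputs_a: dict, inputs_b: dict, key: str = "question") -> None:
    """Raise ValueError unless both runs were evaluated on identical questions."""
    if inputs_a[key] != inputs_b[key]:
        raise ValueError("runs were evaluated on different questions")


def score_deltas(scores_a: dict[str, float], scores_b: dict[str, float]) -> dict[str, float]:
    """Per-metric difference (run B minus run A) for metrics present in both runs."""
    return {m: scores_b[m] - scores_a[m] for m in scores_a if m in scores_b}


def compare_runs(run_a, run_b):
    """Side-by-side per-question report; both arguments are assumed to be
    EvaluationRunResult objects built from the same questions."""
    return run_a.comparative_detailed_report(run_b)


# Made-up aggregate scores for two configurations, for example two top_k values.
deltas = score_deltas(
    {"doc_mrr": 0.5, "sas": 0.75},
    {"doc_mrr": 0.75, "sas": 0.75},
)
```

A positive delta means the second configuration scored higher on that metric. Whether a difference is meaningful still depends on the size of the evaluation set.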
