Workflow:Vibrantlabsai Ragas RAG Evaluation

Knowledge Sources	Ragas, Ragas Docs, RAG Eval Tutorial
Domains	LLM_Ops, RAG, Evaluation
Last Updated	2026-02-12 10:00 GMT

Overview

End-to-end process for evaluating Retrieval-Augmented Generation (RAG) systems using Ragas metrics to measure retrieval quality, faithfulness, and answer correctness.

Description

This workflow covers the standard procedure for evaluating RAG pipelines with the Ragas evaluation toolkit. It measures how well a RAG system retrieves relevant context, whether the generated response is faithful to that context, and whether the final answer is factually correct. The process collects query-response-context triplets from the RAG system and scores them using both LLM-based and embedding-based metrics. The evaluation supports multiple metrics including Faithfulness, ContextPrecision, ContextRecall, AnswerRelevancy, and FactualCorrectness.

Key outputs:

Per-sample scores for each metric
Aggregate scores across the evaluation dataset
An EvaluationResult object with exportable results

Usage

Execute this workflow when you have a functioning RAG system and need to quantify its performance across retrieval and generation quality dimensions. This is appropriate when you have a set of test queries with expected answers and want to assess how well your retrieval pipeline ranks relevant documents and how faithfully the LLM generates answers from retrieved context.

Execution Steps

Step 1: Prepare_Evaluation_Dataset

Collect test samples containing queries, expected answers, retrieved contexts, and generated responses from the RAG system. Each sample is structured as a SingleTurnSample with fields for user_input, response, retrieved_contexts, and reference. These samples are assembled into an EvaluationDataset.

Key considerations:

Each sample must contain the fields required by the chosen metrics
Faithfulness requires user_input, response, and retrieved_contexts
ContextPrecision requires user_input, retrieved_contexts, and reference
AnswerRelevancy requires user_input and response
FactualCorrectness requires response and reference
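
The field requirements above can be checked before any metric runs. The following is a minimal sketch in plain Python, assuming samples are held as dictionaries keyed by the SingleTurnSample field names. The REQUIRED_FIELDS table covers only three of the metrics listed above, and missing_fields is a hypothetical helper, not part of Ragas. The sample data is invented for illustration.

```python
# Required fields per metric, transcribed from the list above (subset only).
REQUIRED_FIELDS = {
    "ContextPrecision": {"user_input", "retrieved_contexts", "reference"},
    "AnswerRelevancy": {"user_input", "response"},
    "FactualCorrectness": {"response", "reference"},
}


def missing_fields(sample, metric_names):
    """Return {metric: sorted missing field names} for one sample (hypothetical helper)."""
    # A field counts as present only if it exists and is non-empty.
    present = {key for key, value in sample.items() if value}
    report = {}
    for name in metric_names:
        missing = REQUIRED_FIELDS[name] - present
        if missing:
            report[name] = sorted(missing)
    return report


samples = [
    {
        "user_input": "When was the Eiffel Tower completed?",
        "response": "It was completed in 1889.",
        "retrieved_contexts": ["The Eiffel Tower was completed in 1889."],
        "reference": "1889",
    },
    {
        # No reference answer, so reference-based metrics cannot score this one.
        "user_input": "Who built it?",
        "response": "Gustave Eiffel's company.",
        "retrieved_contexts": ["The tower was built by Gustave Eiffel's firm."],
    },
]

chosen = ["ContextPrecision", "AnswerRelevancy", "FactualCorrectness"]
problems = {i: missing_fields(s, chosen) for i, s in enumerate(samples)}
```

Here the second sample is reported as unusable for the two reference-based metrics. Samples that pass the check can then be assembled into an EvaluationDataset; recent Ragas releases appear to accept a list of such dictionaries via EvaluationDataset.from_list, though the constructor should be confirmed against the installed version.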

Step 2: Configure_LLM_And_Embeddings

Initialize the LLM and embedding model that Ragas will use as the evaluator (judge). This is separate from the LLM used in the RAG system itself. Ragas supports multiple providers including OpenAI, Anthropic, Google, and others through its llm_factory and embedding abstractions. The evaluator LLM is used by metrics that require LLM reasoning (e.g., Faithfulness, ContextPrecision), while the embedding model is used by metrics that compute semantic similarity (e.g., AnswerRelevancy, SemanticSimilarity).

Key considerations:

If no LLM is specified, Ragas defaults to OpenAI gpt-4o-mini
The evaluator LLM must be capable enough to judge quality accurately; a weak judge produces unreliable scores
LangChain LLM wrappers can be used via LangchainLLMWrapper
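
One way to set up the evaluator is through the LangChain wrappers mentioned above. This is a sketch under stated assumptions: the ragas and langchain-openai packages are installed, an OPENAI_API_KEY is set in the environment, and the model name is only a placeholder. The build_evaluator function is a hypothetical helper, and the wrapper import paths should be checked against the installed Ragas version.

```python
EVALUATOR_MODEL = "gpt-4o-mini"  # assumption: swap in any sufficiently capable chat model


def build_evaluator(model=EVALUATOR_MODEL):
    """Return (llm, embeddings) wrapped for Ragas.

    Requires ragas, langchain-openai, and an OPENAI_API_KEY in the environment.
    """
    # Imports are local so this module loads even without the optional packages.
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings
    from ragas.embeddings import LangchainEmbeddingsWrapper
    from ragas.llms import LangchainLLMWrapper

    # The judge model is configured independently of the RAG system's own LLM.
    llm = LangchainLLMWrapper(ChatOpenAI(model=model, temperature=0))
    embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
    return llm, embeddings
```

Keeping the evaluator in its own function makes it easy to change the judge model without touching the RAG system under test.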

Step 3: Select_Metrics

Choose which evaluation metrics to run against the dataset. Ragas provides pre-built metrics for RAG evaluation organized into retrieval metrics (ContextPrecision, ContextRecall, ContextEntityRecall) and generation metrics (Faithfulness, FactualCorrectness, AnswerRelevancy, AnswerCorrectness). If no metrics are specified, Ragas defaults to answer_relevancy, context_precision, faithfulness, and context_recall.

Key considerations:

Select metrics that match your evaluation goals
Ensure dataset columns satisfy each metric's required fields
Metrics can be combined freely in a single evaluation run

Step 4: Run_Evaluation

Call the evaluate() or aevaluate() function with the dataset, metrics, LLM, and embeddings. The Executor manages concurrent async evaluation with configurable concurrency limits, timeouts, and retries via RunConfig. Each metric scores every sample in the dataset, producing per-sample and aggregate results.
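
A call to evaluate() might look like the sketch below. It assumes the ragas package is installed and that the dataset, LLM, and embeddings were prepared in the earlier steps. The run_evaluation wrapper is a hypothetical helper, the three metrics are an arbitrary selection, and the RunConfig values are illustrative rather than recommended settings.

```python
def run_evaluation(dataset, llm, embeddings):
    """Score `dataset` with three metrics and return the result. Requires ragas."""
    # Imports are local so this module loads even without ragas installed.
    from ragas import evaluate
    from ragas.metrics import ContextRecall, FactualCorrectness, Faithfulness
    from ragas.run_config import RunConfig

    # Assumed limits: tune these to the rate limits of the evaluator's provider.
    run_config = RunConfig(max_workers=4, timeout=120, max_retries=5)

    return evaluate(
        dataset=dataset,
        metrics=[Faithfulness(), ContextRecall(), FactualCorrectness()],
        llm=llm,
        embeddings=embeddings,
        run_config=run_config,
    )
```

Lowering max_workers is usually the first adjustment to try when the provider returns rate-limit errors.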

What happens:

Metrics are initialized with the provided LLM and embeddings
An Executor creates async tasks for each (sample, metric) pair
Tasks run concurrently with semaphore-based rate limiting
Results are collected into an EvaluationResult object
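
The concurrency pattern described above can be illustrated with the standard library alone. This is not the Ragas Executor's code: it is a simplified sketch in which score is a stand-in for a real metric call, the names run_all and guarded are hypothetical, and timeouts and retries are left out.

```python
import asyncio


async def score(sample, metric):
    """Stand-in for one metric call; a real metric would await an LLM here."""
    await asyncio.sleep(0)
    return float(len(sample["response"]) > 0)


async def run_all(samples, metrics, max_concurrency=2):
    """Score every (sample, metric) pair with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(index, name):
        # At most max_concurrency tasks are inside this block at once.
        async with semaphore:
            return index, name, await score(samples[index], name)

    # One task per (sample, metric) pair.
    tasks = [guarded(i, m) for i in range(len(samples)) for m in metrics]

    # Collect the scores into one row per sample.
    rows = [{} for _ in samples]
    for index, name, value in await asyncio.gather(*tasks):
        rows[index][name] = value
    return rows


samples = [{"response": "Paris"}, {"response": ""}]
rows = asyncio.run(run_all(samples, ["faithfulness", "answer_relevancy"]))
```

The semaphore caps how many metric calls are in flight at once, which is what keeps a large dataset from overwhelming the evaluator LLM's rate limits.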

Step 5: Analyze_Results

Inspect the EvaluationResult to understand RAG system performance. Results can be exported to a pandas DataFrame for analysis, converted to a HuggingFace Dataset for sharing, or saved to CSV. Per-sample scores help identify specific failure cases, while aggregate scores provide overall quality metrics.

Key considerations:

Low Faithfulness scores indicate hallucination: the response makes claims that the retrieved context does not support
Low ContextPrecision suggests retrieval ranking problems
Low ContextRecall means relevant information is not being retrieved
Results can be tracked over time using the experiment framework
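
The split between aggregate and per-sample analysis can be sketched as follows. The scores are invented, the rows are plain dictionaries shaped like an exported results table, and the 0.5 threshold is an arbitrary assumption for illustration.

```python
from statistics import mean

# Hypothetical per-sample scores, shaped like rows of an exported results table.
rows = [
    {"user_input": "q1", "faithfulness": 0.95, "context_recall": 1.0},
    {"user_input": "q2", "faithfulness": 0.40, "context_recall": 0.5},
    {"user_input": "q3", "faithfulness": 0.90, "context_recall": 0.0},
]
metric_names = ["faithfulness", "context_recall"]
THRESHOLD = 0.5  # assumed cut-off for "needs review"

# Aggregate view: mean score per metric across the dataset.
aggregate = {m: mean(row[m] for row in rows) for m in metric_names}

# Per-sample view: queries whose score falls below the threshold, per metric.
failures = {
    m: [row["user_input"] for row in rows if row[m] < THRESHOLD]
    for m in metric_names
}
```

In this toy data the aggregate faithfulness looks acceptable, yet the per-sample view still singles out q2 as a likely hallucination and q3 as a retrieval miss. With Ragas itself, rows of this shape would come from the DataFrame export described above.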
