Workflow:Explodinggradients Ragas RAG Evaluation

Knowledge Sources	Ragas, Ragas Docs, RAG Eval Tutorial
Domains	LLMs, RAG, Evaluation
Last Updated	2026-02-10 06:00 GMT

Overview

End-to-end process for evaluating a Retrieval-Augmented Generation (RAG) system using Ragas custom metrics and the experiment framework.

Description

This workflow outlines the standard procedure for evaluating a RAG application's quality using the Ragas toolkit. It covers building a simple RAG system with retrieval and generation components, preparing an evaluation dataset with questions and grading criteria, defining custom evaluation metrics (discrete or numeric), and running experiments to score the RAG system's outputs. The process uses the modern @experiment decorator pattern for orchestrating evaluations, Dataset for data management with pluggable storage backends, and DiscreteMetric/NumericMetric for flexible scoring. Results are automatically persisted and can be compared across experiment runs.

Usage

Execute this workflow when you have a RAG application (retrieval + LLM generation) and need to evaluate its response quality systematically. It assumes you already have a set of test questions with expected answers or grading criteria, and that you want consistent quality scores you can compare across runs. This is the primary "golden path" for any team building RAG-based Q&A, search, or information retrieval systems.

Execution Steps

Step 1: Build or Configure the RAG System

Set up the RAG pipeline that will be evaluated. This involves configuring a retrieval component (keyword search, vector search, or hybrid) and a generation component (LLM). The RAG system should accept a question as input and return a response along with retrieved context documents. Optionally, enable trace logging to capture retrieval and generation details for debugging.

Key considerations:

The RAG system must expose a callable interface that the experiment can invoke
Trace logging (JSON format) helps diagnose retrieval and generation failures
Both naive (single-shot retrieval) and agentic (iterative retrieval) modes can be evaluated
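As a rough illustration of the callable interface described above, the following standard-library sketch pairs a keyword retriever with a placeholder generator and records a JSON-serializable trace. Every name in it (DOCUMENTS, RAGResult, retrieve, generate, query) is hypothetical and chosen for this example only; a real system would call a search index and an LLM instead.

```python
import json
import re
from dataclasses import dataclass, field

# Toy corpus (assumption); a real system would query an index or vector store.
DOCUMENTS = [
    "Ragas is a toolkit for evaluating LLM applications.",
    "Retrieval-Augmented Generation combines a retriever with an LLM.",
    "BLEU and ROUGE are traditional NLP overlap metrics.",
]


@dataclass
class RAGResult:
    response: str
    contexts: list[str]
    trace: list[dict] = field(default_factory=list)


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Keyword search: rank documents by the number of shared tokens."""
    words = tokenize(question)
    scored = [(len(words & tokenize(doc)), doc) for doc in DOCUMENTS]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:top_k] if score > 0]


def generate(question: str, contexts: list[str]) -> str:
    """Placeholder for the LLM call: return the top context, if any."""
    return contexts[0] if contexts else "I don't know."


def query(question: str) -> RAGResult:
    """Entry point an experiment can invoke once per dataset row."""
    contexts = retrieve(question)
    response = generate(question, contexts)
    trace = [
        {"step": "retrieve", "question": question, "num_contexts": len(contexts)},
        {"step": "generate", "response": response},
    ]
    return RAGResult(response=response, contexts=contexts, trace=trace)


result = query("What is Ragas?")
print(json.dumps(result.trace, indent=2))  # JSON trace for debugging
```

Returning the retrieved contexts alongside the response keeps the later scoring step able to judge retrieval and generation separately.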

Step 2: Prepare the Evaluation Dataset

Create or load a dataset containing test questions, expected answers or grading notes, and any additional metadata. The dataset is managed via the Ragas Dataset class with a pluggable storage backend (for example CSV, JSONL, Google Drive, or in-memory). Each row represents one evaluation sample with fields such as question, expected answer, and grading criteria.

Key considerations:

Use a Pydantic model to define the dataset schema for type safety
The Dataset class supports append, save, load, and train/test split operations
Backend options include local CSV, local JSONL, Google Drive, and in-memory storage
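The schema-plus-backend idea can be sketched without Ragas. The example below is only an approximation: a standard-library dataclass stands in for the Pydantic model, and an in-memory CSV string stands in for a local CSV backend. EvalSample, save_csv, and load_csv are hypothetical names, not part of the Ragas Dataset API.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class EvalSample:
    """One evaluation row; a Pydantic model would add runtime validation."""
    question: str
    expected_answer: str
    grading_notes: str = ""


samples = [
    EvalSample("What is Ragas?", "A toolkit for evaluating LLM applications.",
               "Must mention evaluation."),
    EvalSample("What does RAG combine?", "A retriever with an LLM."),
]


def save_csv(rows: list[EvalSample]) -> str:
    """Serialize samples to CSV text (stand-in for a CSV storage backend)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(EvalSample)])
    writer.writeheader()
    writer.writerows(asdict(row) for row in rows)
    return buffer.getvalue()


def load_csv(text: str) -> list[EvalSample]:
    """Parse CSV text back into typed samples."""
    return [EvalSample(**row) for row in csv.DictReader(io.StringIO(text))]


csv_text = save_csv(samples)
loaded = load_csv(csv_text)
print(loaded == samples)  # the round trip preserves every field
```

Keeping the schema in one typed class means a misspelled column name fails when a row is loaded rather than midway through an evaluation run.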

Step 3: Define Evaluation Metrics

Define one or more metrics to score the RAG system's outputs. Ragas provides both LLM-based metrics (DiscreteMetric, NumericMetric) and traditional NLP metrics (BLEU, ROUGE, semantic similarity). For RAG evaluation, common choices include correctness (pass/fail), faithfulness, context precision, and context recall.

Key considerations:

DiscreteMetric scores outputs into categorical buckets (e.g., "pass"/"fail")
NumericMetric scores outputs on a continuous range (e.g., 0.0 to 1.0)
Custom metrics can be created via the @discrete_metric or @numeric_metric decorator
LLM-based metrics require an LLM instance created via llm_factory()
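To make the discrete versus numeric distinction concrete, here is a small rule-based sketch. It deliberately avoids the Ragas classes and any LLM call: correctness and token_recall are hypothetical functions that use token overlap as a stand-in for an LLM judge, so the scoring logic is an assumption made for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def correctness(response: str, expected_answer: str) -> str:
    """Discrete metric: 'pass' if every expected token appears in the response."""
    return "pass" if tokens(expected_answer) <= tokens(response) else "fail"


def token_recall(response: str, expected_answer: str) -> float:
    """Numeric metric in [0.0, 1.0]: share of expected tokens in the response."""
    expected = tokens(expected_answer)
    if not expected:
        return 0.0
    return len(expected & tokens(response)) / len(expected)


print(correctness("Paris is the capital of France.", "Paris"))
print(token_recall("a retriever", "a retriever with an LLM"))
```

A discrete label is easier to aggregate into a pass rate, while a numeric score shows partial credit; which one fits depends on how the grading criteria are written.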

Step 4: Run the Experiment

Use the @experiment decorator to wrap the evaluation function, which processes each dataset row by invoking the RAG system and scoring the output. The experiment runner iterates over the dataset asynchronously, collecting predictions and metric scores into an Experiment result table.

Key considerations:

The @experiment decorator wraps both sync and async functions
Calling arun(dataset, name=...) on the decorated function runs the evaluation and persists results
Results are stored as an Experiment DataTable with the same backend as the dataset
Multiple experiments can be compared by loading results from different runs
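The decorator-plus-arun pattern can be approximated in plain asyncio. The sketch below is not the Ragas implementation: sketch_experiment, fake_rag, and evaluate_row are hypothetical names, results are returned as a dictionary rather than persisted, and the pass/fail rule is a simple substring check.

```python
import asyncio
from typing import Awaitable, Callable

Row = dict[str, str]


def sketch_experiment(func: Callable[[Row], Awaitable[dict]]):
    """Attach an `arun` helper that applies `func` to every row concurrently."""

    async def arun(dataset: list[Row], name: str) -> dict:
        # gather() returns results in the same order as the input rows.
        results = await asyncio.gather(*(func(row) for row in dataset))
        return {"name": name, "rows": list(results)}

    func.arun = arun
    return func


def fake_rag(question: str) -> str:
    """Placeholder for the RAG system under test."""
    return "Paris" if "France" in question else "I don't know."


@sketch_experiment
async def evaluate_row(row: Row) -> dict:
    """Invoke the system for one row and score its output."""
    response = fake_rag(row["question"])
    score = "pass" if row["expected_answer"].lower() in response.lower() else "fail"
    return {**row, "response": response, "correctness": score}


dataset = [
    {"question": "What is the capital of France?", "expected_answer": "Paris"},
    {"question": "What is the capital of Peru?", "expected_answer": "Lima"},
]

run = asyncio.run(evaluate_row.arun(dataset, name="baseline"))
for result in run["rows"]:
    print(result["question"], "->", result["correctness"])
```

Each result row carries the original inputs together with the response and score, which is what makes later per-sample comparison between named runs possible.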

Step 5: Analyze Results and Iterate

Review the experiment results to identify quality issues. Examine per-sample scores, aggregate statistics, and error patterns. Use the results to guide improvements to the RAG system (better retrieval, improved prompts, different LLMs) and re-run the experiment to measure progress.

Key considerations:

Results can be exported to pandas DataFrame for analysis
The experiment versioning system can create Git commits for reproducibility
Compare experiments across different configurations (models, retrieval strategies)
Use the CLI to display results in rich formatted tables with baseline comparison
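Comparing two runs usually means looking at both the aggregate and the per-sample changes. The following sketch uses made-up result rows (the baseline and candidate lists, plus the pass_rate and changed_samples helpers, are hypothetical) to show why the aggregate alone can hide regressions.

```python
from collections import Counter

# Illustrative results from two runs over the same three questions.
baseline = [
    {"question": "q1", "correctness": "pass"},
    {"question": "q2", "correctness": "fail"},
    {"question": "q3", "correctness": "pass"},
]
candidate = [
    {"question": "q1", "correctness": "pass"},
    {"question": "q2", "correctness": "pass"},
    {"question": "q3", "correctness": "fail"},
]


def pass_rate(rows: list[dict]) -> float:
    """Fraction of rows labelled 'pass' (0.0 for an empty run)."""
    counts = Counter(row["correctness"] for row in rows)
    return counts["pass"] / len(rows) if rows else 0.0


def changed_samples(before: list[dict], after: list[dict]) -> list[tuple]:
    """Return (question, old_label, new_label) for every label that changed."""
    old = {row["question"]: row["correctness"] for row in before}
    return [
        (row["question"], old[row["question"]], row["correctness"])
        for row in after
        if row["question"] in old and old[row["question"]] != row["correctness"]
    ]


print(pass_rate(baseline), pass_rate(candidate))
for question, old_label, new_label in changed_samples(baseline, candidate):
    print(f"{question}: {old_label} -> {new_label}")
```

In this made-up data both runs have the same pass rate, yet one sample improved and another regressed, which is the kind of pattern that only per-sample inspection reveals.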
