Workflow:Explodinggradients Ragas RAG Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RAG, Evaluation |
| Last Updated | 2026-02-10 06:00 GMT |
Overview
End-to-end process for evaluating a Retrieval-Augmented Generation (RAG) system using Ragas custom metrics and the experiment framework.
Description
This workflow describes the standard procedure for evaluating the quality of a RAG application with the Ragas toolkit. It covers four activities: building a simple RAG system with retrieval and generation components, preparing an evaluation dataset of questions and grading criteria, defining custom evaluation metrics (discrete or numeric), and running experiments that score the system's outputs. The process uses the @experiment decorator to orchestrate evaluations, Dataset for data management with pluggable storage backends, and DiscreteMetric/NumericMetric for flexible scoring. Results are persisted automatically and can be compared across experiment runs.
Usage
Execute this workflow when you have a RAG application (retrieval + LLM generation) and need to evaluate its response quality systematically. It assumes you already have a set of test questions with expected answers or grading criteria and want objective quality scores for the system's outputs. This is the primary "golden path" for any team building RAG-based Q&A, search, or information retrieval systems.
Execution Steps
Step 1: Build or Configure the RAG System
Set up the RAG pipeline that will be evaluated. This involves configuring a retrieval component (keyword search, vector search, or hybrid) and a generation component (LLM). The RAG system should accept a question as input and return a response along with retrieved context documents. Optionally, enable trace logging to capture retrieval and generation details for debugging.
Key considerations:
- The RAG system must expose a callable interface that the experiment can invoke
- Trace logging (JSON format) helps diagnose retrieval and generation failures
- Both naive (single-shot retrieval) and agentic (iterative retrieval) modes can be evaluated
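The callable interface and JSON trace logging described above can be sketched as follows. This is a minimal, library-agnostic illustration and not part of Ragas: the class name SimpleRAG, its methods, and the word-overlap retriever are all hypothetical, and the generation step is a stub standing in for a real LLM call.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SimpleRAG:
    """Toy RAG pipeline (hypothetical): keyword retrieval plus a stubbed generator."""

    documents: list[str]
    top_k: int = 2
    traces: list[dict] = field(default_factory=list)

    def retrieve(self, question: str) -> list[str]:
        # Rank documents by how many lowercase words they share with the question.
        q_words = set(question.lower().split())
        scored = [(len(q_words & set(d.lower().split())), d) for d in self.documents]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        # Keep at most top_k documents and drop those with no overlap at all.
        return [doc for score, doc in ranked[: self.top_k] if score > 0]

    def generate(self, question: str, contexts: list[str]) -> str:
        # Placeholder for an LLM call: echo the best context, or admit ignorance.
        return contexts[0] if contexts else "I don't know."

    def query(self, question: str) -> dict:
        # Single entry point an experiment can invoke: returns response and contexts.
        contexts = self.retrieve(question)
        result = {
            "question": question,
            "contexts": contexts,
            "response": self.generate(question, contexts),
        }
        self.traces.append(result)  # record the call for later debugging
        return result

    def dump_traces(self) -> str:
        # Serialize all recorded calls as JSON.
        return json.dumps(self.traces, indent=2)


rag = SimpleRAG(documents=["Ragas evaluates RAG systems.", "Paris is the capital of France."])
answer = rag.query("What is the capital of France?")
```

Returning the retrieved contexts alongside the response keeps retrieval failures distinguishable from generation failures when the traces are inspected.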
Step 2: Prepare the Evaluation Dataset
Create or load a dataset containing test questions, expected answers or grading notes, and any additional metadata. The dataset is managed via the Ragas Dataset class with a pluggable storage backend (for example local CSV, local JSONL, or in-memory). Each row represents one evaluation sample with fields such as question, expected answer, and grading criteria.
Key considerations:
- Use a Pydantic model to define the dataset schema for type safety
- The Dataset class supports append, save, load, and train/test split operations
- Backend options include local CSV, local JSONL, Google Drive, and in-memory storage
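The idea of a typed row schema backed by a file can be sketched without any dependencies. The example below is an assumption-laden stand-in: it uses a standard-library dataclass where the workflow recommends a Pydantic model, and the helper functions save_jsonl and load_jsonl are hypothetical, not the Ragas Dataset API.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class EvalSample:
    """One evaluation row (hypothetical schema)."""

    question: str
    grading_notes: str


def save_jsonl(samples: list[EvalSample], path: Path) -> None:
    # Write one JSON object per line.
    with path.open("w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(asdict(sample)) + "\n")


def load_jsonl(path: Path) -> list[EvalSample]:
    # Rebuild typed rows; a missing or unexpected field raises TypeError.
    with path.open(encoding="utf-8") as f:
        return [EvalSample(**json.loads(line)) for line in f if line.strip()]


samples = [
    EvalSample("What is the capital of France?", "paris"),
    EvalSample("What does Ragas evaluate?", "rag, evaluation"),
]
with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "eval.jsonl"
    save_jsonl(samples, dataset_path)
    reloaded = load_jsonl(dataset_path)
```

Validating rows at load time surfaces schema drift before an experiment starts rather than midway through a run.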
Step 3: Define Evaluation Metrics
Define one or more metrics to score the RAG system's outputs. Ragas provides both LLM-based metrics (DiscreteMetric, NumericMetric) and traditional NLP metrics (BLEU, ROUGE, semantic similarity). For RAG evaluation, common choices include correctness (pass/fail), faithfulness, context precision, and context recall.
Key considerations:
- DiscreteMetric assigns each output to one of a fixed set of categories (e.g., "pass"/"fail")
- NumericMetric scores outputs on a continuous range (e.g., 0.0 to 1.0)
- Custom metrics can be created via the @discrete_metric or @numeric_metric decorator
- LLM-based metrics require an LLM instance created via llm_factory()
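The difference between a discrete and a numeric metric can be illustrated with two small scoring functions. This sketch deliberately swaps the LLM judge for deterministic string checks so that it runs without model access; MetricResult, correctness, and token_recall are hypothetical names and do not reproduce the Ragas DiscreteMetric or NumericMetric classes.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Score plus a short justification (hypothetical container)."""

    value: object
    reason: str


def correctness(response: str, grading_notes: str) -> MetricResult:
    # Discrete metric: "pass" only if every comma-separated key point
    # from the grading notes appears in the response (case-insensitive).
    points = [p.strip().lower() for p in grading_notes.split(",") if p.strip()]
    missing = [p for p in points if p not in response.lower()]
    if missing:
        return MetricResult("fail", "missing: " + ", ".join(missing))
    return MetricResult("pass", "all key points present")


def token_recall(response: str, reference: str) -> MetricResult:
    # Numeric metric in [0.0, 1.0]: fraction of reference tokens found in the response.
    ref_tokens = set(reference.lower().split())
    if not ref_tokens:
        return MetricResult(0.0, "empty reference")
    hits = ref_tokens & set(response.lower().split())
    return MetricResult(len(hits) / len(ref_tokens), f"{len(hits)}/{len(ref_tokens)} tokens matched")


discrete = correctness("Paris is the capital of France.", "paris, capital")
numeric = token_recall("paris is nice", "paris france")
```

Returning a reason with each score makes individual failures easier to audit, which matters more once the check is replaced by an LLM judge.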
Step 4: Run the Experiment
Use the @experiment decorator to wrap the evaluation function, which processes each dataset row by invoking the RAG system and scoring the output. The experiment runner iterates over the dataset asynchronously, collecting predictions and metric scores into an Experiment result table.
Key considerations:
- The @experiment decorator wraps both sync and async functions
- Calling .arun(dataset, name=...) on the decorated function runs the evaluation and persists results
- Results are stored as an Experiment DataTable with the same backend as the dataset
- Multiple experiments can be compared by loading results from different runs
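The decorator pattern described above can be approximated in a few lines of asyncio. The following is a hypothetical re-implementation for illustration only, not the Ragas @experiment decorator: it attaches an arun method to the row function, accepts both sync and async functions, and returns results in memory instead of persisting them. The fake_rag function is a placeholder for the system under test.

```python
import asyncio
import inspect


def experiment(func):
    """Hypothetical stand-in for an experiment decorator: adds `arun` to `func`."""

    async def run_row(row):
        # Support both sync and async row functions.
        result = func(row)
        if inspect.isawaitable(result):
            result = await result
        return result

    async def arun(dataset, name):
        # Process all rows concurrently; gather preserves the input order.
        rows = await asyncio.gather(*(run_row(row) for row in dataset))
        return {"name": name, "rows": list(rows)}

    func.arun = arun
    return func


def fake_rag(question: str) -> str:
    # Placeholder for the RAG system under test.
    return "Paris" if "France" in question else "unknown"


@experiment
async def evaluate_row(row):
    # Invoke the system, score the output, and return one result row.
    response = fake_rag(row["question"])
    score = "pass" if row["expected"].lower() in response.lower() else "fail"
    return {**row, "response": response, "score": score}


dataset = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is the capital of Peru?", "expected": "Lima"},
]
results = asyncio.run(evaluate_row.arun(dataset, name="baseline"))
```

Keeping the per-row function free of iteration logic means the same function can be reused unchanged when the dataset or the run name changes.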
Step 5: Analyze Results and Iterate
Review the experiment results to identify quality issues. Examine per-sample scores, aggregate statistics, and error patterns. Use the results to guide improvements to the RAG system (better retrieval, improved prompts, different LLMs) and re-run the experiment to measure progress.
Key considerations:
- Results can be exported to pandas DataFrame for analysis
- The experiment versioning system can create git commits for reproducibility
- Compare experiments across different configurations (models, retrieval strategies)
- Use the CLI to display results in rich formatted tables with baseline comparison
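Aggregate statistics and a baseline comparison can be computed directly from result rows. The sketch below assumes each row is a dict with "question" and "score" keys, as in a pass/fail experiment; the functions pass_rate and changed_samples and the sample data are hypothetical, and a pandas DataFrame or the CLI could replace them in practice.

```python
from collections import Counter


def pass_rate(rows: list[dict]) -> float:
    # Fraction of rows scored "pass"; 0.0 for an empty run.
    counts = Counter(row["score"] for row in rows)
    total = sum(counts.values())
    return counts["pass"] / total if total else 0.0


def changed_samples(baseline: list[dict], candidate: list[dict], key: str = "question") -> list[dict]:
    # List samples present in both runs whose score differs.
    base_scores = {row[key]: row["score"] for row in baseline}
    return [
        {key: row[key], "baseline": base_scores[row[key]], "candidate": row["score"]}
        for row in candidate
        if row[key] in base_scores and base_scores[row[key]] != row["score"]
    ]


# Illustrative data only, not real experiment output.
baseline_rows = [
    {"question": "q1", "score": "pass"},
    {"question": "q2", "score": "fail"},
]
candidate_rows = [
    {"question": "q1", "score": "pass"},
    {"question": "q2", "score": "pass"},
]
improvement = pass_rate(candidate_rows) - pass_rate(baseline_rows)
changes = changed_samples(baseline_rows, candidate_rows)
```

Looking at the per-sample changes alongside the aggregate guards against a net improvement that hides regressions on individual questions.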