Workflow:Vibrantlabsai Ragas Experiment Driven Development

Knowledge Sources	Ragas Ragas Docs Experimentation Guide
Domains	LLM_Ops, Evaluation, Experimentation
Last Updated	2026-02-12 10:00 GMT

Overview

End-to-end process for running structured evaluation experiments using the Ragas @experiment() decorator with persistent storage, concurrent execution, and iterative comparison across runs.

Description

This workflow covers the recommended approach for evaluating LLM applications using Ragas' experiment framework. Instead of ad-hoc evaluation scripts, the @experiment() decorator provides a structured pattern for running evaluation functions over datasets with automatic result persistence, concurrent execution, and cross-run comparison. The framework supports pluggable storage backends (CSV, JSONL, Google Drive, in-memory), parameterized experiments for A/B testing, and CLI-based result visualization with regression detection.

Key outputs:

Experiment results persisted to configurable backends (CSV, JSONL, GDrive)
Structured result rows with metrics, metadata, and traceability
Comparable experiment runs for iterative improvement tracking

Usage

Execute this workflow when you want to systematically evaluate any LLM application (RAG, agent, prompt, workflow) with reproducible experiments that can be compared across iterations. This is the recommended approach for teams practicing evaluation-driven development where each change to the system is validated against a test dataset before deployment.

Execution Steps

Step 1: Create_Dataset

Create a Dataset containing test samples for evaluation. Each row represents a test case with input fields (e.g., question, expected_answer). The Dataset class provides a list-like interface backed by a storage backend. Data can be loaded from CSV files, created programmatically via append(), or converted from existing data sources.

Key considerations:

Choose a backend: local/csv, local/jsonl, gdrive, or in-memory
Each row is a dictionary with string keys and typed values
File-based backends save datasets under a root directory for persistence
The same dataset should be reused across experiment iterations so that runs remain comparable
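The idea of rows as dictionaries persisted under a root directory can be sketched with the standard library alone. This is a minimal illustration, not the Ragas Dataset class: the helper names save_rows and load_rows, the dataset name, and the file layout are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(rows, root_dir, name):
    """Write a list of row dictionaries to <root_dir>/<name>.csv (hypothetical layout)."""
    path = Path(root_dir) / f"{name}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


def load_rows(path):
    """Read the CSV back into a list of dictionaries (all values come back as strings)."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Build the test set programmatically, one append() per test case.
rows = []
rows.append({"question": "What is the capital of France?", "expected_answer": "Paris"})
rows.append({"question": "What is 2 + 2?", "expected_answer": "4"})

root_dir = tempfile.mkdtemp()  # stands in for the project's root directory
dataset_path = save_rows(rows, root_dir, "qa_smoke_test")
reloaded = load_rows(dataset_path)
```

One design point the sketch makes visible: a CSV round trip returns every value as a string, so non-string fields need an explicit conversion step (or a format such as JSONL) if their types matter.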

Step 2: Define_Metrics

Define the evaluation metrics that will score each test sample. Metrics can be built-in Ragas metrics, custom metrics created via decorators, or LLM-based metrics using DiscreteMetric/NumericMetric classes. Each metric provides a score() or ascore() method that returns a MetricResult.

Key considerations:

Select metrics that measure the specific quality dimensions you care about
Metrics can be combined to evaluate multiple aspects per sample
LLM-based metrics require an LLM instance to be passed at scoring time
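To make the shape of a metric concrete, here is a small standard-library sketch of one discrete and one numeric metric combined per sample. It does not use the Ragas metric classes: SketchResult, exact_match, token_overlap, and score_sample are hypothetical names, and neither metric calls an LLM.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchResult:
    """Hypothetical stand-in for a metric result: a value plus an optional reason."""
    value: object
    reason: Optional[str] = None


def exact_match(response: str, expected: str) -> SketchResult:
    """Discrete metric: 'pass' when the normalized strings are equal, else 'fail'."""
    same = response.strip().lower() == expected.strip().lower()
    reason = None if same else f"expected {expected!r}, got {response!r}"
    return SketchResult("pass" if same else "fail", reason)


def token_overlap(response: str, expected: str) -> SketchResult:
    """Numeric metric: fraction of expected tokens that appear in the response."""
    expected_tokens = set(expected.lower().split())
    if not expected_tokens:
        return SketchResult(0.0, "empty expected answer")
    found = expected_tokens & set(response.lower().split())
    return SketchResult(len(found) / len(expected_tokens))


def score_sample(response: str, expected: str) -> dict:
    """Combine several metrics into one flat dictionary of scores for a sample."""
    return {
        "exact_match": exact_match(response, expected).value,
        "token_overlap": token_overlap(response, expected).value,
    }
```

Returning a flat dictionary of scores keeps each quality dimension in its own column, which makes later comparison across runs straightforward.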

Step 3: Write_Experiment_Function

Define an async function decorated with @experiment() that processes individual dataset rows. The function receives a row dictionary, runs the system under test, scores the output with metrics, and returns an augmented result dictionary. Additional parameters beyond the row can be passed at runtime for A/B testing scenarios.

What happens:

The decorated function becomes an ExperimentWrapper instance
Additional keyword arguments are forwarded from arun() to the function
The function should return a dictionary with all input fields plus outputs and scores
Error handling is built into the framework for graceful failure tracking
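The pattern described above (a decorator that turns an async row function into a wrapper object, with extra keyword arguments forwarded at run time) can be sketched as follows. This is a simplified stand-in written with the standard library, not the Ragas implementation: SketchExperimentWrapper, toy_system, and the prompt_style parameter are hypothetical, and recording failures in an "error" field is an assumption about what graceful failure tracking might look like.

```python
import asyncio


class SketchExperimentWrapper:
    """Hypothetical, simplified stand-in for the object the decorator returns."""

    def __init__(self, func):
        self.func = func

    async def arun(self, dataset, **kwargs):
        # Extra keyword arguments are forwarded unchanged to the wrapped function.
        results = []
        for row in dataset:
            try:
                results.append(await self.func(row, **kwargs))
            except Exception as exc:
                # Assumption: a failed row is recorded rather than aborting the run.
                results.append({**row, "error": repr(exc)})
        return results


def experiment():
    """Sketch of a decorator factory that wraps an async row function."""
    def decorator(func):
        return SketchExperimentWrapper(func)
    return decorator


async def toy_system(question, prompt_style):
    """Toy system under test; a real one would call an LLM or a pipeline."""
    answer = {"What is the capital of France?": "Paris"}.get(question, "unknown")
    return answer if prompt_style == "terse" else f"The answer is {answer}."


@experiment()
async def qa_experiment(row, prompt_style="terse"):
    response = await toy_system(row["question"], prompt_style)
    verdict = "pass" if response == row["expected_answer"] else "fail"
    # Return the input fields plus the output, the variant, and the score.
    return {**row, "response": response, "prompt_style": prompt_style, "exact_match": verdict}


dataset = [{"question": "What is the capital of France?", "expected_answer": "Paris"}]
terse_results = asyncio.run(qa_experiment.arun(dataset, prompt_style="terse"))
verbose_results = asyncio.run(qa_experiment.arun(dataset, prompt_style="verbose"))
```

Running the same function twice with different keyword arguments is the A/B scenario: the dataset and scoring stay fixed while only the variant under test changes. This sketch processes rows one at a time to keep the wrapper readable.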

Step 4: Run_Experiment

Call experiment_function.arun(dataset) to execute the experiment. The framework generates a unique experiment name (or uses a provided prefix), creates async tasks for all dataset items, executes them concurrently with progress tracking, and persists results to the configured backend.

What happens:

Tasks are created for each row in the dataset
Tasks execute concurrently with automatic progress display
Results are appended to an Experiment DataTable
The experiment is saved after all tasks complete
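The run sequence above (name the run, create one task per row, execute concurrently with progress output, then save) can be approximated with asyncio. This is an illustrative sketch rather than the framework's code: run_experiment, evaluate_row, the max_concurrency limit, the random-suffix naming scheme, and the JSONL file layout are all assumptions.

```python
import asyncio
import json
import tempfile
import uuid
from pathlib import Path


async def evaluate_row(row):
    """Toy per-row evaluation; the sleep stands in for a slow LLM call."""
    await asyncio.sleep(0.01)
    response = row["question"].upper()
    return {**row, "response": response, "length": len(response)}


async def run_experiment(dataset, root_dir, name_prefix="", max_concurrency=4):
    # Assumption: a unique run name is an optional prefix plus a random suffix.
    suffix = uuid.uuid4().hex[:8]
    name = f"{name_prefix}-{suffix}" if name_prefix else suffix
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(row):
        # Limit how many rows are evaluated at the same time.
        async with semaphore:
            return await evaluate_row(row)

    # One task per dataset row.
    tasks = [asyncio.create_task(guarded(row)) for row in dataset]
    results = []
    for done, finished in enumerate(asyncio.as_completed(tasks), start=1):
        results.append(await finished)
        print(f"{done}/{len(tasks)} rows complete")

    # Persist only after every task has finished.
    path = Path(root_dir) / f"{name}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for result in results:
            f.write(json.dumps(result) + "\n")
    return name, path, results


dataset = [{"question": "a"}, {"question": "b"}, {"question": "c"}]
root_dir = tempfile.mkdtemp()
name, path, results = asyncio.run(run_experiment(dataset, root_dir, name_prefix="baseline"))
```

Because results are collected in completion order, they may not match the dataset order; keeping the input fields in each result row is what lets them be matched back to their test cases.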

Step 5: Compare_And_Iterate

Analyze experiment results and compare across runs using the CLI or direct DataFrame access. The Ragas CLI provides formatted comparison tables showing metric deltas between current and baseline runs with pass/fail gates for regression detection. Use insights from results to modify the system under test and re-run the experiment.

Key considerations:

Use the ragas evals command for formatted metric comparison
Numeric metrics show delta values with up/down arrows for improvement/regression
Categorical metrics show count distributions with deltas
Iterate by modifying the system and re-running with the same dataset
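The comparison logic described above can be illustrated without the CLI. The following sketch computes a mean delta with a direction arrow for a numeric metric, count distributions with deltas for a categorical metric, and a simple pass/fail gate. The function names, the sample rows, and the assumption that higher numeric scores are better are all hypothetical and are not taken from the Ragas CLI.

```python
from collections import Counter
from statistics import mean


def numeric_delta(baseline, current, metric):
    """Return (baseline mean, current mean, delta, arrow) for a numeric metric."""
    base = mean(row[metric] for row in baseline)
    cur = mean(row[metric] for row in current)
    delta = cur - base
    # Assumption: higher is better, so a positive delta is an improvement.
    arrow = "↑" if delta > 0 else "↓" if delta < 0 else "="
    return base, cur, delta, arrow


def categorical_delta(baseline, current, metric):
    """Return {label: (current count, change versus baseline)} for a categorical metric."""
    base_counts = Counter(row[metric] for row in baseline)
    cur_counts = Counter(row[metric] for row in current)
    labels = sorted(set(base_counts) | set(cur_counts))
    return {label: (cur_counts[label], cur_counts[label] - base_counts[label]) for label in labels}


def gate(delta, max_regression=0.0):
    """Pass when the metric has not dropped by more than the allowed regression."""
    return delta >= -max_regression


# Made-up rows from two runs over the same two-sample dataset.
baseline_run = [{"score": 0.5, "verdict": "fail"}, {"score": 0.75, "verdict": "pass"}]
current_run = [{"score": 0.75, "verdict": "pass"}, {"score": 1.0, "verdict": "pass"}]

base, cur, delta, arrow = numeric_delta(baseline_run, current_run, "score")
verdict_changes = categorical_delta(baseline_run, current_run, "verdict")
passed = gate(delta)
```

A gate like this only gives a meaningful signal when both runs used the same dataset, which is why the dataset is held fixed across iterations.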
