Workflow:Vibrantlabsai Ragas Experiment Driven Development
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Evaluation, Experimentation |
| Last Updated | 2026-02-12 10:00 GMT |
Overview
End-to-end process for running structured evaluation experiments using the Ragas @experiment() decorator with persistent storage, concurrent execution, and iterative comparison across runs.
Description
This workflow covers the recommended approach for evaluating LLM applications using Ragas' experiment framework. Instead of ad-hoc evaluation scripts, the @experiment() decorator provides a structured pattern for running evaluation functions over datasets with automatic result persistence, concurrent execution, and cross-run comparison. The framework supports pluggable storage backends (CSV, JSONL, Google Drive, in-memory), parameterized experiments for A/B testing, and CLI-based result visualization with regression detection.
Key outputs:
- Experiment results persisted to configurable backends (CSV, JSONL, Google Drive)
- Structured result rows with metrics, metadata, and traceability
- Comparable experiment runs for iterative improvement tracking
Usage
Execute this workflow when you want to systematically evaluate any LLM application (RAG, agent, prompt, workflow) with reproducible experiments that can be compared across iterations. This is the recommended approach for teams practicing evaluation-driven development, in which each change to the system is validated against a test dataset before deployment.
Execution Steps
Step 1: Create_Dataset
Create a Dataset containing test samples for evaluation. Each row represents a test case with input fields (e.g., question, expected_answer). The Dataset class provides a list-like interface backed by a storage backend. Data can be loaded from CSV files, created programmatically via append(), or converted from existing data sources.
Key considerations:
- Choose a backend: local/csv, local/jsonl, gdrive, or in-memory
- Each row is a dictionary with string keys and typed values
- Datasets are saved to a root directory for persistence
- The same dataset should be reused across experiment iterations
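The row-and-backend idea can be illustrated without the library. The following is a minimal sketch using only the Python standard library, not the Ragas Dataset API: the helper names (save_rows, load_rows), the datasets/ subdirectory, and the sample rows are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical test cases: each row is a dictionary with string keys.
rows = [
    {"question": "What is the capital of France?", "expected_answer": "Paris"},
    {"question": "What is 2 + 2?", "expected_answer": "4"},
]


def save_rows(rows, root_dir, name):
    """Write rows to <root_dir>/datasets/<name>.csv and return the file path."""
    path = Path(root_dir) / "datasets" / f"{name}.csv"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path


def load_rows(path):
    """Read a CSV file back into a list of row dictionaries."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Persist once under a root directory, then reload the same file for every run.
root = tempfile.mkdtemp()
dataset_path = save_rows(rows, root, "qa_test_set")
reloaded = load_rows(dataset_path)
```

Reloading the same file for every iteration is what keeps scores comparable across runs. Note that a CSV round trip returns every value as a string, so typed values would need explicit conversion in a sketch like this.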
Step 2: Define_Metrics
Define the evaluation metrics that will score each test sample. Metrics can be built-in Ragas metrics, custom metrics created via decorators, or LLM-based metrics using DiscreteMetric/NumericMetric classes. Each metric provides a score() or ascore() method that returns a MetricResult.
Key considerations:
- Select metrics that measure the specific quality dimensions you care about
- Metrics can be combined to evaluate multiple aspects per sample
- LLM-based metrics require an LLM instance to be passed at scoring time
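To make the "metric returns a result object" shape concrete, here is a small sketch of two rule-based metrics. It does not use the Ragas classes: SketchResult, ExactMatchMetric, and LengthRatioMetric are hypothetical stand-ins, and their score() signatures are assumptions rather than the library's own.

```python
from dataclasses import dataclass


@dataclass
class SketchResult:
    """Hypothetical stand-in for a metric result: a value plus a short reason."""
    value: object
    reason: str = ""


class ExactMatchMetric:
    """Categorical metric: 'pass' if the response equals the expected answer."""
    name = "exact_match"

    def score(self, response: str, expected: str) -> SketchResult:
        matched = response.strip().lower() == expected.strip().lower()
        reason = "normalized strings match" if matched else "normalized strings differ"
        return SketchResult(value="pass" if matched else "fail", reason=reason)


class LengthRatioMetric:
    """Numeric metric: shorter length divided by longer length, in [0, 1]."""
    name = "length_ratio"

    def score(self, response: str, expected: str) -> SketchResult:
        longer = max(len(response), len(expected), 1)  # avoid division by zero
        shorter = min(len(response), len(expected))
        return SketchResult(value=shorter / longer, reason=f"{shorter}/{longer} characters")


# Several metrics can be applied to the same sample to cover different aspects.
metrics = [ExactMatchMetric(), LengthRatioMetric()]
sample_scores = {m.name: m.score(" Paris ", "paris").value for m in metrics}
```

One metric here is categorical and one is numeric, which mirrors the discrete/numeric split described above. An LLM-based metric would additionally take a model instance at scoring time, which this sketch leaves out.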
Step 3: Write_Experiment_Function
Define an async function decorated with @experiment() that processes individual dataset rows. The function receives a row dictionary, runs the system under test, scores the output with metrics, and returns an augmented result dictionary. Additional parameters beyond the row can be passed at runtime for A/B testing scenarios.
What happens:
- The decorated function becomes an ExperimentWrapper instance
- Additional keyword arguments are forwarded from arun() to the function
- The function should return a dictionary with all input fields plus outputs and scores
- Error handling is built into the framework for graceful failure tracking
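The decorator pattern described in this step can be sketched in plain Python. The code below is an illustrative stand-in, not the Ragas implementation: sketch_experiment and SketchExperimentWrapper are hypothetical names, the model_name parameter is an assumed example of a runtime argument, and the "system under test" is a trivial placeholder.

```python
import asyncio


class SketchExperimentWrapper:
    """Hypothetical wrapper: holds the row-level function and adds arun()."""

    def __init__(self, func):
        self.func = func

    async def __call__(self, row, **kwargs):
        # Calling the wrapper directly processes a single row.
        return await self.func(row, **kwargs)

    async def arun(self, rows, **kwargs):
        # Extra keyword arguments are forwarded unchanged to every row call.
        return await asyncio.gather(*(self.func(row, **kwargs) for row in rows))


def sketch_experiment():
    """Decorator factory, so it is used with parentheses like the source describes."""
    def decorator(func):
        return SketchExperimentWrapper(func)
    return decorator


@sketch_experiment()
async def qa_experiment(row, model_name="baseline"):
    # Placeholder system under test: only the "candidate" variant answers correctly.
    response = row["expected_answer"] if model_name == "candidate" else "unknown"
    verdict = "pass" if response == row["expected_answer"] else "fail"
    # Return the input fields plus the output, the score, and the variant used.
    return {**row, "response": response, "exact_match": verdict, "model_name": model_name}


rows = [
    {"question": "What is the capital of France?", "expected_answer": "Paris"},
    {"question": "What is 2 + 2?", "expected_answer": "4"},
]
baseline_results = asyncio.run(qa_experiment.arun(rows))
candidate_results = asyncio.run(qa_experiment.arun(rows, model_name="candidate"))
```

Running the same function twice with a different keyword argument is the A/B pattern: the dataset and scoring stay fixed, and only the variant under test changes.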
Step 4: Run_Experiment
Call experiment_function.arun(dataset) to execute the experiment. The framework generates a unique experiment name (or uses a provided prefix), creates async tasks for all dataset items, executes them concurrently with progress tracking, and persists results to the configured backend.
What happens:
- Tasks are created for each row in the dataset
- Tasks execute concurrently, with progress displayed automatically
- Results are appended to an Experiment DataTable
- The experiment is saved after all tasks complete
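The run phase (name the run, fan out one task per row, collect results, save once at the end) might look roughly like the sketch below. It uses only asyncio and the standard library. The run_rows helper, its max_concurrency limit, the experiments/ directory, and the choice to record failures as an "error" field are all assumptions for illustration, and progress display is omitted.

```python
import asyncio
import json
import tempfile
import uuid
from pathlib import Path


async def run_rows(func, rows, name_prefix="", max_concurrency=4, root_dir="."):
    """Run func over rows concurrently and save the results as JSONL."""
    suffix = uuid.uuid4().hex[:8]  # unique part of the run name
    run_name = f"{name_prefix}-{suffix}" if name_prefix else suffix
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(row):
        async with semaphore:  # cap the number of rows in flight
            try:
                return await func(row)
            except Exception as exc:
                # Assumed policy: keep the row and record the failure.
                return {**row, "error": repr(exc)}

    # gather() returns results in the same order as the input rows.
    results = await asyncio.gather(*(guarded(row) for row in rows))

    # Save once, after every task has completed.
    out_path = Path(root_dir) / "experiments" / f"{run_name}.jsonl"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for result in results:
            f.write(json.dumps(result) + "\n")
    return run_name, out_path, results


async def question_length(row):
    """Toy row function that fails on empty input."""
    if not row["question"]:
        raise ValueError("empty question")
    await asyncio.sleep(0)  # yield control, as a real model call would
    return {**row, "length": len(row["question"])}


demo_rows = [{"question": "hello"}, {"question": ""}, {"question": "hi"}]
run_name, out_path, results = asyncio.run(
    run_rows(question_length, demo_rows, name_prefix="demo", root_dir=tempfile.mkdtemp())
)
```

Catching exceptions per row means one failing sample does not abort the whole run. How the real framework reports failures may differ from the "error" field used here.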
Step 5: Compare_And_Iterate
Analyze experiment results and compare across runs using the CLI or direct DataFrame access. The Ragas CLI provides formatted comparison tables showing metric deltas between current and baseline runs with pass/fail gates for regression detection. Use insights from results to modify the system under test and re-run the experiment.
Key considerations:
- Use the ragas evals command for a formatted metric comparison
- Numeric metrics show delta values with up/down arrows for improvement/regression
- Categorical metrics show count distributions with deltas
- Iterate by modifying the system and re-running with the same dataset
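The comparison logic can be approximated by hand to see what the deltas mean. The sketch below is not the CLI's implementation: the function names, the "up"/"down" labels standing in for arrows, the assumption that higher numeric scores are better, and the max_regression threshold are all illustrative choices.

```python
from collections import Counter
from statistics import mean


def numeric_delta(baseline, current, key):
    """Return (baseline mean, current mean, delta, direction) for a numeric metric."""
    base_mean = mean(row[key] for row in baseline)
    cur_mean = mean(row[key] for row in current)
    delta = cur_mean - base_mean
    # Assumes higher is better, so a positive delta is an improvement.
    direction = "up" if delta > 0 else "down" if delta < 0 else "flat"
    return base_mean, cur_mean, delta, direction


def categorical_delta(baseline, current, key):
    """Return the change in count for each label of a categorical metric."""
    base_counts = Counter(row[key] for row in baseline)
    cur_counts = Counter(row[key] for row in current)
    labels = sorted(set(base_counts) | set(cur_counts))
    return {label: cur_counts[label] - base_counts[label] for label in labels}


def passes_gate(delta, max_regression=0.0):
    """Pass unless the metric dropped by more than the allowed amount."""
    return delta >= -max_regression


# Hypothetical results from two runs over the same two-row dataset.
baseline = [{"score": 0.5, "verdict": "fail"}, {"score": 0.75, "verdict": "pass"}]
current = [{"score": 0.75, "verdict": "pass"}, {"score": 1.0, "verdict": "pass"}]

base_mean, cur_mean, delta, direction = numeric_delta(baseline, current, "score")
verdict_changes = categorical_delta(baseline, current, "verdict")
gate_ok = passes_gate(delta)
```

Because both runs use the same dataset, the delta reflects the change in the system rather than a change in the test cases.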