Workflow:Vibrantlabsai Ragas RAG Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, RAG, Evaluation |
| Last Updated | 2026-02-12 10:00 GMT |
Overview
End-to-end process for evaluating Retrieval-Augmented Generation (RAG) systems using Ragas metrics to measure retrieval quality, faithfulness, and answer correctness.
Description
This workflow covers the standard procedure for evaluating RAG pipelines with the Ragas evaluation toolkit. It measures how well a RAG system retrieves relevant context, whether the generated response is faithful to that context, and whether the final answer is factually correct. For each test case, the process collects the query, the generated response, and the retrieved contexts (plus a reference answer for metrics that need one) and scores them using both LLM-based and embedding-based metrics. Supported metrics include Faithfulness, ContextPrecision, ContextRecall, AnswerRelevancy, and FactualCorrectness.
Key outputs:
- Per-sample scores for each metric
- Aggregate scores across the evaluation dataset
- An EvaluationResult object with exportable results
Usage
Run this workflow when you have a functioning RAG system and need to quantify its retrieval and generation quality. It fits best when you have a set of test queries with expected answers and want to assess how well the retrieval pipeline ranks relevant documents and how faithfully the LLM generates answers from the retrieved context.
Execution Steps
Step 1: Prepare_Evaluation_Dataset
Collect test samples containing queries and expected answers, together with the contexts retrieved and the responses generated by the RAG system for each query. Each sample is structured as a SingleTurnSample with the fields user_input (the query), response (the generated answer), retrieved_contexts (the retrieved passages), and reference (the expected answer). These samples are assembled into an EvaluationDataset.
Key considerations:
- Each sample must contain the fields required by the chosen metrics
- Faithfulness requires response and retrieved_contexts
- ContextPrecision requires user_input, retrieved_contexts, and reference
- AnswerRelevancy requires user_input and response
- FactualCorrectness requires response and reference
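The field requirements above can be checked before an evaluation run. The following is a minimal sketch in plain Python and is not part of Ragas. The `REQUIRED_FIELDS` table and the `missing_fields` helper are hypothetical names, samples are represented as plain dictionaries rather than SingleTurnSample objects, and the requirements are copied from the list above.

```python
# Illustrative pre-flight check; not a Ragas API.
# Required fields per metric, copied from the considerations listed above.
REQUIRED_FIELDS = {
    "Faithfulness": {"response", "retrieved_contexts"},
    "ContextPrecision": {"user_input", "retrieved_contexts", "reference"},
    "AnswerRelevancy": {"user_input", "response"},
    "FactualCorrectness": {"response", "reference"},
}


def missing_fields(sample: dict, metrics: list[str]) -> dict[str, list[str]]:
    """For each metric, list the required fields that are absent or empty in the sample."""
    problems = {}
    for metric in metrics:
        missing = sorted(
            field for field in REQUIRED_FIELDS[metric] if not sample.get(field)
        )
        if missing:
            problems[metric] = missing
    return problems


# Hypothetical sample; "reference" is deliberately omitted to show a failing check.
sample = {
    "user_input": "What is the capital of France?",
    "response": "The capital of France is Paris.",
    "retrieved_contexts": ["Paris is the capital and largest city of France."],
}

report = missing_fields(sample, ["Faithfulness", "ContextPrecision", "FactualCorrectness"])
print(report)
```

Running a check like this over every sample before the evaluation surfaces incomplete records early, instead of partway through a run that has already spent evaluator LLM calls.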
Step 2: Configure_LLM_And_Embeddings
Initialize the LLM and embedding model that Ragas will use as the evaluator (judge). This is separate from the LLM used in the RAG system itself. Ragas supports multiple providers including OpenAI, Anthropic, Google, and others through its llm_factory and embedding abstractions. The evaluator LLM is used by metrics that require LLM reasoning (e.g., Faithfulness, ContextPrecision), while the embedding model is used by metrics that compute semantic similarity (e.g., AnswerRelevancy, SemanticSimilarity).
Key considerations:
- If no LLM is specified, Ragas defaults to OpenAI gpt-4o-mini
- The evaluator LLM should be capable enough to judge quality accurately, since a weak judge produces unreliable scores
- LangChain LLM wrappers can be used via LangchainLLMWrapper
Step 3: Select_Metrics
Choose which evaluation metrics to run against the dataset. Ragas provides pre-built metrics for RAG evaluation organized into retrieval metrics (ContextPrecision, ContextRecall, ContextEntityRecall) and generation metrics (Faithfulness, FactualCorrectness, AnswerRelevancy, AnswerCorrectness). If no metrics are specified, Ragas defaults to answer_relevancy, context_precision, faithfulness, and context_recall.
Key considerations:
- Select metrics that match your evaluation goals
- Ensure dataset columns satisfy each metric's required fields
- Metrics can be combined freely in a single evaluation run
Step 4: Run_Evaluation
Call the evaluate() or aevaluate() function with the dataset, metrics, LLM, and embeddings. The Executor manages concurrent async evaluation with configurable concurrency limits, timeouts, and retries via RunConfig. Each metric scores every sample in the dataset, producing per-sample and aggregate results.
What happens:
- Metrics are initialized with the provided LLM and embeddings
- An Executor creates async tasks for each (sample, metric) pair
- Tasks run concurrently with semaphore-based rate limiting
- Results are collected into an EvaluationResult object
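The execution model described above (one async task per (sample, metric) pair, bounded by a semaphore, with timeouts and retries) can be sketched with the standard library alone. This is an illustration of the pattern, not the Ragas Executor: `score_pair`, `run_all`, and the parameters `max_workers`, `timeout`, and `max_retries` are hypothetical names, and the scoring function is a stub that returns a placeholder value instead of calling an evaluator LLM.

```python
import asyncio


async def score_pair(sample: dict, metric: str) -> float:
    """Stand-in for one metric call; a real metric would query the evaluator LLM here."""
    await asyncio.sleep(0.01)  # simulate network latency
    return 1.0 if sample["response"] else 0.0  # placeholder score, not a real metric


async def run_all(samples, metrics, max_workers=4, timeout=5.0, max_retries=2):
    """Score every (sample, metric) pair with bounded concurrency, a timeout, and retries."""
    semaphore = asyncio.Semaphore(max_workers)

    async def guarded(index, sample, metric):
        async with semaphore:  # at most max_workers pairs are scored at once
            for attempt in range(max_retries + 1):
                try:
                    score = await asyncio.wait_for(score_pair(sample, metric), timeout)
                    return index, metric, score
                except asyncio.TimeoutError:
                    if attempt == max_retries:
                        return index, metric, float("nan")  # give up on this pair

    tasks = [
        guarded(i, sample, metric)
        for i, sample in enumerate(samples)
        for metric in metrics
    ]
    # Collect results into one score dictionary per sample.
    rows = [{} for _ in samples]
    for index, metric, score in await asyncio.gather(*tasks):
        rows[index][metric] = score
    return rows


samples = [{"response": "Paris"}, {"response": ""}]
scores = asyncio.run(run_all(samples, ["faithfulness", "answer_relevancy"]))
print(scores)
```

The semaphore caps how many evaluator calls are in flight, which is what keeps a large dataset from exceeding provider rate limits. Recording NaN for a pair that exhausts its retries is one possible design choice that lets the rest of the run complete.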
Step 5: Analyze_Results
Inspect the EvaluationResult to understand RAG system performance. Results can be exported to a pandas DataFrame for analysis, converted to a HuggingFace Dataset for sharing, or saved to CSV. Per-sample scores help identify specific failure cases, while aggregate scores provide overall quality metrics.
Key considerations:
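- Low Faithfulness scores indicate hallucination: the response makes claims that the retrieved context does not support
- Low ContextPrecision suggests retrieval ranking problems: relevant chunks are ranked below irrelevant ones
- Low ContextRecall means information needed to produce the reference answer is not being retrieved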
- Results can be tracked over time using the experiment framework
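To show how per-sample and aggregate scores complement each other, the sketch below works on a hand-written list of score dictionaries rather than a real EvaluationResult. The scores, the `aggregate` and `failures` helpers, the use of the mean as the aggregate, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from statistics import mean

# Hypothetical per-sample scores, one dictionary per evaluated sample.
per_sample = [
    {"faithfulness": 1.0, "context_precision": 0.5, "context_recall": 1.0},
    {"faithfulness": 0.25, "context_precision": 1.0, "context_recall": 0.5},
    {"faithfulness": 0.75, "context_precision": 0.0, "context_recall": 1.0},
]


def aggregate(rows):
    """Mean score per metric across all samples."""
    return {metric: mean(row[metric] for row in rows) for metric in rows[0]}


def failures(rows, metric, threshold):
    """Indices of samples whose score for `metric` falls below `threshold`."""
    return [i for i, row in enumerate(rows) if row[metric] < threshold]


summary = aggregate(per_sample)
low_faithfulness = failures(per_sample, "faithfulness", 0.5)   # inspect for unsupported claims
low_precision = failures(per_sample, "context_precision", 0.5)  # inspect for ranking problems

print(summary)
print(low_faithfulness, low_precision)
```

The aggregate shows which quality dimension is weakest overall, while the index lists point to the individual samples worth reading by hand.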