Principle:Explodinggradients Ragas RAG System Interface

Knowledge Sources	Domains	Last Updated
`examples/ragas_examples/rag_eval/rag.py`, `examples/ragas_examples/improve_rag/rag.py`, `examples/ragas_examples/rag_eval/evals.py`, `examples/ragas_examples/improve_rag/evals.py`	RAG Evaluation, LLM Testing, Retrieval-Augmented Generation	2026-02-10

Overview

Description

The RAG System Interface principle defines a standard contract that any Retrieval-Augmented Generation (RAG) system must implement in order to be systematically evaluated by Ragas. Rather than coupling evaluation logic to specific RAG implementation details such as the choice of retriever, embedding model, or language model, this principle establishes a uniform interface through which evaluation harnesses interact with RAG systems. The system under test accepts a natural-language question and returns a structured dictionary containing, at minimum, the generated response text and, where applicable, the list of retrieved context documents.

This abstraction allows the same evaluation metrics, datasets, and experiment infrastructure to be applied across fundamentally different RAG architectures -- from simple keyword-matching retrievers to BM25-based pipelines to agentic multi-step retrieval systems -- without modifying the evaluation code itself.

Usage

When building a RAG system intended for evaluation with Ragas, implementers define a class with a query method (or equivalent callable) that:

Accepts a question string as its primary argument
Optionally accepts retrieval parameters such as top_k
Returns a dictionary with keys such as "answer", "retrieved_documents", and optional metadata like "run_id" or "logs"

The evaluation harness (the function decorated with @experiment()) then calls this interface for each row in the dataset, extracts the response, and passes it to evaluation metrics. Because the interface is stable, the RAG system implementation can be iterated on independently of the evaluation pipeline.
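As a rough illustration of the contract described above, the sketch below implements a toy system with a query method that accepts a question and an optional top_k, and returns a dictionary with "answer" and "retrieved_documents". The class name ToyKeywordRAG, the word-overlap scoring, and the per-document keys ("document_id", "content", "score") are assumptions made for this example and are not taken from the Ragas repository.

```python
from typing import Any


class ToyKeywordRAG:
    """Hypothetical RAG system: word-overlap retrieval, no real LLM call."""

    def __init__(self, documents: list[str]) -> None:
        self.documents = documents

    def query(self, question: str, top_k: int = 2) -> dict[str, Any]:
        q_words = set(question.lower().split())
        # Score each document by the number of words it shares with the question.
        scored = [
            (len(q_words & set(doc.lower().split())), idx, doc)
            for idx, doc in enumerate(self.documents)
        ]
        # Highest overlap first; ties keep the original document order.
        scored.sort(key=lambda item: (-item[0], item[1]))
        retrieved = [
            {"document_id": idx, "content": doc, "score": score}
            for score, idx, doc in scored[:top_k]
            if score > 0
        ]
        # A real system would generate text with an LLM; the toy echoes the top hit.
        answer = retrieved[0]["content"] if retrieved else "I don't know."
        return {"answer": answer, "retrieved_documents": retrieved}


rag = ToyKeywordRAG([
    "Ragas evaluates RAG systems.",
    "BM25 is a ranking function.",
])
result = rag.query("what is bm25", top_k=1)
print(result["answer"])
```

Any harness that only reads result["answer"] and result["retrieved_documents"] can work with this class without knowing how retrieval is done.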

Theoretical Basis

Separation of Concerns

RAG evaluation requires a clear boundary between the system under test and the evaluation infrastructure. Without a standardized interface, evaluation code becomes tightly coupled to the internals of a particular RAG system -- referencing specific retriever classes, prompt templates, or LLM client APIs. This coupling makes it hard to reuse evaluation logic across different RAG implementations or to compare systems fairly.

The RAG System Interface principle resolves this by establishing that the evaluation harness only depends on a single contract: "given a question, return a structured response." This mirrors the concept of interface-based programming where consumers depend on abstractions rather than concrete implementations.
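One way to express this "depend on the abstraction" idea in Python is a structural type. The sketch below uses typing.Protocol from the standard library; the names RAGSystem, EchoRAG, and ask are hypothetical and exist only for this example. Ragas itself is not assumed to ship such a protocol.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class RAGSystem(Protocol):
    """Hypothetical structural type: anything with a compatible query method."""

    def query(self, question: str, **kwargs: Any) -> dict[str, Any]: ...


class EchoRAG:
    """Trivial implementation; it does not inherit from RAGSystem."""

    def query(self, question: str, **kwargs: Any) -> dict[str, Any]:
        return {"answer": f"You asked: {question}", "retrieved_documents": []}


def ask(system: RAGSystem, question: str) -> str:
    # The consumer relies only on the contract, not on the concrete class.
    return system.query(question)["answer"]


print(ask(EchoRAG(), "hi"))
# isinstance with a runtime-checkable Protocol only verifies that `query` exists.
print(isinstance(EchoRAG(), RAGSystem))
```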

Structured Output for Multi-Dimensional Evaluation

RAG evaluation is inherently multi-dimensional. A single query exercises the retrieval subsystem (did it find relevant documents?), the generation subsystem (did it produce a faithful answer?), and the integration of both (is the answer grounded in the retrieved context?). By requiring the interface to return both the generated answer and the retrieved contexts in a structured dictionary, this principle ensures that downstream metrics have access to all the data they need:

Answer-level metrics (e.g., correctness, relevance) use the "answer" field
Retrieval-level metrics (e.g., context precision, context recall) use the "retrieved_documents" or "retrieved_contexts" field
Faithfulness metrics use both the answer and the retrieved context together
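The split between the three metric families can be sketched with toy scoring functions that each read only the fields they need. These functions are simplified stand-ins written for this example, not Ragas metrics, and the "content" key inside each retrieved document is an assumption.

```python
def answer_contains(result: dict, expected: str) -> bool:
    """Answer-level check: uses only the "answer" field."""
    return expected.lower() in result.get("answer", "").lower()


def context_contains(result: dict, expected: str) -> bool:
    """Retrieval-level check: uses only the "retrieved_documents" field."""
    return any(
        expected.lower() in doc.get("content", "").lower()
        for doc in result.get("retrieved_documents", [])
    )


def grounded_fraction(result: dict) -> float:
    """Faithfulness-style check: share of answer words found in the contexts."""
    answer_words = result.get("answer", "").lower().split()
    context_words: set[str] = set()
    for doc in result.get("retrieved_documents", []):
        context_words.update(doc.get("content", "").lower().split())
    if not answer_words:
        return 0.0
    return sum(word in context_words for word in answer_words) / len(answer_words)


result = {
    "answer": "paris is the capital",
    "retrieved_documents": [{"content": "Paris is the capital of France"}],
}
print(answer_contains(result, "Paris"))
print(context_contains(result, "France"))
print(grounded_fraction(result))
```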

Enabling Fair Comparison

When multiple RAG implementations conform to the same interface, they can be evaluated against the same dataset with the same metrics. This enables controlled comparisons such as:

Keyword retriever versus BM25 retriever versus vector similarity retriever
Naive single-pass RAG versus agentic multi-step RAG
Different LLM backends (GPT-4o, GPT-4o-mini, etc.) with identical retrieval

The Ragas repository demonstrates this directly by providing two distinct RAG implementations -- ExampleRAG (synchronous, keyword-based) and RAG (async, BM25-based with naive/agentic modes) -- both conforming to the same interface pattern and both evaluated using the same @experiment() framework.
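A minimal sketch of such a controlled comparison follows. It does not reproduce ExampleRAG or RAG from the repository; FirstDocRAG, OverlapRAG, the accuracy function, and the two-row dataset are all hypothetical and serve only to show one evaluation function scoring two systems through the same query interface.

```python
from typing import Any


class FirstDocRAG:
    """Hypothetical baseline: always answers with the first document."""

    def __init__(self, documents: list[str]) -> None:
        self.documents = documents

    def query(self, question: str, **kwargs: Any) -> dict[str, Any]:
        doc = self.documents[0]
        return {"answer": doc, "retrieved_documents": [{"content": doc}]}


class OverlapRAG:
    """Hypothetical variant: answers with the document sharing the most words."""

    def __init__(self, documents: list[str]) -> None:
        self.documents = documents

    def query(self, question: str, **kwargs: Any) -> dict[str, Any]:
        q_words = set(question.lower().split())
        doc = max(
            self.documents,
            key=lambda d: len(q_words & set(d.lower().split())),
        )
        return {"answer": doc, "retrieved_documents": [{"content": doc}]}


def accuracy(system: Any, dataset: list[dict[str, str]]) -> float:
    """Fraction of rows whose expected string appears in the answer."""
    hits = sum(
        row["expected"].lower() in system.query(row["question"])["answer"].lower()
        for row in dataset
    )
    return hits / len(dataset)


docs = ["The sky is blue.", "Grass is green."]
dataset = [
    {"question": "what color is the sky", "expected": "blue"},
    {"question": "what color is grass", "expected": "green"},
]
# Same dataset, same metric, different systems.
scores = {
    "first_doc": accuracy(FirstDocRAG(docs), dataset),
    "overlap": accuracy(OverlapRAG(docs), dataset),
}
print(scores)
```

Because accuracy touches only query and the "answer" key, swapping in another conforming system requires no change to the evaluation code.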

Practical Guide

Defining the Interface

The minimal interface requires:

class MyRAG:
    def query(self, question: str, **kwargs) -> dict:
        """
        Returns:
            {
                "answer": str,                    # Generated response text
                "retrieved_documents": List[dict], # List of retrieved context dicts
                ...                                # Optional metadata
            }
        """
        ...

Wiring into Evaluation

The experiment function calls the interface and feeds results into metrics:

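from ragas import experiment

# `my_metric` is a placeholder for a metric object defined elsewhere.
@experiment()
async def run_experiment(row, rag, llm):
    # Sync interface shown; with an async RAG system, use `await rag.query(...)`.
    response = rag.query(row["question"])
    # Read the answer once so a missing key cannot raise later.
    answer = response.get("answer", "")
    score = my_metric.score(
        response=answer,
        grading_notes=row["grading_notes"],
        llm=llm,
    )
    return {**row, "response": answer, "score": score.value}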

Key Design Decisions

Dictionary return type rather than a custom dataclass: This keeps the interface language-agnostic and easy to extend without breaking compatibility.
Optional metadata fields: Fields like "run_id", "logs", and "mlflow_trace_id" are encouraged but not required, allowing simple systems to remain simple while production systems can include full observability.
Synchronous and asynchronous variants: The interface supports both query() (sync) and await query() (async), as demonstrated by the two example implementations in the repository.
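If a single harness needs to handle both variants, one possible approach is to call query and await the result only when it is awaitable. The helper call_query and the classes SyncRAG and AsyncRAG below are hypothetical names for this sketch; the repository examples may handle this differently.

```python
import asyncio
import inspect
from typing import Any


async def call_query(rag: Any, question: str, **kwargs: Any) -> dict[str, Any]:
    """Call rag.query and await the result only if it is awaitable."""
    result = rag.query(question, **kwargs)
    if inspect.isawaitable(result):
        result = await result
    return result


class SyncRAG:
    def query(self, question: str, **kwargs: Any) -> dict[str, Any]:
        return {"answer": f"sync: {question}", "retrieved_documents": []}


class AsyncRAG:
    async def query(self, question: str, **kwargs: Any) -> dict[str, Any]:
        await asyncio.sleep(0)  # stands in for real async I/O
        return {"answer": f"async: {question}", "retrieved_documents": []}


sync_result = asyncio.run(call_query(SyncRAG(), "q"))
async_result = asyncio.run(call_query(AsyncRAG(), "q"))
print(sync_result["answer"], "|", async_result["answer"])
```

Note that a synchronous query still blocks the event loop while it runs, so this adapter unifies the call site but does not make a sync system concurrent.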

Related Pages

Implementation:Explodinggradients_Ragas_RAG_Query_Function_Pattern
