Principle:Explodinggradients_Ragas_Legacy_Evaluation_Pipeline
Legacy Evaluation Pipeline
Legacy Evaluation Pipeline is a principle in the Ragas evaluation toolkit that describes the original batch evaluation pattern for running multiple metrics across a dataset in a single function call. This pipeline has been deprecated in favor of the @experiment decorator pattern.
Motivation
Evaluating LLM applications requires running multiple metrics (faithfulness, answer correctness, context precision, etc.) across many samples. The legacy evaluation pipeline provided a single entry point that handled:
- Running multiple metrics on each sample in the dataset.
- Managing LLM and embedding model injection into metrics that require them.
- Parallel execution of metric scoring tasks.
- Aggregating individual scores into a unified result object.
- Tracking costs, traces, and callbacks throughout the evaluation.
Theoretical Foundation
Batch Evaluation Orchestration
The evaluation pipeline follows an orchestration pattern where a central function coordinates multiple independent scoring operations:
- Dataset validation -- The pipeline verifies that the dataset contains the columns required by each metric and that the sample types (single-turn vs. multi-turn) are supported.
- Model injection -- For metrics that require an LLM or embedding model but do not have one set, the pipeline injects the globally provided LLM/embeddings. If none are provided, it falls back to creating a default OpenAI-based model.
- Metric initialization -- Each metric's `init()` method is called with the runtime configuration to prepare internal state.
- Task submission -- For each sample in the dataset and each metric, an async scoring task is submitted to an executor. The executor manages concurrency, timeout, and retry logic.
- Result collection -- After all tasks complete, results are organized into a scores matrix (samples x metrics) and wrapped in an `EvaluationResult` object.
- Cleanup -- Model references injected by the pipeline are reset to `None` to avoid side effects.
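The six steps above can be sketched as a single async orchestrator. This is an illustrative toy, not the actual ragas source: `SimpleMetric`, `validate`, and `run_pipeline` are invented names, and the scoring logic is a stand-in for real LLM calls.

```python
# Illustrative sketch of the orchestration steps (validate, inject, init,
# submit, collect, clean up). All names here are invented for this example.
import asyncio

class SimpleMetric:
    """Toy metric with the lifecycle hooks the pipeline expects."""
    def __init__(self, name, required_columns):
        self.name = name
        self.required_columns = required_columns
        self.llm = None  # injected by the pipeline if unset

    def init(self, run_config):
        self.run_config = run_config  # prepare internal state

    async def ascore(self, sample):
        # Stand-in for a real LLM-backed scoring call.
        return float(len(sample["response"]) % 2)

def validate(dataset, metrics):
    # Step 1: every metric's required columns must exist in every sample.
    for metric in metrics:
        for sample in dataset:
            missing = set(metric.required_columns) - set(sample)
            if missing:
                raise ValueError(f"{metric.name} missing columns: {missing}")

async def run_pipeline(dataset, metrics, llm=None, run_config=None):
    validate(dataset, metrics)
    for metric in metrics:
        if metric.llm is None:           # Step 2: model injection
            metric.llm = llm
        metric.init(run_config)          # Step 3: metric initialization
    # Step 4: one independent task per (sample, metric) pair.
    tasks = [metric.ascore(sample) for sample in dataset for metric in metrics]
    flat = await asyncio.gather(*tasks)
    # Step 5: reshape flat results into a samples x metrics score matrix.
    n = len(metrics)
    scores = [flat[i:i + n] for i in range(0, len(flat), n)]
    for metric in metrics:               # Step 6: reset injected references
        metric.llm = None
    return scores
```

The cleanup step matters because metric objects are often module-level singletons: without resetting the injected model, one evaluation run would silently leak its LLM into the next.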
Parallel Execution
The pipeline uses an Executor class that manages an async task pool. Each metric-sample combination is submitted as an independent coroutine. The executor supports:
- Batch size control -- Limiting the number of concurrent tasks.
- Timeout -- Per-task timeout from the `RunConfig`.
- Progress tracking -- An optional progress bar showing evaluation progress.
- Error handling -- Failed tasks can either raise exceptions or return `NaN` values.
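A minimal sketch of that executor behavior, assuming a semaphore for batch-size control, `asyncio.wait_for` for per-task timeouts, and `NaN` substitution on failure. The `run_all` helper is invented for illustration and is not the ragas `Executor` implementation.

```python
# Sketch: concurrency-limited async execution with timeout and NaN-on-failure.
import asyncio
import math

async def run_all(coros, batch_size=4, timeout=1.0, raise_exceptions=False):
    sem = asyncio.Semaphore(batch_size)  # at most batch_size tasks in flight

    async def guarded(coro):
        async with sem:
            try:
                return await asyncio.wait_for(coro, timeout=timeout)
            except Exception:
                if raise_exceptions:
                    raise
                return math.nan  # failed or timed-out tasks score as NaN

    # gather preserves input order, so results line up with the submitted tasks.
    return await asyncio.gather(*(guarded(c) for c in coros))
```

Returning `NaN` instead of raising keeps one bad sample from aborting a long evaluation run, at the cost of requiring NaN-aware aggregation afterwards.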
Callback Integration
The pipeline integrates with LangChain's callback system to provide:
- Tracing -- A `RagasTracer` records the execution trace of each metric for debugging and analysis.
- Cost tracking -- An optional `CostCallbackHandler` accumulates token usage from LLM calls.
- Custom callbacks -- Users can provide additional callbacks for logging, monitoring, or integration with observability platforms.
The callback hierarchy creates nested groups: an evaluation-level group contains row-level groups, which contain metric-level calls.
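The nested grouping can be illustrated with a toy tracer. The `Tracer` class below is invented for this sketch; the real pipeline builds the hierarchy with LangChain callback managers and ragas' `RagasTracer`.

```python
# Sketch of the evaluation -> row -> metric callback nesting described above.
import contextlib

class Tracer:
    def __init__(self):
        self.events = []
        self._stack = []

    @contextlib.contextmanager
    def group(self, name):
        # Record entry and exit so the nesting is visible in the trace.
        path = "/".join(self._stack + [name])
        self._stack.append(name)
        self.events.append(("start", path))
        try:
            yield
        finally:
            self._stack.pop()
            self.events.append(("end", path))

tracer = Tracer()
with tracer.group("evaluation"):          # evaluation-level group
    for row in range(2):
        with tracer.group(f"row-{row}"):  # row-level group per sample
            with tracer.group("faithfulness"):
                pass  # metric-level LLM calls would run here
```

Because every metric call is enclosed in its row's group, a cost or tracing handler attached at the top level can attribute each LLM call to a specific sample and metric.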
Default Metrics
When no metrics are specified, the pipeline defaults to a standard set: answer relevancy, context precision, faithfulness, and context recall. These four metrics provide a comprehensive view of RAG pipeline quality.
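The fallback behavior can be sketched in a few lines. `resolve_metrics` is an invented helper, not ragas API; only the four metric names come from the text.

```python
# Sketch of the default-metric fallback when the caller passes no metrics.
DEFAULT_METRICS = ["answer_relevancy", "context_precision",
                   "faithfulness", "context_recall"]

def resolve_metrics(metrics=None):
    # When the caller supplies no metrics, fall back to the standard set.
    return list(metrics) if metrics is not None else list(DEFAULT_METRICS)
```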
Deprecation
The evaluate() function and its async counterpart aevaluate() are deprecated and emit a DeprecationWarning when called. The recommended replacement is the @experiment decorator pattern, which provides:
- A more flexible experiment tracking model.
- Better integration with modern async patterns.
- Improved composability with custom evaluation logic.
Documentation for the replacement pattern is available at the Ragas experiment documentation.
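The deprecation mechanism itself follows the standard Python pattern: the function warns, then runs its original body. The `legacy_evaluate` wrapper below is an invented stand-in, not the actual ragas `evaluate()`.

```python
# Sketch: emitting a DeprecationWarning from a deprecated entry point.
import warnings

def legacy_evaluate(dataset, metrics=None):
    warnings.warn(
        "evaluate() is deprecated; use the @experiment decorator instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this wrapper
    )
    return {"dataset": dataset, "metrics": metrics}  # stand-in result
```

`stacklevel=2` makes the warning report the caller's file and line, which is what makes such warnings actionable during migration.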
Relationship to Other Components
The legacy evaluation pipeline serves as the execution engine for the optimization process. Both the GeneticOptimizer and DSPyOptimizer internally call evaluate() to score candidate prompts against annotated datasets. In this context, the pipeline is used as an internal utility rather than a user-facing API.
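That internal-utility relationship can be shown with a toy optimizer loop: candidates are generated elsewhere, and evaluation is called purely to rank them. Both functions below are invented for illustration; the real optimizers call ragas' `evaluate()` with full metric suites.

```python
# Toy sketch: an optimizer using an evaluation function as an internal scorer.
def evaluate_prompt(prompt, dataset):
    # Stand-in scoring rule: reward prompts that mention each sample's keyword.
    return sum(1.0 for sample in dataset if sample["keyword"] in prompt) / len(dataset)

def pick_best(candidates, dataset):
    # The optimizer treats evaluation purely as a black-box fitness function.
    return max(candidates, key=lambda p: evaluate_prompt(p, dataset))
```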
Implemented By
- Implementation:Explodinggradients_Ragas_Evaluate_Function
See Also
- Genetic Prompt Optimization -- Uses the pipeline internally for candidate evaluation.
- Human Annotation Collection -- Provides datasets processed by the pipeline.
- Optimization Loss Functions -- Applied to pipeline results during optimization.
- Heuristic:Explodinggradients_Ragas_Deprecation_Migration_Guide