Principle:Explodinggradients_Ragas_Legacy_Evaluation_Pipeline
Legacy Evaluation Pipeline
Legacy Evaluation Pipeline is a principle in the Ragas evaluation toolkit that describes the original batch evaluation pattern for running multiple metrics across a dataset in a single function call. This pipeline has been deprecated in favor of the @experiment decorator pattern.
Motivation
Evaluating LLM applications requires running multiple metrics (faithfulness, answer correctness, context precision, etc.) across many samples. The legacy evaluation pipeline provided a single entry point that handled:
- Running multiple metrics on each sample in the dataset.
- Managing LLM and embedding model injection into metrics that require them.
- Parallel execution of metric scoring tasks.
- Aggregating individual scores into a unified result object.
- Tracking costs, traces, and callbacks throughout the evaluation.
Theoretical Foundation
Batch Evaluation Orchestration
The evaluation pipeline follows an orchestration pattern where a central function coordinates multiple independent scoring operations:
- Dataset validation -- The pipeline verifies that the dataset contains the columns required by each metric and that the sample types (single-turn vs. multi-turn) are supported.
- Model injection -- For metrics that require an LLM or embedding model but do not have one set, the pipeline injects the globally provided LLM/embeddings. If none are provided, it falls back to creating a default OpenAI-based model.
- Metric initialization -- Each metric's `init()` method is called with the runtime configuration to prepare internal state.
- Task submission -- For each sample in the dataset and each metric, an async scoring task is submitted to an executor. The executor manages concurrency, timeout, and retry logic.
- Result collection -- After all tasks complete, results are organized into a scores matrix (samples x metrics) and wrapped in an `EvaluationResult` object.
- Cleanup -- Model references injected by the pipeline are reset to `None` to avoid side effects.
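The six steps above can be sketched as a single async orchestrator. This is an illustrative toy, not the actual ragas source: `SimpleMetric`, `validate`, and `run_pipeline` are invented names, and the scoring logic is a stand-in for real LLM calls.

```python
# Illustrative sketch of the orchestration steps (validate, inject, init,
# submit, collect, clean up). All names here are invented for this example.
import asyncio

class SimpleMetric:
    """Toy metric with the lifecycle hooks the pipeline expects."""
    def __init__(self, name, required_columns):
        self.name = name
        self.required_columns = required_columns
        self.llm = None  # injected by the pipeline if unset

    def init(self, run_config):
        self.run_config = run_config  # prepare internal state

    async def ascore(self, sample):
        # Stand-in for a real LLM-backed scoring call.
        return float(len(sample["response"]) % 2)

def validate(dataset, metrics):
    # Step 1: every metric's required columns must exist in every sample.
    for metric in metrics:
        for sample in dataset:
            missing = set(metric.required_columns) - set(sample)
            if missing:
                raise ValueError(f"{metric.name} missing columns: {missing}")

async def run_pipeline(dataset, metrics, llm=None, run_config=None):
    validate(dataset, metrics)
    for metric in metrics:
        if metric.llm is None:           # Step 2: model injection
            metric.llm = llm
        metric.init(run_config)          # Step 3: metric initialization
    # Step 4: one independent task per (sample, metric) pair.
    tasks = [metric.ascore(sample) for sample in dataset for metric in metrics]
    flat = await asyncio.gather(*tasks)
    # Step 5: reshape flat results into a samples x metrics score matrix.
    n = len(metrics)
    scores = [flat[i:i + n] for i in range(0, len(flat), n)]
    for metric in metrics:               # Step 6: reset injected references
        metric.llm = None
    return scores
```

The cleanup step matters because metric objects are often module-level singletons: without resetting the injected model, one evaluation run would silently leak its LLM into the next.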
Parallel Execution
The pipeline uses an Executor class that manages an async task pool. Each metric-sample combination is submitted as an independent coroutine. The executor supports:
- Batch size control -- Limiting the number of concurrent tasks.
- Timeout -- Per-task timeout from the `RunConfig`.
- Progress tracking -- An optional progress bar showing evaluation progress.
- Error handling -- Failed tasks can either raise exceptions or return `NaN` values.
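A minimal sketch of that executor behavior, assuming a semaphore for batch-size control, `asyncio.wait_for` for per-task timeouts, and `NaN` substitution on failure. The `run_all` helper is invented for illustration and is not the ragas `Executor` implementation.

```python
# Sketch: concurrency-limited async execution with timeout and NaN-on-failure.
import asyncio
import math

async def run_all(coros, batch_size=4, timeout=1.0, raise_exceptions=False):
    sem = asyncio.Semaphore(batch_size)  # at most batch_size tasks in flight

    async def guarded(coro):
        async with sem:
            try:
                return await asyncio.wait_for(coro, timeout=timeout)
            except Exception:
                if raise_exceptions:
                    raise
                return math.nan  # failed or timed-out tasks score as NaN

    # gather preserves input order, so results line up with the submitted tasks.
    return await asyncio.gather(*(guarded(c) for c in coros))
```

Returning `NaN` instead of raising keeps one bad sample from aborting a long evaluation run, at the cost of requiring NaN-aware aggregation afterwards.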
Callback Integration
The pipeline integrates with LangChain's callback system to provide:
- Tracing -- A `RagasTracer` records the execution trace of each metric for debugging and analysis.
- Cost tracking -- An optional `CostCallbackHandler` accumulates token usage from LLM calls.
- Custom callbacks -- Users can provide additional callbacks for logging, monitoring, or integration with observability platforms.
The callback hierarchy creates nested groups: an evaluation-level group contains row-level groups, which contain metric-level calls.
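The nested grouping can be illustrated with a toy tracer. The `Tracer` class below is invented for this sketch; the real pipeline builds the hierarchy with LangChain callback managers and ragas' `RagasTracer`.

```python
# Sketch of the evaluation -> row -> metric callback nesting described above.
import contextlib

class Tracer:
    def __init__(self):
        self.events = []
        self._stack = []

    @contextlib.contextmanager
    def group(self, name):
        # Record entry and exit so the nesting is visible in the trace.
        path = "/".join(self._stack + [name])
        self._stack.append(name)
        self.events.append(("start", path))
        try:
            yield
        finally:
            self._stack.pop()
            self.events.append(("end", path))

tracer = Tracer()
with tracer.group("evaluation"):          # evaluation-level group
    for row in range(2):
        with tracer.group(f"row-{row}"):  # row-level group per sample
            with tracer.group("faithfulness"):
                pass  # metric-level LLM calls would run here
```

Because every metric call is enclosed in its row's group, a cost or tracing handler attached at the top level can attribute each LLM call to a specific sample and metric.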
Default Metrics
When no metrics are specified, the pipeline defaults to a standard set: answer relevancy, context precision, faithfulness, and context recall. These four metrics provide a comprehensive view of RAG pipeline quality.
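The fallback behavior can be sketched in a few lines. `resolve_metrics` is an invented helper, not ragas API; only the four metric names come from the text.

```python
# Sketch of the default-metric fallback when the caller passes no metrics.
DEFAULT_METRICS = ["answer_relevancy", "context_precision",
                   "faithfulness", "context_recall"]

def resolve_metrics(metrics=None):
    # When the caller supplies no metrics, fall back to the standard set.
    return list(metrics) if metrics is not None else list(DEFAULT_METRICS)
```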
Deprecation
The evaluate() function and its async counterpart aevaluate() are deprecated and emit a DeprecationWarning when called. The recommended replacement is the @experiment decorator pattern, which provides:
- A more flexible experiment tracking model.
- Better integration with modern async patterns.
- Improved composability with custom evaluation logic.
Documentation for the replacement pattern is available at the Ragas experiment documentation.
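The deprecation mechanism itself follows the standard Python pattern: the function warns, then runs its original body. The `legacy_evaluate` wrapper below is an invented stand-in, not the actual ragas `evaluate()`.

```python
# Sketch: emitting a DeprecationWarning from a deprecated entry point.
import warnings

def legacy_evaluate(dataset, metrics=None):
    warnings.warn(
        "evaluate() is deprecated; use the @experiment decorator instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this wrapper
    )
    return {"dataset": dataset, "metrics": metrics}  # stand-in result
```

`stacklevel=2` makes the warning report the caller's file and line, which is what makes such warnings actionable during migration.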
Relationship to Other Components
The legacy evaluation pipeline serves as the execution engine for the optimization process. Both the GeneticOptimizer and DSPyOptimizer internally call evaluate() to score candidate prompts against annotated datasets. In this context, the pipeline is used as an internal utility rather than a user-facing API.
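That internal-utility relationship can be shown with a toy optimizer loop: candidates are generated elsewhere, and evaluation is called purely to rank them. Both functions below are invented for illustration; the real optimizers call ragas' `evaluate()` with full metric suites.

```python
# Toy sketch: an optimizer using an evaluation function as an internal scorer.
def evaluate_prompt(prompt, dataset):
    # Stand-in scoring rule: reward prompts that mention each sample's keyword.
    return sum(1.0 for sample in dataset if sample["keyword"] in prompt) / len(dataset)

def pick_best(candidates, dataset):
    # The optimizer treats evaluation purely as a black-box fitness function.
    return max(candidates, key=lambda p: evaluate_prompt(p, dataset))
```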
Implemented By
- Implementation:Explodinggradients_Ragas_Evaluate_Function
See Also
- Genetic Prompt Optimization -- Uses the pipeline internally for candidate evaluation.
- Human Annotation Collection -- Provides datasets processed by the pipeline.
- Optimization Loss Functions -- Applied to pipeline results during optimization.
- Heuristic:Explodinggradients_Ragas_Deprecation_Migration_Guide