
Implementation:Explodinggradients Ragas Evaluate Function

From Leeroopedia


Evaluate Function

The evaluate() function implements the Legacy Evaluation Pipeline principle in the Ragas evaluation toolkit. It provides a single-call interface for running multiple evaluation metrics across a dataset.

NOTE: This function is DEPRECATED. Use the @experiment decorator instead. See the Ragas experiment documentation for migration guidance.

Source Location

  • File: src/ragas/evaluation.py
  • evaluate() function: Lines 349-484
  • aevaluate() async function: Lines 59-345

The synchronous evaluate() is a thin wrapper around the async aevaluate(), using either nest_asyncio (for Jupyter compatibility) or asyncio.run().
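
The dispatch pattern can be sketched with the standard library (a simplified stand-in, not the actual Ragas code; `evaluate_stub` and `_aevaluate_stub` are hypothetical names):

```python
import asyncio

async def _aevaluate_stub(x):
    # Stand-in for the real aevaluate() coroutine.
    return x * 2

def evaluate_stub(x, allow_nest_asyncio=True):
    # Sketch of evaluate()'s sync-over-async dispatch: if an event loop is
    # already running (e.g. in Jupyter), re-enter it via nest_asyncio;
    # otherwise fall back to a plain asyncio.run().
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        loop = None
    if loop is not None and allow_nest_asyncio:
        import nest_asyncio
        nest_asyncio.apply()
        return loop.run_until_complete(_aevaluate_stub(x))
    return asyncio.run(_aevaluate_stub(x))
```

Outside a running event loop the `asyncio.run()` branch is taken, so plain scripts never touch nest_asyncio.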

Import

from ragas import evaluate

Or directly:

from ragas.evaluation import evaluate

Function Signature

@track_was_completed
def evaluate(
    dataset: Union[Dataset, EvaluationDataset],
    metrics: Optional[Sequence[Metric]] = None,
    llm: Optional[Union[BaseRagasLLM, LangchainLLM]] = None,
    embeddings: Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding, LangchainEmbeddings]] = None,
    experiment_name: Optional[str] = None,
    callbacks: Callbacks = None,
    run_config: Optional[RunConfig] = None,
    token_usage_parser: Optional[TokenUsageParser] = None,
    raise_exceptions: bool = False,
    column_map: Optional[Dict[str, str]] = None,
    show_progress: bool = True,
    batch_size: Optional[int] = None,
    _run_id: Optional[UUID] = None,
    _pbar: Optional[tqdm] = None,
    return_executor: bool = False,
    allow_nest_asyncio: bool = True,
) -> Union[EvaluationResult, Executor]

Parameters

Each parameter below is listed with its type, default, and description:

  • dataset (Union[Dataset, EvaluationDataset], required) -- The dataset to evaluate. Accepts both HuggingFace Dataset and Ragas EvaluationDataset.
  • metrics (Optional[Sequence[Metric]], default None) -- List of metric instances. If None, defaults to answer_relevancy, context_precision, faithfulness, context_recall.
  • llm (Optional[Union[BaseRagasLLM, LangchainLLM]], default None) -- LLM for metrics that require one. Falls back to gpt-4o-mini via OpenAI if not provided.
  • embeddings (Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding, LangchainEmbeddings]], default None) -- Embedding model for metrics that require one. Inferred from the LLM provider if not provided.
  • experiment_name (Optional[str], default None) -- Name for tracing/tracking the evaluation run.
  • callbacks (Callbacks, default None) -- LangChain callbacks for lifecycle events.
  • run_config (Optional[RunConfig], default None) -- Runtime configuration for timeout and retries.
  • token_usage_parser (Optional[TokenUsageParser], default None) -- Parser for extracting token usage from LLM responses. Required for cost calculation.
  • raise_exceptions (bool, default False) -- If True, raises on metric failure. If False, returns NaN for failed samples.
  • column_map (Optional[Dict[str, str]], default None) -- Maps expected column names to the actual column names in the dataset (e.g., {"contexts": "contexts_v1"}).
  • show_progress (bool, default True) -- Whether to display a progress bar.
  • batch_size (Optional[int], default None) -- Limits concurrent tasks. If None, no batching is applied.
  • return_executor (bool, default False) -- If True, returns the Executor instance for cancellable execution instead of running to completion.
  • allow_nest_asyncio (bool, default True) -- Whether to use nest_asyncio for Jupyter compatibility. Set to False in production async applications.

Return Value

  • Default (return_executor=False): Returns an EvaluationResult object containing:
    • scores -- List of dictionaries mapping metric names to scores.
    • dataset -- The original evaluation dataset.
    • binary_columns -- List of metric names that produce binary outputs.
    • cost_cb -- Cost callback handler (if token usage parser was provided).
    • traces -- Parsed execution traces.
  • With return_executor=True: Returns the Executor instance, allowing the caller to cancel execution or retrieve results later.
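
The cancellable-execution pattern behind return_executor=True can be illustrated with a hypothetical stand-in (the real Executor's API is not shown here and may differ):

```python
class ExecutorSketch:
    # Hypothetical stand-in for the Ragas Executor: jobs are queued but
    # not run until the caller asks for results, and the whole batch can
    # be cancelled beforehand.
    def __init__(self):
        self._jobs = []
        self._cancelled = False

    def submit(self, fn, *args):
        self._jobs.append((fn, args))

    def cancel(self):
        self._cancelled = True

    def results(self):
        if self._cancelled:
            return []
        return [fn(*args) for fn, args in self._jobs]

ex = ExecutorSketch()
ex.submit(lambda: {"faithfulness": 0.9})
scores = ex.results()
```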

Deprecation Warning

Both evaluate() (lines 447-453) and aevaluate() (lines 105-111) emit a DeprecationWarning:

warnings.warn(
    "evaluate() is deprecated and will be removed in a future version. "
    "Use the @experiment decorator instead. "
    "See https://docs.ragas.io/en/latest/concepts/experiment/ for more information.",
    DeprecationWarning,
    stacklevel=2,
)

Internal Execution Flow

1. Input Validation and Conversion

  • Accepts both HuggingFace Dataset and Ragas EvaluationDataset.
  • Remaps column names using column_map.
  • Converts v1 dataset format to v2 if needed.
  • Validates that required columns exist and metrics are supported for the sample type.
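
The column remapping in step 1 amounts to renaming dataset keys; a minimal sketch (remap_columns is a hypothetical helper, not the Ragas internal):

```python
def remap_columns(rows, column_map):
    # column_map maps expected name -> actual dataset column,
    # e.g. {"contexts": "contexts_v1"}; invert it to rename each row's keys.
    rename = {actual: expected for expected, actual in column_map.items()}
    return [{rename.get(k, k): v for k, v in row.items()} for row in rows]

rows = [{"user_input": "What is Python?", "contexts_v1": ["..."]}]
remapped = remap_columns(rows, {"contexts": "contexts_v1"})
# each row now exposes "contexts" instead of "contexts_v1"
```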

2. Model Injection (Lines 164-200)

For each metric:

  • If the metric requires an LLM (MetricWithLLM) and none is set, the pipeline injects the provided or default LLM.
  • If the metric requires embeddings (MetricWithEmbeddings) and none is set, the pipeline infers the embedding provider from the LLM or creates a default.
  • AspectCritic metrics are identified as binary and tracked in binary_columns.
  • Each metric's init(run_config) is called.
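
The injection pass can be sketched as follows (class and function names here are illustrative stand-ins, not the Ragas types):

```python
class MetricSketch:
    # Minimal stand-in for a Ragas metric.
    def __init__(self, name):
        self.name = name

class MetricWithLLMSketch(MetricSketch):
    # Stand-in for MetricWithLLM: carries an optional llm slot.
    def __init__(self, name, llm=None):
        super().__init__(name)
        self.llm = llm

def inject_models(metrics, default_llm, binary_names=("aspect_critic",)):
    # Fill in a missing LLM on each metric that needs one, and record
    # which metrics produce binary outputs. Injected metrics are
    # remembered so the cleanup step can reset them afterwards.
    binary_columns, injected = [], []
    for m in metrics:
        if isinstance(m, MetricWithLLMSketch) and m.llm is None:
            m.llm = default_llm
            injected.append(m)
        if m.name in binary_names:
            binary_columns.append(m.name)
    return binary_columns, injected

metrics = [MetricWithLLMSketch("faithfulness"), MetricSketch("aspect_critic")]
binary_columns, injected = inject_models(metrics, default_llm="gpt-4o-mini")
```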

3. Callback Setup (Lines 214-241)

  • Creates a RagasTracer for execution tracing.
  • Optionally creates a CostCallbackHandler for token usage tracking.
  • Creates a top-level evaluation chain group with nested row-level groups.

4. Task Submission (Lines 243-278)

For each sample and metric:

  • Single-turn samples: submits metric.single_turn_ascore(sample, callbacks)
  • Multi-turn samples: submits metric.multi_turn_ascore(sample, callbacks)
  • Tasks are named as {metric_name}-{sample_index} with a timeout from run_config.
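
The submission loop can be sketched with asyncio (a simplified stand-in; `single_turn_ascore_stub` replaces the real metric call):

```python
import asyncio

async def single_turn_ascore_stub(metric_name, sample_index):
    # Stand-in for metric.single_turn_ascore(sample, callbacks).
    await asyncio.sleep(0)
    return (f"{metric_name}-{sample_index}", 1.0)

async def submit_all(metric_names, n_samples):
    # One task per (sample, metric) pair, named {metric_name}-{sample_index}.
    tasks = [
        asyncio.create_task(
            single_turn_ascore_stub(m, i), name=f"{m}-{i}"
        )
        for i in range(n_samples)
        for m in metric_names
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(submit_all(["faithfulness", "answer_relevancy"], 2))
```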

5. Result Collection (Lines 284-328)

  • Collects all results from the executor.
  • Organizes results into a scores list (one dict per sample, one key per metric).
  • Handles ModeMetric instances by including the mode in the key name.
  • Constructs the final EvaluationResult with scores, dataset, traces, and cost information.
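
The score-organizing step can be sketched like this (collect_scores is a hypothetical helper, assuming sample-major ordering of the flat results):

```python
import math

def collect_scores(flat_results, metric_names, n_samples):
    # Fold the executor's flat, sample-major output into one dict per
    # sample (one key per metric). With raise_exceptions=False, a failed
    # sample arrives here as NaN rather than an exception.
    it = iter(flat_results)
    return [
        {name: next(it) for name in metric_names}
        for _ in range(n_samples)
    ]

scores = collect_scores(
    [0.95, 0.87, float("nan"), 0.91],
    ["faithfulness", "answer_relevancy"],
    n_samples=2,
)
```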

6. Cleanup (Lines 329-343)

  • Resets LLM and embedding references on metrics that were injected by the pipeline.
  • Flushes the analytics batcher.
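
The reset of injected models amounts to the following (a sketch with hypothetical names, not the Ragas internal):

```python
def reset_injected(injected_metrics):
    # Clear models that the pipeline itself injected, so caller-owned
    # metric objects leave evaluate() in the same state they entered it.
    for m in injected_metrics:
        m.llm = None

class _MetricStub:
    llm = "injected-default"

m = _MetricStub()
reset_injected([m])
```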

Usage Example

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.dataset_schema import EvaluationDataset, SingleTurnSample

# Create dataset
dataset = EvaluationDataset(samples=[
    SingleTurnSample(
        user_input="What is Python?",
        response="Python is a programming language.",
        retrieved_contexts=["Python is a high-level programming language."],
    ),
])

# Run evaluation (deprecated - use @experiment instead)
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy],
)

print(result)
# {'faithfulness': 0.95, 'answer_relevancy': 0.87}

# Access per-sample scores
df = result.to_pandas()

Internal Use by Optimizers

The GeneticOptimizer.evaluate_candidate() method (lines 566-595 of genetic.py) calls evaluate() internally to score candidate prompts:

results = evaluate(
    eval_dataset,
    metrics=[self.metric],
    llm=self.llm,
    run_config=run_config,
    batch_size=batch_size,
    callbacks=callbacks,
    raise_exceptions=raise_exceptions,
    _run_id=run_id,
    _pbar=parent_pbar,
    return_executor=False,
)

This is a critical internal dependency: prompt optimization relies on the evaluation pipeline to measure fitness.
