
Implementation:Explodinggradients Ragas Evaluate Function

From Leeroopedia


Evaluate Function

The evaluate() function implements the Legacy Evaluation Pipeline principle in the Ragas evaluation toolkit. It provides a single-call interface for running multiple evaluation metrics across a dataset.

NOTE: This function is DEPRECATED. Use the @experiment decorator instead. See the Ragas experiment documentation for migration guidance.

Source Location

  • File: src/ragas/evaluation.py
  • evaluate() function: Lines 349-484
  • aevaluate() async function: Lines 59-345

The synchronous evaluate() is a thin wrapper around the async aevaluate(), using either nest_asyncio (for Jupyter compatibility) or asyncio.run().
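
The dispatch pattern can be sketched with the standard library (a simplified stand-in, not the actual Ragas code; `evaluate_stub` and `_aevaluate_stub` are hypothetical names):

```python
import asyncio

async def _aevaluate_stub(x):
    # Stand-in for the real aevaluate() coroutine.
    return x * 2

def evaluate_stub(x, allow_nest_asyncio=True):
    # Sketch of evaluate()'s sync-over-async dispatch: if an event loop is
    # already running (e.g. in Jupyter), re-enter it via nest_asyncio;
    # otherwise fall back to a plain asyncio.run().
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        loop = None
    if loop is not None and allow_nest_asyncio:
        import nest_asyncio
        nest_asyncio.apply()
        return loop.run_until_complete(_aevaluate_stub(x))
    return asyncio.run(_aevaluate_stub(x))
```

Outside a running event loop the `asyncio.run()` branch is taken, so plain scripts never touch nest_asyncio.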

Import

from ragas import evaluate

Or directly:

from ragas.evaluation import evaluate

Function Signature

@track_was_completed
def evaluate(
    dataset: Union[Dataset, EvaluationDataset],
    metrics: Optional[Sequence[Metric]] = None,
    llm: Optional[Union[BaseRagasLLM, LangchainLLM]] = None,
    embeddings: Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding, LangchainEmbeddings]] = None,
    experiment_name: Optional[str] = None,
    callbacks: Callbacks = None,
    run_config: Optional[RunConfig] = None,
    token_usage_parser: Optional[TokenUsageParser] = None,
    raise_exceptions: bool = False,
    column_map: Optional[Dict[str, str]] = None,
    show_progress: bool = True,
    batch_size: Optional[int] = None,
    _run_id: Optional[UUID] = None,
    _pbar: Optional[tqdm] = None,
    return_executor: bool = False,
    allow_nest_asyncio: bool = True,
) -> Union[EvaluationResult, Executor]

Parameters

Each parameter below is listed with its type, default, and description:

  • dataset (Union[Dataset, EvaluationDataset], required) -- The dataset to evaluate. Accepts both HuggingFace Dataset and Ragas EvaluationDataset.
  • metrics (Optional[Sequence[Metric]], default None) -- List of metric instances. If None, defaults to answer_relevancy, context_precision, faithfulness, context_recall.
  • llm (Optional[Union[BaseRagasLLM, LangchainLLM]], default None) -- LLM for metrics that require one. Falls back to gpt-4o-mini via OpenAI if not provided.
  • embeddings (Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding, LangchainEmbeddings]], default None) -- Embedding model for metrics that require one. Inferred from the LLM provider if not provided.
  • experiment_name (Optional[str], default None) -- Name for tracing/tracking the evaluation run.
  • callbacks (Callbacks, default None) -- LangChain callbacks for lifecycle events.
  • run_config (Optional[RunConfig], default None) -- Runtime configuration for timeout and retries.
  • token_usage_parser (Optional[TokenUsageParser], default None) -- Parser for extracting token usage from LLM responses. Required for cost calculation.
  • raise_exceptions (bool, default False) -- If True, raises on metric failure. If False, returns NaN for failed samples.
  • column_map (Optional[Dict[str, str]], default None) -- Maps expected column names to the actual column names in the dataset (e.g., {"contexts": "contexts_v1"}).
  • show_progress (bool, default True) -- Whether to display a progress bar.
  • batch_size (Optional[int], default None) -- Limits concurrent tasks. If None, no batching is applied.
  • return_executor (bool, default False) -- If True, returns the Executor instance for cancellable execution instead of running to completion.
  • allow_nest_asyncio (bool, default True) -- Whether to use nest_asyncio for Jupyter compatibility. Set to False in production async applications.

Return Value

  • Default (return_executor=False): Returns an EvaluationResult object containing:
    • scores -- List of dictionaries mapping metric names to scores.
    • dataset -- The original evaluation dataset.
    • binary_columns -- List of metric names that produce binary outputs.
    • cost_cb -- Cost callback handler (if token usage parser was provided).
    • traces -- Parsed execution traces.
  • With return_executor=True: Returns the Executor instance, allowing the caller to cancel execution or retrieve results later.
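
The cancellable-execution pattern behind return_executor=True can be illustrated with a hypothetical stand-in (the real Executor's API is not shown here and may differ):

```python
class ExecutorSketch:
    # Hypothetical stand-in for the Ragas Executor: jobs are queued but
    # not run until the caller asks for results, and the whole batch can
    # be cancelled beforehand.
    def __init__(self):
        self._jobs = []
        self._cancelled = False

    def submit(self, fn, *args):
        self._jobs.append((fn, args))

    def cancel(self):
        self._cancelled = True

    def results(self):
        if self._cancelled:
            return []
        return [fn(*args) for fn, args in self._jobs]

ex = ExecutorSketch()
ex.submit(lambda: {"faithfulness": 0.9})
scores = ex.results()
```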

Deprecation Warning

Both evaluate() (lines 447-453) and aevaluate() (lines 105-111) emit a DeprecationWarning:

warnings.warn(
    "evaluate() is deprecated and will be removed in a future version. "
    "Use the @experiment decorator instead. "
    "See https://docs.ragas.io/en/latest/concepts/experiment/ for more information.",
    DeprecationWarning,
    stacklevel=2,
)

Internal Execution Flow

1. Input Validation and Conversion

  • Accepts both HuggingFace Dataset and Ragas EvaluationDataset.
  • Remaps column names using column_map.
  • Converts v1 dataset format to v2 if needed.
  • Validates that required columns exist and metrics are supported for the sample type.
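
The column remapping in step 1 amounts to renaming dataset keys; a minimal sketch (remap_columns is a hypothetical helper, not the Ragas internal):

```python
def remap_columns(rows, column_map):
    # column_map maps expected name -> actual dataset column,
    # e.g. {"contexts": "contexts_v1"}; invert it to rename each row's keys.
    rename = {actual: expected for expected, actual in column_map.items()}
    return [{rename.get(k, k): v for k, v in row.items()} for row in rows]

rows = [{"user_input": "What is Python?", "contexts_v1": ["..."]}]
remapped = remap_columns(rows, {"contexts": "contexts_v1"})
# each row now exposes "contexts" instead of "contexts_v1"
```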

2. Model Injection (Lines 164-200)

For each metric:

  • If the metric requires an LLM (MetricWithLLM) and none is set, the pipeline injects the provided or default LLM.
  • If the metric requires embeddings (MetricWithEmbeddings) and none is set, the pipeline infers the embedding provider from the LLM or creates a default.
  • AspectCritic metrics are identified as binary and tracked in binary_columns.
  • Each metric's init(run_config) is called.
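
The injection pass can be sketched as follows (class and function names here are illustrative stand-ins, not the Ragas types):

```python
class MetricSketch:
    # Minimal stand-in for a Ragas metric.
    def __init__(self, name):
        self.name = name

class MetricWithLLMSketch(MetricSketch):
    # Stand-in for MetricWithLLM: carries an optional llm slot.
    def __init__(self, name, llm=None):
        super().__init__(name)
        self.llm = llm

def inject_models(metrics, default_llm, binary_names=("aspect_critic",)):
    # Fill in a missing LLM on each metric that needs one, and record
    # which metrics produce binary outputs. Injected metrics are
    # remembered so the cleanup step can reset them afterwards.
    binary_columns, injected = [], []
    for m in metrics:
        if isinstance(m, MetricWithLLMSketch) and m.llm is None:
            m.llm = default_llm
            injected.append(m)
        if m.name in binary_names:
            binary_columns.append(m.name)
    return binary_columns, injected

metrics = [MetricWithLLMSketch("faithfulness"), MetricSketch("aspect_critic")]
binary_columns, injected = inject_models(metrics, default_llm="gpt-4o-mini")
```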

3. Callback Setup (Lines 214-241)

  • Creates a RagasTracer for execution tracing.
  • Optionally creates a CostCallbackHandler for token usage tracking.
  • Creates a top-level evaluation chain group with nested row-level groups.

4. Task Submission (Lines 243-278)

For each sample and metric:

  • Single-turn samples: submits metric.single_turn_ascore(sample, callbacks)
  • Multi-turn samples: submits metric.multi_turn_ascore(sample, callbacks)
  • Tasks are named as {metric_name}-{sample_index} with a timeout from run_config.
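
The submission loop can be sketched with asyncio (a simplified stand-in; `single_turn_ascore_stub` replaces the real metric call):

```python
import asyncio

async def single_turn_ascore_stub(metric_name, sample_index):
    # Stand-in for metric.single_turn_ascore(sample, callbacks).
    await asyncio.sleep(0)
    return (f"{metric_name}-{sample_index}", 1.0)

async def submit_all(metric_names, n_samples):
    # One task per (sample, metric) pair, named {metric_name}-{sample_index}.
    tasks = [
        asyncio.create_task(
            single_turn_ascore_stub(m, i), name=f"{m}-{i}"
        )
        for i in range(n_samples)
        for m in metric_names
    ]
    return await asyncio.gather(*tasks)

results = asyncio.run(submit_all(["faithfulness", "answer_relevancy"], 2))
```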

5. Result Collection (Lines 284-328)

  • Collects all results from the executor.
  • Organizes results into a scores list (one dict per sample, one key per metric).
  • Handles ModeMetric instances by including the mode in the key name.
  • Constructs the final EvaluationResult with scores, dataset, traces, and cost information.
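
The score-organizing step can be sketched like this (collect_scores is a hypothetical helper, assuming sample-major ordering of the flat results):

```python
import math

def collect_scores(flat_results, metric_names, n_samples):
    # Fold the executor's flat, sample-major output into one dict per
    # sample (one key per metric). With raise_exceptions=False, a failed
    # sample arrives here as NaN rather than an exception.
    it = iter(flat_results)
    return [
        {name: next(it) for name in metric_names}
        for _ in range(n_samples)
    ]

scores = collect_scores(
    [0.95, 0.87, float("nan"), 0.91],
    ["faithfulness", "answer_relevancy"],
    n_samples=2,
)
```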

6. Cleanup (Lines 329-343)

  • Resets LLM and embedding references on metrics that were injected by the pipeline.
  • Flushes the analytics batcher.
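
The reset of injected models amounts to the following (a sketch with hypothetical names, not the Ragas internal):

```python
def reset_injected(injected_metrics):
    # Clear models that the pipeline itself injected, so caller-owned
    # metric objects leave evaluate() in the same state they entered it.
    for m in injected_metrics:
        m.llm = None

class _MetricStub:
    llm = "injected-default"

m = _MetricStub()
reset_injected([m])
```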

Usage Example

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.dataset_schema import EvaluationDataset, SingleTurnSample

# Create dataset
dataset = EvaluationDataset(samples=[
    SingleTurnSample(
        user_input="What is Python?",
        response="Python is a programming language.",
        retrieved_contexts=["Python is a high-level programming language."],
    ),
])

# Run evaluation (deprecated - use @experiment instead)
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy],
)

print(result)
# {'faithfulness': 0.95, 'answer_relevancy': 0.87}

# Access per-sample scores
df = result.to_pandas()

Internal Use by Optimizers

The GeneticOptimizer.evaluate_candidate() method (lines 566-595 of genetic.py) calls evaluate() internally to score candidate prompts:

results = evaluate(
    eval_dataset,
    metrics=[self.metric],
    llm=self.llm,
    run_config=run_config,
    batch_size=batch_size,
    callbacks=callbacks,
    raise_exceptions=raise_exceptions,
    _run_id=run_id,
    _pbar=parent_pbar,
    return_executor=False,
)

This is a critical internal dependency: prompt optimization relies on the evaluation pipeline to measure fitness.
