Implementation:Explodinggradients Ragas Evaluate Function
Evaluate Function
The evaluate() function implements the Legacy Evaluation Pipeline principle in the Ragas evaluation toolkit. It provides a single-call interface for running multiple evaluation metrics across a dataset.
NOTE: This function is DEPRECATED. Use the @experiment decorator instead. See the Ragas experiment documentation for migration guidance.
Source Location
- File: `src/ragas/evaluation.py`
- `evaluate()` function: Lines 349-484
- `aevaluate()` async function: Lines 59-345
The synchronous evaluate() is a thin wrapper around the async aevaluate(), using either nest_asyncio (for Jupyter compatibility) or asyncio.run().
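For illustration, a minimal sketch of that sync-over-async bridge; `evaluate_sync` is a hypothetical name, the body is not the library's exact code, and it assumes `aevaluate` is importable from `ragas.evaluation` as noted in Source Location.

```python
import asyncio

from ragas.evaluation import aevaluate  # per the Source Location above


def evaluate_sync(*args, allow_nest_asyncio: bool = True, **kwargs):
    """Sketch of driving the async aevaluate() from synchronous code."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        loop = None

    if loop is not None and allow_nest_asyncio:
        # Already inside an event loop (e.g. Jupyter): patch the loop so it
        # can be re-entered, then block until the coroutine finishes.
        import nest_asyncio
        nest_asyncio.apply()
        return loop.run_until_complete(aevaluate(*args, **kwargs))

    # No running loop (plain scripts): let asyncio create and manage one.
    return asyncio.run(aevaluate(*args, **kwargs))
```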
Import
from ragas import evaluate
Or directly:
from ragas.evaluation import evaluate
Function Signature
@track_was_completed
def evaluate(
dataset: Union[Dataset, EvaluationDataset],
metrics: Optional[Sequence[Metric]] = None,
llm: Optional[Union[BaseRagasLLM, LangchainLLM]] = None,
embeddings: Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding, LangchainEmbeddings]] = None,
experiment_name: Optional[str] = None,
callbacks: Callbacks = None,
run_config: Optional[RunConfig] = None,
token_usage_parser: Optional[TokenUsageParser] = None,
raise_exceptions: bool = False,
column_map: Optional[Dict[str, str]] = None,
show_progress: bool = True,
batch_size: Optional[int] = None,
_run_id: Optional[UUID] = None,
_pbar: Optional[tqdm] = None,
return_executor: bool = False,
allow_nest_asyncio: bool = True,
) -> Union[EvaluationResult, Executor]
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset` | `Union[Dataset, EvaluationDataset]` | required | The dataset to evaluate. Accepts both HuggingFace `Dataset` and Ragas `EvaluationDataset`. |
| `metrics` | `Optional[Sequence[Metric]]` | `None` | List of metric instances. If `None`, defaults to `answer_relevancy`, `context_precision`, `faithfulness`, `context_recall`. |
| `llm` | `Optional[Union[BaseRagasLLM, LangchainLLM]]` | `None` | LLM for metrics that require one. Falls back to `gpt-4o-mini` via OpenAI if not provided. |
| `embeddings` | `Optional[Union[BaseRagasEmbeddings, BaseRagasEmbedding, LangchainEmbeddings]]` | `None` | Embedding model for metrics that require one. Inferred from the LLM provider if not provided. |
| `experiment_name` | `Optional[str]` | `None` | Name for tracing/tracking the evaluation run. |
| `callbacks` | `Callbacks` | `None` | LangChain callbacks for lifecycle events. |
| `run_config` | `Optional[RunConfig]` | `None` | Runtime configuration for timeout and retries. |
| `token_usage_parser` | `Optional[TokenUsageParser]` | `None` | Parser for extracting token usage from LLM responses. Required for cost calculation. |
| `raise_exceptions` | `bool` | `False` | If `True`, raises on metric failure. If `False`, returns NaN for failed samples. |
| `column_map` | `Optional[Dict[str, str]]` | `None` | Maps dataset column names to expected names (e.g., `{"contexts": "contexts_v1"}`). |
| `show_progress` | `bool` | `True` | Whether to display a progress bar. |
| `batch_size` | `Optional[int]` | `None` | Limits concurrent tasks. If `None`, no batching is applied. |
| `return_executor` | `bool` | `False` | If `True`, returns the `Executor` instance for cancellable execution instead of running to completion. |
| `allow_nest_asyncio` | `bool` | `True` | Whether to use `nest_asyncio` for Jupyter compatibility. Set to `False` in production async applications. |
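The most common non-default arguments are `llm`, `embeddings`, and `batch_size`. A hedged example, assuming the `langchain-openai` package is installed; the LangChain model and embedding objects are passed directly, as the signature above permits:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas import evaluate
from ragas.dataset_schema import EvaluationDataset, SingleTurnSample
from ragas.metrics import faithfulness

dataset = EvaluationDataset(samples=[
    SingleTurnSample(
        user_input="What is Python?",
        response="Python is a programming language.",
        retrieved_contexts=["Python is a high-level programming language."],
    ),
])

result = evaluate(
    dataset=dataset,
    metrics=[faithfulness],
    llm=ChatOpenAI(model="gpt-4o-mini"),  # LangChain LLM, accepted per the signature
    embeddings=OpenAIEmbeddings(),        # LangChain embeddings, accepted per the signature
    batch_size=8,                         # limit concurrent metric tasks
    show_progress=True,
)
```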
Return Value
- Default (`return_executor=False`): Returns an `EvaluationResult` object containing:
  - `scores` -- List of dictionaries mapping metric names to scores.
  - `dataset` -- The original evaluation dataset.
  - `binary_columns` -- List of metric names that produce binary outputs.
  - `cost_cb` -- Cost callback handler (if a token usage parser was provided).
  - `traces` -- Parsed execution traces.
- With `return_executor=True`: Returns the `Executor` instance, allowing the caller to cancel execution or retrieve results later.
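A hedged sketch of consuming both return types, with `dataset` built as in the example above. `result.scores` and `result.to_pandas()` follow the fields listed here; the exact `Executor` control API (cancellation, later retrieval) is assumed and should be checked against the Executor documentation.

```python
from ragas import evaluate
from ragas.metrics import faithfulness

# Default path: run to completion and inspect the EvaluationResult.
result = evaluate(dataset=dataset, metrics=[faithfulness])
print(result.scores)      # e.g. [{"faithfulness": 0.95}, ...] -- one dict per sample
df = result.to_pandas()   # one row per sample, one column per metric

# Cancellable path: get the Executor back instead of running to completion.
executor = evaluate(dataset=dataset, metrics=[faithfulness], return_executor=True)
# ... the executor can then be driven, cancelled, or polled by the caller
# (control-flow methods assumed; see the Executor docs for the exact API).
```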
Deprecation Warning
Both evaluate() (lines 447-453) and aevaluate() (lines 105-111) emit a DeprecationWarning:
warnings.warn(
"evaluate() is deprecated and will be removed in a future version. "
"Use the @experiment decorator instead. "
"See https://docs.ragas.io/en/latest/concepts/experiment/ for more information.",
DeprecationWarning,
stacklevel=2,
)
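If the warning is noisy while a migration is still pending, it can be suppressed with the standard `warnings` module; this filter targets the message text quoted above and is otherwise plain standard-library usage:

```python
import warnings

# Silence only this specific deprecation warning until the migration
# to the @experiment decorator is complete.
warnings.filterwarnings(
    "ignore",
    message=r"evaluate\(\) is deprecated",
    category=DeprecationWarning,
)
```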
Internal Execution Flow
1. Input Validation and Conversion
- Accepts both HuggingFace `Dataset` and Ragas `EvaluationDataset`.
- Remaps column names using `column_map`.
- Converts v1 dataset format to v2 if needed.
- Validates that required columns exist and metrics are supported for the sample type.
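A hedged illustration of step 1's input handling: a HuggingFace `Dataset` can also be converted up front. `EvaluationDataset.from_hf_dataset` is assumed to exist as a convenience constructor; `evaluate()` accepts the HuggingFace `Dataset` directly as well.

```python
from datasets import Dataset
from ragas.dataset_schema import EvaluationDataset

# A HuggingFace Dataset with the standard column names.
hf_ds = Dataset.from_dict({
    "user_input": ["What is Python?"],
    "response": ["Python is a programming language."],
    "retrieved_contexts": [["Python is a high-level programming language."]],
})

# Assumed convenience constructor; evaluate() performs an equivalent
# conversion internally when handed the HuggingFace Dataset directly.
eval_ds = EvaluationDataset.from_hf_dataset(hf_ds)
```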
2. Model Injection (Lines 164-200)
For each metric:
- If the metric requires an LLM (`MetricWithLLM`) and none is set, the pipeline injects the provided or default LLM.
- If the metric requires embeddings (`MetricWithEmbeddings`) and none is set, the pipeline infers the embedding provider from the LLM or creates a default.
- `AspectCritic` metrics are identified as binary and tracked in `binary_columns`.
- Each metric's `init(run_config)` is called.
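A conceptual sketch of that injection step, assuming the `MetricWithLLM` / `MetricWithEmbeddings` mixins from `ragas.metrics.base`; it mirrors the behavior described above rather than reproducing the source.

```python
from ragas.metrics.base import MetricWithEmbeddings, MetricWithLLM


def inject_models(metrics, llm, embeddings, run_config):
    """Sketch: fill in missing models on each metric before execution."""
    binary_columns = []
    for metric in metrics:
        if isinstance(metric, MetricWithLLM) and metric.llm is None:
            metric.llm = llm                    # provided or default LLM
        if isinstance(metric, MetricWithEmbeddings) and metric.embeddings is None:
            metric.embeddings = embeddings      # provided or inferred embeddings
        if type(metric).__name__ == "AspectCritic":
            binary_columns.append(metric.name)  # AspectCritic scores are binary
        metric.init(run_config)                 # per-metric setup hook
    return binary_columns
```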
3. Callback Setup (Lines 214-241)
- Creates a `RagasTracer` for execution tracing.
- Optionally creates a `CostCallbackHandler` for token usage tracking.
- Creates a top-level evaluation chain group with nested row-level groups.
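The `CostCallbackHandler` is only created when a `token_usage_parser` is passed. A hedged usage example, assuming the built-in OpenAI parser in `ragas.cost` and a token accessor on the result object, with `dataset` built as in the earlier examples:

```python
from ragas import evaluate
from ragas.cost import get_token_usage_for_openai  # assumed built-in parser for OpenAI responses
from ragas.metrics import faithfulness

result = evaluate(
    dataset=dataset,                                 # built as in the earlier examples
    metrics=[faithfulness],
    token_usage_parser=get_token_usage_for_openai,   # enables the CostCallbackHandler
)

# Token/cost accessors are assumed to be exposed on the result when a parser was given.
print(result.total_tokens())
```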
4. Task Submission (Lines 243-278)
For each sample and metric:
- Single-turn samples: submits `metric.single_turn_ascore(sample, callbacks)`.
- Multi-turn samples: submits `metric.multi_turn_ascore(sample, callbacks)`.
- Tasks are named as `{metric_name}-{sample_index}` with a timeout from `run_config`.
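A simplified, self-contained stand-in for that submission loop, using plain `asyncio` instead of the library's `Executor`; the function name and the `run_config.timeout` attribute are assumptions based on the description above.

```python
import asyncio

from ragas.dataset_schema import SingleTurnSample


async def score_all(samples, metrics, run_config, callbacks=None):
    """Sketch: schedule one scoring coroutine per (sample, metric) pair."""
    tasks = {}
    for i, sample in enumerate(samples):
        for metric in metrics:
            score_fn = (
                metric.single_turn_ascore
                if isinstance(sample, SingleTurnSample)
                else metric.multi_turn_ascore
            )
            coro = asyncio.wait_for(
                score_fn(sample, callbacks=callbacks),
                timeout=run_config.timeout,  # assumed RunConfig attribute
            )
            tasks[f"{metric.name}-{i}"] = asyncio.ensure_future(coro)
    return {name: await task for name, task in tasks.items()}
```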
5. Result Collection (Lines 284-328)
- Collects all results from the executor.
- Organizes results into a scores list (one dict per sample, one key per metric).
- Handles `ModeMetric` instances by including the mode in the key name.
- Constructs the final `EvaluationResult` with scores, dataset, traces, and cost information.
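A small sketch of the reshaping into per-sample score dicts, assuming the executor returns results in sample-major order (one value per submitted task); the helper name is hypothetical.

```python
def collect_scores(flat_results, metrics, n_samples):
    """Sketch: turn a flat result list into one {metric_name: score} dict per sample."""
    n_metrics = len(metrics)
    scores = []
    for i in range(n_samples):
        row = flat_results[i * n_metrics : (i + 1) * n_metrics]
        scores.append({metric.name: value for metric, value in zip(metrics, row)})
    return scores
```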
6. Cleanup (Lines 329-343)
- Resets LLM and embedding references on metrics that were injected by the pipeline.
- Flushes the analytics batcher.
Usage Example
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.dataset_schema import EvaluationDataset, SingleTurnSample
# Create dataset
dataset = EvaluationDataset(samples=[
SingleTurnSample(
user_input="What is Python?",
response="Python is a programming language.",
retrieved_contexts=["Python is a high-level programming language."],
),
])
# Run evaluation (deprecated - use @experiment instead)
result = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy],
)
print(result)
# {'faithfulness': 0.95, 'answer_relevancy': 0.87}
# Access per-sample scores
df = result.to_pandas()
Internal Use by Optimizers
The GeneticOptimizer.evaluate_candidate() method (lines 566-595 of genetic.py) calls evaluate() internally to score candidate prompts:
results = evaluate(
eval_dataset,
metrics=[self.metric],
llm=self.llm,
run_config=run_config,
batch_size=batch_size,
callbacks=callbacks,
raise_exceptions=raise_exceptions,
_run_id=run_id,
_pbar=parent_pbar,
return_executor=False,
)
This is a critical internal dependency: prompt optimization relies on the evaluation pipeline to measure fitness.
Implements
- Legacy Evaluation Pipeline -- the single-call, multi-metric evaluation interface described above.
See Also
- GeneticOptimizer Class -- Uses evaluate() for candidate fitness evaluation.
- MetricAnnotation Class -- Annotation data converted to EvaluationDataset for evaluation.
- Loss Classes -- Applied to evaluation results during optimization.
- Environment:Explodinggradients_Ragas_Python_Runtime_Environment
- Heuristic:Explodinggradients_Ragas_Retry_And_Backoff_Configuration
- Heuristic:Explodinggradients_Ragas_Concurrency_And_Rate_Limiting
- Heuristic:Explodinggradients_Ragas_Failed_Metrics_Return_NaN
- Heuristic:Explodinggradients_Ragas_Deprecation_Migration_Guide