Implementation:Huggingface Transformers Trainer Evaluate
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training, Model Validation |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete tool for running model evaluation on a dataset and returning computed metrics, provided by the HuggingFace Transformers library.
Description
Trainer.evaluate() runs the model in inference mode over an evaluation dataset and returns a dictionary of metrics. It constructs an evaluation DataLoader, runs the evaluation_loop (shared with predict()), computes speed metrics, logs results, and fires the on_evaluate callback.
Key behaviors include:
- Multiple dataset support -- When passed a dictionary of datasets, it evaluates on each one separately and prefixes metrics with the dataset name.
- Metric delegation -- If a compute_metrics function was provided at Trainer initialization, predictions and labels are passed to it for task-specific metric computation.
- Memory tracking -- When memory metrics are enabled (skip_memory_metrics=False in TrainingArguments), memory usage metrics are collected and included in the output.
- Speed metrics -- Samples per second and steps per second are calculated and reported.
- Callback integration -- The on_evaluate callback hook is fired after evaluation completes, enabling custom logging or early stopping logic.
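The speed metrics mentioned above come down to dividing sample and step counts by wall-clock runtime. The following is a minimal sketch of that arithmetic, not the library's implementation; the helper name speed_metrics_sketch and its parameters are hypothetical and chosen for illustration only.

```python
import time


def speed_metrics_sketch(prefix, start_time, num_samples, num_steps, end_time=None):
    """Hypothetical helper: derive throughput figures from a wall-clock runtime."""
    # Allow an explicit end_time so the example is deterministic.
    end_time = time.time() if end_time is None else end_time
    runtime = end_time - start_time
    metrics = {f"{prefix}_runtime": round(runtime, 4)}
    if runtime > 0:  # avoid division by zero on degenerate timings
        metrics[f"{prefix}_samples_per_second"] = round(num_samples / runtime, 3)
        metrics[f"{prefix}_steps_per_second"] = round(num_steps / runtime, 3)
    return metrics


# 1000 samples in 125 batches over a 4-second evaluation pass
example = speed_metrics_sketch(
    "eval", start_time=100.0, num_samples=1000, num_steps=125, end_time=104.0
)
print(example)
```

Throughput figures of this kind include data loading and metric computation time, so they are best compared between runs with the same setup.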
Usage
Call trainer.evaluate() after training to get final metrics, or rely on the Trainer to call it automatically during training when eval_strategy is set to "steps" or "epoch". You can also call it independently at any time to evaluate the current model state.
Code Reference
Source Location
- Repository: transformers
- File: src/transformers/trainer.py (lines 2492-2591)
Signature
def evaluate(
    self,
    eval_dataset: Dataset | dict[str, Dataset] | None = None,
    ignore_keys: list[str] | None = None,
    metric_key_prefix: str = "eval",
) -> dict[str, float]:
Import
from transformers import Trainer
# evaluate() is an instance method called on a Trainer object
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| eval_dataset | Dataset or dict[str, Dataset] | No | Dataset to evaluate on. If None, uses the eval_dataset provided during Trainer initialization. If a dict, evaluates on each dataset separately with prefixed metric names |
| ignore_keys | list[str] | No | Keys in the model output dictionary to exclude when gathering predictions |
| metric_key_prefix | str | No | Prefix for metric names in the returned dictionary (default: "eval"). For example, loss becomes "eval_loss" |
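The effect of metric_key_prefix can be illustrated in plain Python. This is a sketch of the naming rule described in the table, assuming that keys already carrying the prefix are left unchanged; add_metric_prefix is a hypothetical helper, not part of the Transformers API.

```python
def add_metric_prefix(metrics, metric_key_prefix="eval"):
    """Hypothetical helper: return a copy of `metrics` with every key prefixed."""
    prefixed = {}
    for key, value in metrics.items():
        if key.startswith(f"{metric_key_prefix}_"):
            # Assumption: keys that already carry the prefix are kept as-is.
            prefixed[key] = value
        else:
            prefixed[f"{metric_key_prefix}_{key}"] = value
    return prefixed


raw = {"loss": 0.42, "accuracy": 0.91, "test_f1": 0.88}
prefixed_metrics = add_metric_prefix(raw, metric_key_prefix="test")
print(prefixed_metrics)
```

Choosing a distinct prefix per split (for example "test" for a held-out set) keeps results from different evaluations apart in logs.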
Outputs
| Name | Type | Description |
|---|---|---|
| metrics | dict[str, float] | Dictionary of evaluation metrics, with keys prefixed by metric_key_prefix ("eval" by default). Typically includes eval_loss (when the model returns a loss, which requires labels in the dataset), eval_runtime, eval_samples_per_second, and eval_steps_per_second. Additional metrics depend on the compute_metrics function provided during initialization |
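Because the exact keys depend on the prefix and on what compute_metrics returns, downstream code may prefer to read the dictionary with dict.get rather than direct indexing. A small sketch, using a hand-written sample dictionary in place of a real evaluation result (format_metrics is a hypothetical helper):

```python
def format_metrics(metrics, prefix="eval"):
    """Hypothetical helper: render the common metrics, tolerating missing keys."""
    lines = []
    for name in ("loss", "runtime", "samples_per_second", "steps_per_second"):
        key = f"{prefix}_{name}"
        value = metrics.get(key)  # None when the key is absent
        rendered = "n/a" if value is None else format(value, ".4f")
        lines.append(f"{key}: {rendered}")
    return lines


# Illustrative values only, not output from a real run.
sample = {"eval_loss": 0.4213, "eval_runtime": 12.5, "eval_accuracy": 0.9}
report = format_metrics(sample)
print("\n".join(report))
```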
Usage Examples
Basic Usage
# Evaluate using the eval_dataset from initialization
metrics = trainer.evaluate()
print(f"Eval loss: {metrics['eval_loss']:.4f}")
print(f"Eval samples/sec: {metrics['eval_samples_per_second']:.1f}")
Evaluating on a Custom Dataset
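from datasets import load_dataset
# Assumes `trainer` already exists and `tokenize_fn` is the tokenization
# function that was used to prepare the training data.
test_dataset = load_dataset("imdb", split="test")
tokenized_test = test_dataset.map(tokenize_fn, batched=True)
# Metric keys are prefixed with "test_" instead of the default "eval_"
metrics = trainer.evaluate(eval_dataset=tokenized_test, metric_key_prefix="test")
print(f"Test loss: {metrics['test_loss']:.4f}")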
Evaluating on Multiple Datasets
metrics = trainer.evaluate(
    eval_dataset={
        "validation": val_dataset,
        "test": test_dataset,
    }
)
# Returns metrics like:
# {"eval_validation_loss": 0.42, "eval_test_loss": 0.45, ...}
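The key naming shown above follows from evaluating each dataset separately with a prefix that combines metric_key_prefix and the dataset name, then merging the results. The sketch below reproduces that pattern with a stand-in evaluation function; evaluate_many and fake_evaluate are hypothetical and do not call the Trainer.

```python
def evaluate_many(evaluate_one, datasets, metric_key_prefix="eval"):
    """Hypothetical helper: evaluate each named dataset and merge the metrics."""
    merged = {}
    for name, dataset in datasets.items():
        # Each dataset gets its own prefix, e.g. "eval_validation".
        merged.update(evaluate_one(dataset, metric_key_prefix=f"{metric_key_prefix}_{name}"))
    return merged


def fake_evaluate(dataset, metric_key_prefix):
    """Stand-in for a real evaluation: the 'loss' is the mean of the values."""
    return {f"{metric_key_prefix}_loss": sum(dataset) / len(dataset)}


combined = evaluate_many(
    fake_evaluate,
    {"validation": [0.25, 0.75], "test": [0.5, 1.0]},
)
print(combined)
```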
With Custom Metrics
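import numpy as np
from transformers import Trainer, TrainingArguments
def compute_metrics(eval_pred):
    # eval_pred unpacks into the model outputs (logits) and the reference labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = (predictions == labels).mean()
    # Cast to a plain float so the value serializes cleanly in logs
    return {"accuracy": float(accuracy)}
# Assumes `model`, `args` (a TrainingArguments instance), and `eval_dataset`
# are already defined.
trainer = Trainer(
    model=model,
    args=args,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
metrics = trainer.evaluate()
# The "accuracy" key is reported with the default "eval_" prefix
print(f"Accuracy: {metrics['eval_accuracy']:.4f}")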