Implementation:Huggingface Transformers Trainer Evaluate

Knowledge Sources	Transformers, Transformers Docs
Domains	NLP, Training, Model Validation
Last Updated	2026-02-13 00:00 GMT

Overview

Concrete tool for running model evaluation on a dataset and returning computed metrics, provided by the HuggingFace Transformers library.

Description

Trainer.evaluate() runs the model in inference mode over an evaluation dataset and returns a dictionary of metrics. It constructs an evaluation DataLoader, runs the evaluation_loop (shared with predict()), computes speed metrics, logs results, and fires the on_evaluate callback.

Key behaviors include:

Multiple dataset support -- When passed a dictionary of datasets, it evaluates on each one separately and inserts the dataset name into each metric key after the metric prefix (for example, eval_validation_loss).
Metric delegation -- If a compute_metrics function was provided at Trainer initialization, predictions and labels are passed to it for task-specific metric computation.
Memory tracking -- When memory tracking is enabled (skip_memory_metrics=False in TrainingArguments), memory usage metrics are collected and included in the output.
Speed metrics -- Samples per second and steps per second are calculated and reported.
Callback integration -- The on_evaluate callback hook is fired after evaluation completes, enabling custom logging or early stopping logic.
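The speed metrics listed above amount to dividing sample and step counts by the elapsed wall-clock time. The following is a minimal sketch of that arithmetic, not the library's own code: the helper name speed_metrics_sketch, its parameters, and the example numbers are all assumptions made for illustration.

```python
import time


def speed_metrics_sketch(prefix, start_time, num_samples, num_steps, end_time=None):
    """Hypothetical helper: derive throughput metrics from elapsed wall-clock time."""
    end_time = time.time() if end_time is None else end_time
    runtime = end_time - start_time
    metrics = {f"{prefix}_runtime": runtime}
    if runtime > 0:  # avoid dividing by zero for an instantaneous run
        metrics[f"{prefix}_samples_per_second"] = num_samples / runtime
        metrics[f"{prefix}_steps_per_second"] = num_steps / runtime
    return metrics


# Illustrative numbers: 1000 samples in 125 batches over 4 seconds.
example = speed_metrics_sketch(
    "eval", start_time=0.0, num_samples=1000, num_steps=125, end_time=4.0
)
```

With these inputs the sketch reports 250.0 samples per second and 31.25 steps per second. The values the Trainer reports may be rounded differently.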

Usage

Call trainer.evaluate() after training to get final metrics, or rely on the Trainer to call it automatically during training when eval_strategy is set to "steps" or "epoch". You can also call it independently at any time to evaluate the current model state.
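As a rough illustration of the automatic path, the configuration fragment below asks the Trainer to evaluate every fixed number of training steps. The output directory, step interval, and batch size are placeholder values chosen for this sketch, not recommendations from the source.

```python
from transformers import TrainingArguments

# Placeholder values: adjust output_dir, eval_steps and batch size to your setup.
args = TrainingArguments(
    output_dir="./results",
    eval_strategy="steps",          # use "epoch" to evaluate once per epoch instead
    eval_steps=500,                 # only consulted when eval_strategy="steps"
    per_device_eval_batch_size=32,
)
```

Passing these arguments to a Trainer that also has an eval_dataset lets training trigger evaluate() on that schedule. A manual trainer.evaluate() call remains available at any point.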

Code Reference

Source Location

Repository: transformers
File: src/transformers/trainer.py (lines 2492-2591)

Signature

def evaluate(
    self,
    eval_dataset: Dataset | dict[str, Dataset] | None = None,
    ignore_keys: list[str] | None = None,
    metric_key_prefix: str = "eval",
) -> dict[str, float]:

Import

from transformers import Trainer
# evaluate() is an instance method called on a Trainer object

I/O Contract

Inputs

Name	Type	Required	Description
eval_dataset	Dataset or dict[str, Dataset]	No	Dataset to evaluate on. If None, uses the eval_dataset provided during Trainer initialization. If a dict, evaluates on each dataset separately with prefixed metric names
ignore_keys	list[str]	No	Keys in the model output dictionary to exclude when gathering predictions
metric_key_prefix	str	No	Prefix for metric names in the returned dictionary (default: "eval"). For example, loss becomes "eval_loss"

Outputs

Name	Type	Description
metrics	dict[str, float]	Dictionary containing evaluation metrics. With the default prefix, it includes eval_runtime, eval_samples_per_second, and eval_steps_per_second, plus eval_loss when the model returns a loss (that is, when the evaluation batches contain labels). Additional metrics depend on the compute_metrics function provided during initialization
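The effect of metric_key_prefix on the returned keys can be pictured with a small stand-alone sketch. The function add_metric_prefix below is a hypothetical helper written for this page; it only illustrates the naming rule described above and is not taken from the library.

```python
def add_metric_prefix(metrics, metric_key_prefix="eval"):
    """Hypothetical helper: make every key start with '<prefix>_'."""
    prefixed = {}
    for key, value in metrics.items():
        if key.startswith(f"{metric_key_prefix}_"):
            prefixed[key] = value  # already prefixed, keep as-is
        else:
            prefixed[f"{metric_key_prefix}_{key}"] = value
    return prefixed


# Made-up values standing in for a loss, a custom metric, and a timing metric.
raw = {"loss": 0.42, "accuracy": 0.91, "test_runtime": 3.5}
prefixed = add_metric_prefix(raw, metric_key_prefix="test")
```

Here a custom metric returned as "accuracy" ends up under "test_accuracy", which is why the custom-metrics example reads metrics['eval_accuracy'] under the default prefix.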

Usage Examples

Basic Usage

# Evaluate using the eval_dataset from initialization
metrics = trainer.evaluate()
print(f"Eval loss: {metrics['eval_loss']:.4f}")
print(f"Eval samples/sec: {metrics['eval_samples_per_second']:.1f}")

Evaluating on a Custom Dataset

from datasets import load_dataset

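test_dataset = load_dataset("imdb", split="test")
# tokenize_fn is assumed to be defined elsewhere (a function that applies your tokenizer to a batch)
tokenized_test = test_dataset.map(tokenize_fn, batched=True)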

metrics = trainer.evaluate(eval_dataset=tokenized_test, metric_key_prefix="test")
print(f"Test loss: {metrics['test_loss']:.4f}")

Evaluating on Multiple Datasets

metrics = trainer.evaluate(
    eval_dataset={
        "validation": val_dataset,
        "test": test_dataset,
    }
)
# Returns metrics like:
# {"eval_validation_loss": 0.42, "eval_test_loss": 0.45, ...}
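The key naming in the multi-dataset case can be sketched as a loop that evaluates each dataset under a combined prefix and merges the results. This is a simplified illustration under stated assumptions: evaluate_many and fake_evaluate are hypothetical names, and fake_evaluate merely averages a list of numbers in place of a real evaluation pass.

```python
def evaluate_many(evaluate_one, datasets, metric_key_prefix="eval"):
    """Hypothetical helper: evaluate each named dataset and merge the metrics."""
    merged = {}
    for name, dataset in datasets.items():
        # Each dataset gets its own prefix, e.g. "eval_validation".
        merged.update(
            evaluate_one(dataset, metric_key_prefix=f"{metric_key_prefix}_{name}")
        )
    return merged


def fake_evaluate(dataset, metric_key_prefix):
    # Stand-in for a real evaluation: the "loss" is just the mean of the values.
    return {f"{metric_key_prefix}_loss": sum(dataset) / len(dataset)}


metrics = evaluate_many(
    fake_evaluate,
    {"validation": [0.5, 0.25], "test": [0.25, 0.75]},
)
```

Because every dataset contributes keys under its own prefix, the merged dictionary never overwrites one dataset's loss with another's.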

With Custom Metrics

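import numpy as np
from transformers import Trainer

def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's predictions (logits) and the reference labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = (predictions == labels).mean()
    # Convert the NumPy scalar to a plain float so it logs and serializes cleanly
    return {"accuracy": float(accuracy)}

# model, args (a TrainingArguments instance), and eval_dataset are assumed to be defined already
trainer = Trainer(
    model=model,
    args=args,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)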

metrics = trainer.evaluate()
print(f"Accuracy: {metrics['eval_accuracy']:.4f}")

Related Pages

Implements Principle

Principle:Huggingface_Transformers_Model_Evaluation
