Implementation:Huggingface Transformers Trainer Evaluate

From Leeroopedia
Knowledge Sources
Domains NLP, Training, Model Validation
Last Updated 2026-02-13 00:00 GMT

Overview

Concrete tool for running model evaluation on a dataset and returning computed metrics, provided by the HuggingFace Transformers library.

Description

Trainer.evaluate() runs the model in inference mode over an evaluation dataset and returns a dictionary of metrics. It constructs an evaluation DataLoader, runs the evaluation_loop (shared with predict()), computes speed metrics, logs results, and fires the on_evaluate callback.

Key behaviors include:

  • Multiple dataset support -- When passed a dictionary of datasets, it evaluates on each one separately and prefixes metrics with the dataset name.
  • Metric delegation -- If a compute_metrics function was provided at Trainer initialization, predictions and labels are passed to it for task-specific metric computation.
  • Memory tracking -- When TrainingArguments.skip_memory_metrics is set to False, memory usage metrics are collected and included in the output.
  • Speed metrics -- Samples per second and steps per second are calculated and reported.
  • Callback integration -- The on_evaluate callback hook is fired after evaluation completes, enabling custom logging or early stopping logic.
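
The multiple-dataset and prefixing behaviors can be illustrated without the library. The following is a minimal sketch, not the Trainer source: evaluate_one is a hypothetical stand-in for a single-dataset evaluation pass, and the "loss" values are made up for illustration.

```python
def add_prefix(metrics: dict[str, float], prefix: str) -> dict[str, float]:
    """Prefix every metric name that does not already carry the prefix."""
    return {
        key if key.startswith(f"{prefix}_") else f"{prefix}_{key}": value
        for key, value in metrics.items()
    }


def evaluate_one(dataset: list[float], prefix: str) -> dict[str, float]:
    """Hypothetical single-dataset pass: the 'loss' is just the mean of the data."""
    raw = {"loss": sum(dataset) / len(dataset)}
    return add_prefix(raw, prefix)


def evaluate_many(datasets: dict[str, list[float]], prefix: str = "eval") -> dict[str, float]:
    """Evaluate each named dataset separately and merge the prefixed results."""
    merged: dict[str, float] = {}
    for name, dataset in datasets.items():
        # The dataset name is appended to the prefix, e.g. "eval_validation".
        merged.update(evaluate_one(dataset, prefix=f"{prefix}_{name}"))
    return merged


metrics = evaluate_many({"validation": [0.5, 0.5], "test": [0.25, 0.75]})
print(sorted(metrics))
```

Because each dataset gets its own prefix, the merged dictionary never has two datasets writing to the same key.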

Usage

Call trainer.evaluate() after training to get final metrics, or rely on the Trainer to call it automatically during training when eval_strategy is set to "steps" or "epoch". You can also call it independently at any time to evaluate the current model state.
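
As a hedged sketch of the automatic path, the keyword arguments below ask the Trainer to evaluate every 500 steps. The output directory, step count, and batch size are illustrative assumptions. eval_steps and per_device_eval_batch_size are TrainingArguments fields not discussed on this page, and older releases spell the first option evaluation_strategy.

```python
# Keyword arguments for transformers.TrainingArguments (values are illustrative).
eval_kwargs = {
    "output_dir": "out",               # assumed path for checkpoints and logs
    "eval_strategy": "steps",          # or "epoch" to evaluate once per epoch
    "eval_steps": 500,                 # evaluate every 500 training steps
    "per_device_eval_batch_size": 32,  # batch size used by the evaluation DataLoader
}


def build_training_args():
    """Create TrainingArguments; imported lazily so the dict can be inspected alone."""
    from transformers import TrainingArguments

    return TrainingArguments(**eval_kwargs)
```

The resulting object would be passed as args when constructing the Trainer.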

Code Reference

Source Location

  • Repository: transformers
  • File: src/transformers/trainer.py (lines 2492-2591)

Signature

def evaluate(
    self,
    eval_dataset: Dataset | dict[str, Dataset] | None = None,
    ignore_keys: list[str] | None = None,
    metric_key_prefix: str = "eval",
) -> dict[str, float]:

Import

from transformers import Trainer
# evaluate() is an instance method called on a Trainer object

I/O Contract

Inputs

Name | Type | Required | Description
eval_dataset | Dataset or dict[str, Dataset] | No | Dataset to evaluate on. If None, uses the eval_dataset provided during Trainer initialization. If a dict, evaluates on each dataset separately with prefixed metric names
ignore_keys | list[str] | No | Keys in the model output dictionary to exclude when gathering predictions
metric_key_prefix | str | No | Prefix for metric names in the returned dictionary (default: "eval"). For example, loss becomes "eval_loss"
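
The effect of ignore_keys can be pictured as a filter over a dictionary-style model output. This is a simplified sketch rather than the library code: the function name, output keys, and values are hypothetical, and it assumes the loss entry is handled separately from the predictions.

```python
from typing import Optional


def gather_predictions(outputs: dict, ignore_keys: Optional[list[str]] = None) -> tuple:
    """Keep every output value except the loss and the explicitly ignored keys."""
    skipped = set(ignore_keys or []) | {"loss"}
    return tuple(value for key, value in outputs.items() if key not in skipped)


# Hypothetical model output with a cache entry that should not be gathered.
outputs = {"loss": 0.31, "logits": [[0.1, 0.9]], "past_key_values": "cache"}
predictions = gather_predictions(outputs, ignore_keys=["past_key_values"])
print(predictions)
```

Excluding large auxiliary outputs this way keeps them from being accumulated across the whole evaluation set.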

Outputs

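Name | Type | Description
metrics | dict[str, float] | Dictionary containing evaluation metrics, with names prefixed by metric_key_prefix. With the default prefix it includes eval_runtime, eval_samples_per_second, and eval_steps_per_second, plus eval_loss when the model returns a loss (that is, when labels are available). Additional metrics depend on the compute_metrics function provided during initialization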
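
The throughput entries are simple ratios over the wall-clock runtime. A minimal sketch, assuming the runtime, sample count, and step count are already known. The function name and numbers are illustrative, and the library's own rounding may differ.

```python
def sketch_speed_metrics(prefix: str, runtime_s: float, num_samples: int, num_steps: int) -> dict[str, float]:
    """Derive throughput metrics from a measured runtime in seconds."""
    metrics = {f"{prefix}_runtime": runtime_s}
    if runtime_s > 0:  # avoid division by zero on a degenerate run
        metrics[f"{prefix}_samples_per_second"] = num_samples / runtime_s
        metrics[f"{prefix}_steps_per_second"] = num_steps / runtime_s
    return metrics


# 1,000 samples in 125 batches of 8, evaluated in 4 seconds.
result = sketch_speed_metrics("eval", runtime_s=4.0, num_samples=1000, num_steps=125)
print(result)
```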

Usage Examples

Basic Usage

# Evaluate using the eval_dataset from initialization
metrics = trainer.evaluate()
print(f"Eval loss: {metrics['eval_loss']:.4f}")
print(f"Eval samples/sec: {metrics['eval_samples_per_second']:.1f}")

Evaluating on a Custom Dataset

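from datasets import load_dataset

# tokenize_fn is assumed to be the same tokenization function used for the training data
test_dataset = load_dataset("imdb", split="test")
tokenized_test = test_dataset.map(tokenize_fn, batched=True)

# metric_key_prefix="test" renames the returned keys, e.g. "test_loss" instead of "eval_loss"
metrics = trainer.evaluate(eval_dataset=tokenized_test, metric_key_prefix="test")
print(f"Test loss: {metrics['test_loss']:.4f}")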

Evaluating on Multiple Datasets

metrics = trainer.evaluate(
    eval_dataset={
        "validation": val_dataset,
        "test": test_dataset,
    }
)
# Returns metrics like:
# {"eval_validation_loss": 0.42, "eval_test_loss": 0.45, ...}

With Custom Metrics

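import numpy as np
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    # eval_pred unpacks into the model predictions and the reference labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = (predictions == labels).mean()
    # Cast to a plain float to match the dict[str, float] return type
    return {"accuracy": float(accuracy)}

# model, args (a TrainingArguments instance), and eval_dataset are assumed to be defined
trainer = Trainer(
    model=model,
    args=args,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

metrics = trainer.evaluate()
# Keys returned by compute_metrics are prefixed, so "accuracy" becomes "eval_accuracy"
print(f"Accuracy: {metrics['eval_accuracy']:.4f}")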

Related Pages

Implements Principle
