Implementation:Huggingface Trl DPOTrainer Evaluate Save

Knowledge Sources	TRL, TRL Docs
Domains	NLP, RLHF
Last Updated	2026-02-06 17:00 GMT

Overview

Concrete tool for evaluating a DPO-trained model with preference-specific metrics and saving the trained weights, provided by the TRL library.

Description

The DPOTrainer provides an overridden evaluation_loop method that extends the standard Trainer evaluation with DPO-specific functionality:

Optional generation sampling: When generate_during_eval=True, the evaluation loop selects a random batch from the evaluation dataset and generates completions from both the policy model and the reference model. These generations are logged as a table to the configured experiment tracker (Weights and Biases, Comet ML, or MLflow) for qualitative analysis.

Standard evaluation: The parent class's evaluation_loop iterates over the evaluation dataloader and calls prediction_step on each batch. prediction_step in turn calls get_batch_loss_metrics with train_eval="eval" to compute the DPO loss and all preference metrics.

Metric aggregation: The store_metrics method accumulates per-batch metrics, and the log method averages them before sending to the experiment tracker.
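The accumulate-then-average pattern described under metric aggregation can be sketched in plain Python. This is a minimal illustration, not the TRL implementation: the class MetricBuffer and its methods store and flush are hypothetical names that stand in for store_metrics and the averaging step performed in log.

```python
from collections import defaultdict
from statistics import fmean


class MetricBuffer:
    """Hypothetical stand-in for the trainer's per-split metric storage."""

    def __init__(self):
        # One mapping of metric name -> list of per-batch values for each split.
        self._stored = {"train": defaultdict(list), "eval": defaultdict(list)}

    def store(self, metrics, train_eval="train"):
        # Append each per-batch value under its metric name.
        for key, value in metrics.items():
            self._stored[train_eval][key].append(value)

    def flush(self, train_eval="train"):
        # Average everything collected so far, then reset the split's buffer.
        averaged = {k: fmean(v) for k, v in self._stored[train_eval].items()}
        self._stored[train_eval].clear()
        return averaged


buffer = MetricBuffer()
buffer.store({"eval_rewards/margins": 0.5, "eval_rewards/accuracies": 1.0}, "eval")
buffer.store({"eval_rewards/margins": 1.5, "eval_rewards/accuracies": 0.0}, "eval")
averaged = buffer.flush("eval")
print(averaged)
```

Resetting the buffer after each flush keeps one logging interval from leaking into the next.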

The save_model method (inherited from the Transformers Trainer) saves model weights, tokenizer, and training arguments to the output directory. For PEFT models, only the adapter weights are saved. The push_to_hub method uploads the saved model to the Hugging Face Hub. A model card is automatically generated and saved alongside each checkpoint via the overridden _save_checkpoint method.
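To check whether a save directory holds adapter-only or full model weights, a small helper such as the following may be useful. It is a sketch under stated assumptions: describe_saved_model is a hypothetical helper, and the file names it looks for (adapter_model.*, model*.safetensors, pytorch_model*.bin) reflect common PEFT and Transformers naming rather than anything stated on this page.

```python
import tempfile
from pathlib import Path


def describe_saved_model(output_dir):
    """Classify a save directory by the weight files it contains (hypothetical helper)."""
    names = {p.name for p in Path(output_dir).iterdir() if p.is_file()}
    # Assumed PEFT naming: adapter_model.safetensors or adapter_model.bin.
    if any(n.startswith("adapter_model.") for n in names):
        return "adapter"
    # Assumed Transformers naming: model*.safetensors or pytorch_model*.bin.
    if any(
        n.startswith(("model", "pytorch_model")) and n.endswith((".safetensors", ".bin"))
        for n in names
    ):
        return "full"
    return "unknown"


# Demonstrate on empty placeholder files rather than real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    adapter_dir = Path(tmp) / "adapter"
    full_dir = Path(tmp) / "full"
    adapter_dir.mkdir()
    full_dir.mkdir()
    (adapter_dir / "adapter_model.safetensors").touch()
    (adapter_dir / "adapter_config.json").touch()
    (full_dir / "model.safetensors").touch()
    (full_dir / "config.json").touch()
    adapter_kind = describe_saved_model(adapter_dir)
    full_kind = describe_saved_model(full_dir)

print(adapter_kind, full_kind)
```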

The DPO training script (trl/scripts/dpo.py) orchestrates the final evaluation and saving sequence: after trainer.train() completes, it runs trainer.evaluate() (if eval_strategy is not "no"), logs and saves eval metrics, saves the model to the output directory, and optionally pushes to the Hub.

Usage

Use the evaluation and saving functionality when:

Running periodic evaluation during DPO training to monitor alignment quality
Computing final evaluation metrics after training completes
Saving the trained model for deployment
Comparing reward margins and accuracies across different configurations
Publishing aligned models to the Hugging Face Hub
Inspecting generated samples from policy vs. reference model

Code Reference

Source Location

Repository: TRL
File (evaluation_loop): trl/trainer/dpo_trainer.py (lines 1942-1996)
File (prediction_step): trl/trainer/dpo_trainer.py (lines 1901-1936)
File (store_metrics): trl/trainer/dpo_trainer.py (lines 1938-1940)
File (log): trl/trainer/dpo_trainer.py (lines 1998-2014)
File (script orchestration): trl/scripts/dpo.py (lines 153-164)

Signature

class DPOTrainer(BaseTrainer):

    def evaluation_loop(
        self,
        dataloader: DataLoader,
        description: str,
        prediction_loss_only: bool | None = None,
        ignore_keys: list[str] | None = None,
        metric_key_prefix: str = "eval",
    ) -> EvalLoopOutput:
        """
        Overriding built-in evaluation loop to store metrics for each batch.
        Optionally generates samples from policy and reference models.
        """

    def prediction_step(
        self,
        model: PreTrainedModel | nn.Module,
        inputs: dict[str, torch.Tensor | Any],
        prediction_loss_only: bool,
        ignore_keys: list[str] | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor | None, torch.Tensor | None]:
        """Compute eval loss and metrics for a single batch."""

    def store_metrics(
        self,
        metrics: dict[str, float],
        train_eval: Literal["train", "eval"] = "train",
    ) -> None:
        """Store per-batch metrics for later aggregation."""

    def save_model(
        self, output_dir: str | None = None
    ) -> None:  # inherited from Trainer
        ...

    def push_to_hub(self, **kwargs) -> str:  # inherited from Trainer
        ...

Import

# Methods are accessed on a DPOTrainer instance
from trl import DPOTrainer

# Evaluation output type
from transformers.trainer_utils import EvalLoopOutput

I/O Contract

Inputs

Name	Type	Required	Description
trainer (instance)	`DPOTrainer`	Yes	A DPOTrainer instance, typically evaluated after calling `.train()`; periodic evaluation also runs during training
eval_dataset	`Dataset or None`	No	Evaluation dataset; uses the one provided during initialization if not specified
output_dir	`str or None`	No	Directory to save model weights; defaults to `args.output_dir`
metric_key_prefix	`str`	No (default: "eval")	Prefix for metric keys in the output dictionary

Outputs

Name	Type	Description
eval_loss	`float`	Mean DPO loss over the evaluation set
eval_rewards/chosen	`float`	Mean implicit reward for chosen responses
eval_rewards/rejected	`float`	Mean implicit reward for rejected responses
eval_rewards/margins	`float`	Mean reward margin (chosen minus rejected)
eval_rewards/accuracies	`float`	Fraction of samples where chosen reward exceeds rejected reward
eval_logps/chosen	`float`	Mean log probability of chosen responses under the policy
eval_logps/rejected	`float`	Mean log probability of rejected responses under the policy
eval_logits/chosen	`float`	Mean logit values for chosen responses
eval_logits/rejected	`float`	Mean logit values for rejected responses
saved model	`files on disk`	Model weights, tokenizer, config, and model card saved to output_dir
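The reward rows in the table above can be illustrated with a short worked example. The sketch assumes the standard DPO definition of the implicit reward, beta times the difference between policy and reference log probabilities; the function dpo_reward_metrics and the sample numbers are hypothetical and serve only to show how the four reward metrics relate to one another.

```python
def dpo_reward_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Compute mean implicit rewards, margins, and accuracy from sequence log-probs."""
    # Implicit reward (assumed definition): beta * (log pi_policy - log pi_ref).
    chosen = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - r for c, r in zip(chosen, rejected)]
    n = len(margins)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        "rewards/margins": sum(margins) / n,
        # Fraction of pairs where the chosen reward exceeds the rejected reward.
        "rewards/accuracies": sum(c > r for c, r in zip(chosen, rejected)) / n,
    }


# Two illustrative preference pairs (made-up log probabilities).
result = dpo_reward_metrics(
    policy_chosen=[-10.0, -12.0],
    policy_rejected=[-15.0, -9.0],
    ref_chosen=[-11.0, -11.0],
    ref_rejected=[-13.0, -10.0],
    beta=0.5,
)
print(result)
```

In this example the policy prefers the chosen response more than the reference does for the first pair but not for the second, so the accuracy is one half even though the mean margin is positive.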

Usage Examples

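# Example 1: Evaluate after training and save
# Assumes model, ref_model, training_args, and dataset are already defined.
from trl import DPOConfig, DPOTrainer

trainer = DPOTrainer(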
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

# Train the model
trainer.train()

# Evaluate
metrics = trainer.evaluate()
print(f"Reward margin: {metrics['eval_rewards/margins']:.4f}")
print(f"Reward accuracy: {metrics['eval_rewards/accuracies']:.4f}")

# Log and save metrics
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

# Save the trained model
trainer.save_model("./dpo-trained-model")

# Example 2: Push to the Hugging Face Hub
trainer.push_to_hub(
    dataset_name="trl-lib/ultrafeedback_binarized",
)
# Model is now available at https://huggingface.co/{hub_model_id}

# Example 3: Periodic evaluation during training via DPOConfig
training_args = DPOConfig(
    output_dir="./dpo-output",
    eval_strategy="steps",
    eval_steps=50,
    per_device_eval_batch_size=4,
    # Metrics are logged every eval_steps
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
# eval_rewards/margins is logged every 50 steps

# Example 4: Full DPO script post-training pattern (from trl/scripts/dpo.py)
# Train the model
trainer.train()

if training_args.eval_strategy != "no":
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

# Save and push to Hub
trainer.save_model(training_args.output_dir)

if training_args.push_to_hub:
    trainer.push_to_hub(dataset_name=script_args.dataset_name)
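
Metrics written by trainer.save_metrics("eval", metrics) can be read back for later comparison across runs. The sketch below assumes the Transformers convention of writing a file named eval_results.json inside the output directory; load_eval_metrics is a hypothetical helper, and the demonstration writes its own placeholder file instead of relying on a real training run.

```python
import json
import tempfile
from pathlib import Path


def load_eval_metrics(output_dir, split="eval"):
    """Return the saved metrics dict for a split, or None if no file exists."""
    # Assumed file name: "<split>_results.json" inside the output directory.
    path = Path(output_dir) / f"{split}_results.json"
    if not path.exists():
        return None
    with path.open() as f:
        return json.load(f)


# Demonstrate with a placeholder metrics file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    placeholder = {"eval_rewards/margins": 0.25}
    (Path(tmp) / "eval_results.json").write_text(json.dumps(placeholder))
    loaded = load_eval_metrics(tmp)
    missing = load_eval_metrics(tmp, split="test")

print(loaded, missing)
```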

Related Pages

Implements Principle

Principle:Huggingface_Trl_DPO_Evaluation_and_Saving
