Implementation:Huggingface Trl DPOTrainer Evaluate Save

From Leeroopedia


Knowledge Sources
Domains: NLP, RLHF
Last Updated: 2026-02-06 17:00 GMT

Overview

Concrete tool for evaluating a DPO-trained model with preference-specific metrics and saving the trained weights, provided by the TRL library.

Description

The DPOTrainer provides an overridden evaluation_loop method that extends the standard Trainer evaluation with DPO-specific functionality:

  1. Optional generation sampling: When generate_during_eval=True, the evaluation loop selects a random batch from the evaluation dataset and generates completions from both the policy model and the reference model. These generations are logged as a table to the configured experiment tracker (Weights and Biases, Comet ML, or MLflow) for qualitative analysis.
  2. Standard evaluation: The parent class's evaluation_loop is called; it iterates over the evaluation dataloader and calls prediction_step for each batch. The prediction_step method calls get_batch_loss_metrics with train_eval="eval" to compute the DPO loss and all preference metrics.
  3. Metric aggregation: The store_metrics method accumulates per-batch metrics, and the log method averages them before sending them to the experiment tracker.
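The accumulate-then-average pattern in the last step can be sketched without TRL. The class below is a hypothetical stand-in, not the library's code: the name MetricStore, its flush method, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import fmean


class MetricStore:
    """Hypothetical stand-in for the trainer's per-batch metric bookkeeping."""

    def __init__(self):
        # One list of per-batch values per metric name, kept separately per phase.
        self._stored = {"train": defaultdict(list), "eval": defaultdict(list)}

    def store_metrics(self, metrics, train_eval="train"):
        # Append each batch-level value; nothing is averaged at this point.
        for key, value in metrics.items():
            self._stored[train_eval][key].append(value)

    def flush(self, train_eval="eval"):
        # Average every accumulated metric, then clear the buckets for the next round.
        averaged = {k: fmean(v) for k, v in self._stored[train_eval].items()}
        self._stored[train_eval].clear()
        return averaged


store = MetricStore()
# Made-up per-batch values, used only to exercise the sketch.
store.store_metrics({"eval_rewards/margins": 0.5, "eval_rewards/accuracies": 1.0}, "eval")
store.store_metrics({"eval_rewards/margins": 1.5, "eval_rewards/accuracies": 0.5}, "eval")
aggregated = store.flush("eval")
print(aggregated)
```

Note that this sketch takes an unweighted mean of batch-level values, so a smaller final batch counts as much as a full one.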

The save_model method (inherited from the Transformers Trainer) saves model weights, tokenizer, and training arguments to the output directory. For PEFT models, only the adapter weights are saved. The push_to_hub method uploads the saved model to the Hugging Face Hub. A model card is automatically generated and saved alongside each checkpoint via the overridden _save_checkpoint method.

The DPO training script (trl/scripts/dpo.py) orchestrates the final evaluation and saving sequence: after trainer.train() completes, it runs trainer.evaluate() (if eval_strategy is not "no"), logs and saves eval metrics, saves the model to the output directory, and optionally pushes to the Hub.

Usage

Use the evaluation and saving functionality when:

  • Running periodic evaluation during DPO training to monitor alignment quality
  • Computing final evaluation metrics after training completes
  • Saving the trained model for deployment
  • Comparing reward margins and accuracies across different configurations
  • Publishing aligned models to the Hugging Face Hub
  • Inspecting generated samples from policy vs. reference model
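For the sample-inspection use case, a configuration along the following lines may be a reasonable starting point. It is a sketch, not a verified recipe: the output directory, the step interval, and the choice of Weights and Biases as the tracker are illustrative assumptions, and it presumes that trl and the chosen tracker package are installed.

```python
from trl import DPOConfig

# Illustrative values only; adjust paths and intervals to your setup.
training_args = DPOConfig(
    output_dir="./dpo-output",   # hypothetical path
    eval_strategy="steps",       # evaluate periodically during training
    eval_steps=50,               # hypothetical interval
    generate_during_eval=True,   # log policy vs. reference generations at eval time
    report_to="wandb",           # assumption: Weights and Biases is the tracker
)
```

Generation adds time to every evaluation pass, so it may be worth enabling only for runs where qualitative inspection matters.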

Code Reference

Source Location

  • Repository: TRL
  • File (evaluation_loop): trl/trainer/dpo_trainer.py (lines 1942-1996)
  • File (prediction_step): trl/trainer/dpo_trainer.py (lines 1901-1936)
  • File (store_metrics): trl/trainer/dpo_trainer.py (lines 1938-1940)
  • File (log): trl/trainer/dpo_trainer.py (lines 1998-2014)
  • File (script orchestration): trl/scripts/dpo.py (lines 153-164)

Signature

class DPOTrainer(BaseTrainer):

    def evaluation_loop(
        self,
        dataloader: DataLoader,
        description: str,
        prediction_loss_only: bool | None = None,
        ignore_keys: list[str] | None = None,
        metric_key_prefix: str = "eval",
    ) -> EvalLoopOutput:
        """
        Overriding built-in evaluation loop to store metrics for each batch.
        Optionally generates samples from policy and reference models.
        """

    def prediction_step(
        self,
        model: PreTrainedModel | nn.Module,
        inputs: dict[str, torch.Tensor | Any],
        prediction_loss_only: bool,
        ignore_keys: list[str] | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor | None, torch.Tensor | None]:
        """Compute eval loss and metrics for a single batch."""

    def store_metrics(
        self,
        metrics: dict[str, float],
        train_eval: Literal["train", "eval"] = "train",
    ) -> None:
        """Store per-batch metrics for later aggregation."""

    def save_model(
        self, output_dir: str | None = None
    ) -> None:  # inherited from Trainer (signature abbreviated)
        ...

    def push_to_hub(self, **kwargs) -> str:  # inherited from Trainer (signature abbreviated)
        ...

Import

# Methods are accessed on a DPOTrainer instance
from trl import DPOTrainer

# Evaluation output type
from transformers.trainer_utils import EvalLoopOutput

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| trainer (instance) | DPOTrainer | Yes | A trained DPOTrainer instance (after calling .train()) |
| eval_dataset | Dataset or None | No | Evaluation dataset; uses the one provided during initialization if not specified |
| output_dir | str or None | No | Directory to save model weights; defaults to args.output_dir |
| metric_key_prefix | str | No (default: "eval") | Prefix for metric keys in the output dictionary |
Outputs

| Name | Type | Description |
| --- | --- | --- |
| eval_loss | float | Mean DPO loss over the evaluation set |
| eval_rewards/chosen | float | Mean implicit reward for chosen responses |
| eval_rewards/rejected | float | Mean implicit reward for rejected responses |
| eval_rewards/margins | float | Mean reward margin (chosen minus rejected) |
| eval_rewards/accuracies | float | Fraction of samples where chosen reward exceeds rejected reward |
| eval_logps/chosen | float | Mean log probability of chosen responses under the policy |
| eval_logps/rejected | float | Mean log probability of rejected responses under the policy |
| eval_logits/chosen | float | Mean logit values for chosen responses |
| eval_logits/rejected | float | Mean logit values for rejected responses |
| saved model | files on disk | Model weights, tokenizer, config, and model card saved to output_dir |
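To make the reward metrics concrete, the sketch below derives them from per-sample sequence log probabilities. It assumes the standard DPO definition of the implicit reward (beta times the policy log probability minus the reference log probability). The function name dpo_eval_metrics, the beta value, and the input numbers are hypothetical, and TRL itself computes these quantities on tensors rather than Python lists.

```python
def dpo_eval_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Turn per-sample sequence log-probs into averaged DPO-style reward metrics."""
    # Implicit reward: beta * (policy log-prob - reference log-prob), per sample.
    chosen_rewards = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected_rewards = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    # Margin: chosen reward minus rejected reward, per sample.
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    n = len(margins)
    return {
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        "rewards/margins": sum(margins) / n,
        # Accuracy: fraction of pairs where the chosen reward is strictly larger.
        "rewards/accuracies": sum(c > r for c, r in zip(chosen_rewards, rejected_rewards)) / n,
    }


# Made-up log-probs for two preference pairs, used only to exercise the sketch.
metrics = dpo_eval_metrics(
    policy_chosen=[-10.0, -12.0],
    policy_rejected=[-15.0, -9.0],
    ref_chosen=[-11.0, -11.0],
    ref_rejected=[-14.0, -10.0],
    beta=0.5,
)
print(metrics)
```

In this toy input the policy prefers the chosen response in the first pair and the rejected one in the second, so the accuracy is 0.5 and the mean margin is zero.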

Usage Examples

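# Example 1: Evaluate after training and save
# Assumes model, ref_model, training_args, and dataset are defined elsewhere.
from trl import DPOTrainer
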
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

# Train the model
trainer.train()

# Evaluate
metrics = trainer.evaluate()
print(f"Reward margin: {metrics['eval_rewards/margins']:.4f}")
print(f"Reward accuracy: {metrics['eval_rewards/accuracies']:.4f}")

# Log and save metrics
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

# Save the trained model
trainer.save_model("./dpo-trained-model")

# Example 2: Push to the Hugging Face Hub (reuses the trainer from Example 1)
trainer.push_to_hub(
    dataset_name="trl-lib/ultrafeedback_binarized",
)
# Model is now available at https://huggingface.co/{hub_model_id}

# Example 3: Periodic evaluation during training via DPOConfig
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="./dpo-output",
    eval_strategy="steps",
    eval_steps=50,
    per_device_eval_batch_size=4,
    # Metrics are logged every eval_steps
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
# Eval metrics such as eval_rewards/margins are logged every 50 steps

# Example 4: Full DPO script post-training pattern (from trl/scripts/dpo.py)
# Train the model
trainer.train()

if training_args.eval_strategy != "no":
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)

# Save the model, then optionally push to the Hub
trainer.save_model(training_args.output_dir)

if training_args.push_to_hub:
    trainer.push_to_hub(dataset_name=script_args.dataset_name)
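
Once several runs have saved their metrics, comparing configurations can be done offline. The sketch below assumes that trainer.save_metrics("eval", metrics) has written an eval_results.json file into each run's output directory. The helper names load_eval_metrics and rank_by_metric, the run directory names, and the margin values are hypothetical placeholders, not real results.

```python
import json
import tempfile
from pathlib import Path


def load_eval_metrics(run_dirs, filename="eval_results.json"):
    """Read the saved eval metrics of each run directory that has the file."""
    results = {}
    for run_dir in run_dirs:
        path = Path(run_dir) / filename
        if path.is_file():
            results[Path(run_dir).name] = json.loads(path.read_text())
    return results


def rank_by_metric(results, key="eval_rewards/margins"):
    """Return run names sorted from highest to lowest value of the given metric."""
    return sorted(results, key=lambda run: results[run].get(key, float("-inf")), reverse=True)


# Demo with placeholder files; the margins below are dummy values, not measurements.
with tempfile.TemporaryDirectory() as root:
    for name, margin in [("run-a", 0.25), ("run-b", 0.75)]:
        run = Path(root) / name
        run.mkdir()
        (run / "eval_results.json").write_text(json.dumps({"eval_rewards/margins": margin}))
    results = load_eval_metrics(sorted(Path(root).iterdir()))

ranking = rank_by_metric(results)
print(ranking)
```

Runs that lack the metric sort last rather than raising, which keeps the comparison usable when some runs were evaluated with a different metric_key_prefix.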

Related Pages

Implements Principle
