Implementation:Huggingface Trl DPOTrainer Evaluate Save
| Knowledge Sources | |
|---|---|
| Domains | NLP, RLHF |
| Last Updated | 2026-02-06 17:00 GMT |
Overview
Concrete tool for evaluating a DPO-trained model with preference-specific metrics and saving the trained weights, provided by the TRL library.
Description
The DPOTrainer provides an overridden evaluation_loop method that extends the standard Trainer evaluation with DPO-specific functionality:
- Optional generation sampling: When generate_during_eval=True, the evaluation loop selects a random batch from the evaluation dataset and generates completions from both the policy model and the reference model. These generations are logged as a table to the configured experiment tracker (Weights and Biases, Comet ML, or MLflow) for qualitative analysis.
- Standard evaluation: The parent class's evaluation_loop is called, which iterates over the evaluation dataloader and calls prediction_step for each batch. The prediction_step method calls get_batch_loss_metrics with train_eval="eval" to compute the DPO loss and all preference metrics.
- Metric aggregation: The store_metrics method accumulates per-batch metrics, and the log method averages them before sending them to the experiment tracker.
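The accumulate-then-average pattern described in the last bullet can be sketched in plain Python. This is a minimal illustration, not TRL's code: the class name `MetricBuffer` and its `flush` method are hypothetical, and in the real trainer the averaging happens inside `log`.

```python
from collections import defaultdict
from statistics import fmean


class MetricBuffer:
    """Hypothetical stand-in for the trainer's per-batch metric storage."""

    def __init__(self):
        # One bucket per phase; each maps a metric name to its per-batch values.
        self._stored = {"train": defaultdict(list), "eval": defaultdict(list)}

    def store_metrics(self, metrics, train_eval="train"):
        # Append each per-batch value; nothing is averaged yet.
        for key, value in metrics.items():
            self._stored[train_eval][key].append(value)

    def flush(self, train_eval="eval"):
        # Average every metric over the stored batches, then clear the bucket.
        averaged = {k: fmean(v) for k, v in self._stored[train_eval].items()}
        self._stored[train_eval].clear()
        return averaged


buffer = MetricBuffer()
buffer.store_metrics({"eval_rewards/margins": 0.5}, train_eval="eval")
buffer.store_metrics({"eval_rewards/margins": 1.5}, train_eval="eval")
averaged = buffer.flush("eval")
```

In this sketch every batch contributes equally to the mean, so a smaller final batch carries the same weight as a full one.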
The save_model method (inherited from the Transformers Trainer) saves model weights, tokenizer, and training arguments to the output directory. For PEFT models, only the adapter weights are saved. The push_to_hub method uploads the saved model to the Hugging Face Hub. A model card is automatically generated and saved alongside each checkpoint via the overridden _save_checkpoint method.
The DPO training script (trl/scripts/dpo.py) orchestrates the final evaluation and saving sequence: after trainer.train() completes, it runs trainer.evaluate() (if eval_strategy is not "no"), logs and saves eval metrics, saves the model to the output directory, and optionally pushes to the Hub.
Usage
Use the evaluation and saving functionality when:
- Running periodic evaluation during DPO training to monitor alignment quality
- Computing final evaluation metrics after training completes
- Saving the trained model for deployment
- Comparing reward margins and accuracies across different configurations
- Publishing aligned models to the Hugging Face Hub
- Inspecting generated samples from policy vs. reference model
Code Reference
Source Location
- Repository: TRL
- File (evaluation_loop): trl/trainer/dpo_trainer.py (lines 1942-1996)
- File (prediction_step): trl/trainer/dpo_trainer.py (lines 1901-1936)
- File (store_metrics): trl/trainer/dpo_trainer.py (lines 1938-1940)
- File (log): trl/trainer/dpo_trainer.py (lines 1998-2014)
- File (script orchestration): trl/scripts/dpo.py (lines 153-164)
Signature
class DPOTrainer(BaseTrainer):
    def evaluation_loop(
        self,
        dataloader: DataLoader,
        description: str,
        prediction_loss_only: bool | None = None,
        ignore_keys: list[str] | None = None,
        metric_key_prefix: str = "eval",
    ) -> EvalLoopOutput:
        """
        Overriding built-in evaluation loop to store metrics for each batch.
        Optionally generates samples from policy and reference models.
        """

    def prediction_step(
        self,
        model: PreTrainedModel | nn.Module,
        inputs: dict[str, torch.Tensor | Any],
        prediction_loss_only: bool,
        ignore_keys: list[str] | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor | None, torch.Tensor | None]:
        """Compute eval loss and metrics for a single batch."""

    def store_metrics(
        self,
        metrics: dict[str, float],
        train_eval: Literal["train", "eval"] = "train",
    ) -> None:
        """Store per-batch metrics for later aggregation."""

    # Inherited from Trainer
    def save_model(self, output_dir: str | None = None) -> None: ...

    # Inherited from Trainer
    def push_to_hub(self, **kwargs) -> str: ...
Import
# Methods are accessed on a DPOTrainer instance
from trl import DPOTrainer
# Evaluation output type
from transformers.trainer_utils import EvalLoopOutput
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| trainer (instance) | DPOTrainer | Yes | A trained DPOTrainer instance (after calling .train()) |
| eval_dataset | Dataset or None | No | Evaluation dataset; uses the one provided during initialization if not specified |
| output_dir | str or None | No | Directory to save model weights; defaults to args.output_dir |
| metric_key_prefix | str | No (default: "eval") | Prefix for metric keys in the output dictionary |
Outputs
| Name | Type | Description |
|---|---|---|
| eval_loss | float | Mean DPO loss over the evaluation set |
| eval_rewards/chosen | float | Mean implicit reward for chosen responses |
| eval_rewards/rejected | float | Mean implicit reward for rejected responses |
| eval_rewards/margins | float | Mean reward margin (chosen minus rejected) |
| eval_rewards/accuracies | float | Fraction of samples where chosen reward exceeds rejected reward |
| eval_logps/chosen | float | Mean log probability of chosen responses under the policy |
| eval_logps/rejected | float | Mean log probability of rejected responses under the policy |
| eval_logits/chosen | float | Mean logit values for chosen responses |
| eval_logits/rejected | float | Mean logit values for rejected responses |
| saved model | files on disk | Model weights, tokenizer, config, and model card saved to output_dir |
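How the reward metrics in the table relate to one another can be illustrated with a small pure-Python sketch. It assumes the standard DPO definition of the implicit reward, beta times the difference between policy and reference log probabilities. The function name, the beta value, and the log probabilities are made up for illustration, and the trainer itself operates on tensors rather than lists.

```python
def reward_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Compute mean rewards, margin, and accuracy from per-sample log-probs."""
    # Implicit reward (assumed definition): beta * (policy logp - reference logp).
    chosen = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    n = len(chosen)
    return {
        "rewards/chosen": sum(chosen) / n,
        "rewards/rejected": sum(rejected) / n,
        # Margin: chosen reward minus rejected reward, averaged over samples.
        "rewards/margins": sum(c - r for c, r in zip(chosen, rejected)) / n,
        # Accuracy: fraction of samples whose chosen reward is strictly higher.
        "rewards/accuracies": sum(c > r for c, r in zip(chosen, rejected)) / n,
    }


# Illustrative values only: two preference pairs.
metrics = reward_metrics(
    policy_chosen=[-10.0, -12.0],
    policy_rejected=[-15.0, -13.0],
    ref_chosen=[-11.0, -11.0],
    ref_rejected=[-14.0, -12.0],
)
```

In this toy input the second pair has equal chosen and rejected rewards, so it does not count toward accuracy because the comparison is strict.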
Usage Examples
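# Example 1: Evaluate after training and save
# model, ref_model, training_args, and dataset are assumed to be defined already
from trl import DPOTrainer

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)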
# Train the model
trainer.train()
# Evaluate
metrics = trainer.evaluate()
print(f"Reward margin: {metrics['eval_rewards/margins']:.4f}")
print(f"Reward accuracy: {metrics['eval_rewards/accuracies']:.4f}")
# Log and save metrics
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
# Save the trained model
trainer.save_model("./dpo-trained-model")
# Example 2: Push to the Hugging Face Hub
trainer.push_to_hub(
    dataset_name="trl-lib/ultrafeedback_binarized",
)
# Model is now available at https://huggingface.co/{hub_model_id}
# Example 3: Periodic evaluation during training via DPOConfig
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="./dpo-output",
    eval_strategy="steps",
    eval_steps=50,  # Metrics are logged every eval_steps
    per_device_eval_batch_size=4,
)
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
# eval_rewards/margins is logged every 50 steps
# Example 4: Full DPO script post-training pattern (from trl/scripts/dpo.py)
# Train the model
trainer.train()
if training_args.eval_strategy != "no":
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)
# Save and push to Hub
trainer.save_model(training_args.output_dir)
if training_args.push_to_hub:
    trainer.push_to_hub(dataset_name=script_args.dataset_name)
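Once metrics have been saved, they can be read back to compare runs. The sketch below assumes that save_metrics("eval", metrics) writes a JSON file named eval_results.json in the output directory, following the Transformers Trainer convention. The helper name `load_eval_metrics` and the sample values are hypothetical, and the file is written by hand here so the example stays self-contained.

```python
import json
import tempfile
from pathlib import Path


def load_eval_metrics(output_dir, prefix="eval_rewards/"):
    """Return the metrics in eval_results.json whose keys start with prefix."""
    path = Path(output_dir) / "eval_results.json"
    with path.open() as f:
        results = json.load(f)
    return {k: v for k, v in results.items() if k.startswith(prefix)}


# Stand-in for a real run: write a file shaped like the trainer's output.
with tempfile.TemporaryDirectory() as run_dir:
    sample = {
        "eval_loss": 0.61,
        "eval_rewards/margins": 0.42,
        "eval_rewards/accuracies": 0.71,
    }
    (Path(run_dir) / "eval_results.json").write_text(json.dumps(sample))
    selected = load_eval_metrics(run_dir)
```

Calling the helper on the output directory of each configuration gives dictionaries that can be compared key by key.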