# Implementation: OpenRLHF DPOTrainer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Alignment, Training |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
Concrete tool for direct preference optimization training provided by OpenRLHF.
## Description
The `DPOTrainer` class implements the DPO training loop. It concatenates chosen and rejected sequences so that a single forward pass covers both through the policy model and the reference model, computes sequence-level log-probabilities, applies `DPOLoss` (which also supports IPO and label-smoothing variants), and tracks the implicit reward margins between chosen and rejected responses. The reference model stays frozen throughout training.
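The objective described above can be illustrated with a minimal sketch of the sigmoid DPO loss over sequence-level log-probabilities. This is a simplification, not OpenRLHF's exact `DPOLoss` implementation; the function name and signature here are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.01, label_smoothing=0.0):
    """Sigmoid DPO objective over sequence-level log-probs (sketch).

    The implicit rewards are beta-scaled log-ratios against the frozen
    reference model; the loss pushes the chosen-minus-rejected margin up.
    label_smoothing > 0 gives the conservative (cDPO) variant.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    logits = chosen_rewards - rejected_rewards  # implicit reward margin
    loss = (-F.logsigmoid(logits) * (1 - label_smoothing)
            - F.logsigmoid(-logits) * label_smoothing)
    return loss.mean(), chosen_rewards.detach(), rejected_rewards.detach()
```

With `label_smoothing=0` this reduces to the standard DPO objective; the detached reward tensors are what a trainer would log as chosen/rejected implicit rewards.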
## Usage
Instantiate with the policy model, a frozen reference model, an optimizer, preference dataloaders, and a scheduler, then call `fit()`. Used in the DPO Training and Iterative DPO workflows.
## Code Reference
### Source Location
- Repository: OpenRLHF
- File: openrlhf/trainer/dpo_trainer.py
- Lines: L12-371 (class), L32-105 (`__init__`), L106-205 (`fit`)
### Signature
```python
class DPOTrainer(ABC):
    def __init__(
        self,
        model,                        # Actor: policy model to train
        ref_model,                    # Actor: frozen reference model
        strategy,                     # DeepspeedStrategy
        tokenizer,                    # tokenizer for padding
        optim: Optimizer,             # optimizer
        train_dataloader,             # training DataLoader (RewardDataset with is_dpo=True)
        eval_dataloader,              # evaluation DataLoader
        scheduler,                    # learning-rate scheduler
        max_norm=0.5,                 # gradient clipping norm
        beta=0.01,                    # DPO regularization coefficient
        max_epochs: int = 2,          # training epochs
        save_hf_ckpt: bool = False,   # save HF-format checkpoints
        disable_ds_ckpt: bool = False,
    ) -> None: ...

    def fit(self, args, consumed_samples=0, num_update_steps_per_epoch=None):
        """Run the full DPO training loop."""
```
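The chosen/rejected concatenation trick mentioned in the Description can be sketched as follows. The `concatenated_forward` name, the model call signature, and the equal-length assumption for chosen and rejected sequences are all simplifications for illustration, not OpenRLHF's actual internals:

```python
import torch

def concatenated_forward(model, chosen_ids, rejected_ids, chosen_mask, rejected_mask):
    """One forward pass over chosen+rejected by stacking along the batch dim (sketch)."""
    input_ids = torch.cat([chosen_ids, rejected_ids], dim=0)
    mask = torch.cat([chosen_mask, rejected_mask], dim=0)
    logits = model(input_ids, attention_mask=mask)  # (2B, T, V)
    # sequence-level log-prob: sum the token log-probs of the shifted labels
    token_logps = torch.gather(
        logits.log_softmax(-1)[:, :-1], 2, input_ids[:, 1:].unsqueeze(-1)
    ).squeeze(-1)
    seq_logps = (token_logps * mask[:, 1:]).sum(-1)
    B = chosen_ids.size(0)
    return seq_logps[:B], seq_logps[B:]  # chosen, rejected
```

Stacking the two halves into one batch means both the policy and the frozen reference model are each invoked once per step instead of twice.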
### Import
```python
from openrlhf.trainer import DPOTrainer
```
## I/O Contract
### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | Actor | Yes | Policy model to align |
| ref_model | Actor | Yes | Frozen reference model |
| beta | float | No | DPO coefficient (default 0.01) |
| train_dataloader | DataLoader | Yes | Preference data (RewardDataset is_dpo=True) |
### Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | Policy model aligned in-place |
| logs | Dict | DPO loss, accuracy, chosen/rejected rewards |
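A hypothetical sketch of how the logged quantities in the table above can be derived from the implicit rewards; the helper name and dictionary keys are assumptions, and `acc` follows the usual DPO convention of counting pairs where the chosen implicit reward beats the rejected one:

```python
import torch

def dpo_metrics(chosen_rewards, rejected_rewards, loss):
    """Assemble the per-step log dict from implicit rewards and loss (sketch)."""
    # accuracy: fraction of preference pairs the implicit reward ranks correctly
    acc = (chosen_rewards > rejected_rewards).float().mean()
    return {
        "loss": loss.item(),
        "acc": acc.item(),
        "chosen_reward": chosen_rewards.mean().item(),
        "rejected_reward": rejected_rewards.mean().item(),
        "reward_margin": (chosen_rewards - rejected_rewards).mean().item(),
    }
```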
## Usage Examples
```python
from openrlhf.trainer import DPOTrainer

trainer = DPOTrainer(
    model=policy_model,
    ref_model=ref_model,
    strategy=strategy,
    tokenizer=tokenizer,
    optim=optimizer,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    scheduler=scheduler,
    beta=args.beta,
    max_epochs=args.max_epochs,
)
trainer.fit(args, num_update_steps_per_epoch=num_update_steps_per_epoch)
```