Workflow: OpenRLHF DPO Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RLHF, DPO, Alignment |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
End-to-end process for aligning a language model using Direct Preference Optimization (DPO) on human preference pairs without training a separate reward model.
Description
This workflow implements offline preference alignment using DPO, which directly optimizes the policy model on preference pairs without requiring a separate reward model or RL training loop. It loads both a trainable policy model and a frozen reference model, then trains the policy to increase the likelihood gap between chosen and rejected responses relative to the reference model. Variants include IPO (Identity Preference Optimization) and cDPO (with label smoothing). The approach is simpler and more stable than PPO but operates on a fixed offline dataset.
Usage
Execute this workflow when you have a preference dataset (chosen/rejected pairs) and want to align your model without the complexity of training a reward model and running PPO. DPO is suitable when you have high-quality static preference data and do not need online data generation. It is simpler to implement and tune than PPO, though it may be less effective when iterating on data.
Execution Steps
Step 1: Configure distributed strategy
Initialize the DeepSpeed training strategy with ZeRO-3 parallelism. Configure precision settings and gradient accumulation for handling the memory requirements of loading two models simultaneously (policy and reference).
Key considerations:
- Two full models must fit in memory (policy + reference)
- Reference model can be offloaded to CPU to reduce GPU memory pressure
- ZeRO-3 shards both models across GPUs
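The strategy configuration above can be sketched as a DeepSpeed-style config dict. This is a minimal illustration, not the exact config OpenRLHF generates; the key names follow DeepSpeed's documented JSON schema, and the values (micro-batch size, accumulation steps, persistence threshold) are assumptions for the example.

```python
# Sketch of a DeepSpeed ZeRO-3 config for DPO training, where two full
# models (policy + reference) must fit in memory. Values are illustrative.
def make_zero3_config(micro_batch: int = 1, accum_steps: int = 8,
                      offload_params: bool = True) -> dict:
    cfg = {
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": accum_steps,
        "bf16": {"enabled": True},          # mixed precision
        "gradient_clipping": 1.0,
        "zero_optimization": {
            "stage": 3,                     # shard params, grads, optimizer states
            "stage3_param_persistence_threshold": 10_000,
        },
    }
    if offload_params:
        # Offloading parameters to CPU trades GPU memory for PCIe traffic;
        # useful when the frozen reference model pushes memory past the limit.
        cfg["zero_optimization"]["offload_param"] = {"device": "cpu",
                                                     "pin_memory": True}
    return cfg

config = make_zero3_config()
```

The effective batch size per GPU is `train_micro_batch_size_per_gpu * gradient_accumulation_steps`, which is the main lever for fitting both models alongside activations.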
Step 2: Load policy and reference models
Load the policy model (the model being trained) and the reference model (a frozen copy for computing the implicit reward). Both are typically initialized from the same SFT checkpoint. The reference model is set to evaluation mode and its gradients are disabled.
Key considerations:
- Both models start from the same SFT checkpoint
- The reference model remains frozen throughout training
- CPU offloading of the reference model is recommended for memory efficiency
- LoRA can be applied to the policy model for parameter-efficient training
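The policy/reference setup reduces to: load the same weights twice, then freeze one copy. A minimal sketch using a toy module as a stand-in for the SFT checkpoint (a real run loads the pretrained model through the framework's loader):

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in for the SFT checkpoint; in practice both models are
# initialized from the same pretrained weights.
policy = nn.Linear(16, 16)

# Reference model: identical weights, frozen, eval mode (no dropout,
# no gradient tracking, never updated during training).
ref = copy.deepcopy(policy)
ref.eval()
for p in ref.parameters():
    p.requires_grad_(False)
```

Freezing via `requires_grad_(False)` means the reference model contributes no gradients or optimizer state, which is why CPU-offloading it is cheap.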
Step 3: Prepare preference dataset
Load the preference dataset containing chosen and rejected response pairs. Tokenize both responses using the model tokenizer with appropriate chat templates. The dataset is created with the DPO flag to handle paired input formatting.
Key considerations:
- The dataset must contain matched pairs of chosen and rejected responses per prompt
- Chat templates must match the model family
- Maximum sequence length should accommodate the longest responses
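One preference example can be sketched as below. The whitespace `tokenize` function and the field names are stand-ins for illustration; real code uses the model tokenizer with its chat template applied to the prompt.

```python
def build_pair(prompt, chosen, rejected, tokenize=str.split, max_len=512):
    """Tokenize one preference pair. Both responses share the same prompt;
    `tokenize` here is a whitespace stand-in for a real tokenizer."""
    prompt_ids = tokenize(prompt)
    chosen_ids = (prompt_ids + tokenize(chosen))[:max_len]
    rejected_ids = (prompt_ids + tokenize(rejected))[:max_len]
    # The DPO loss is computed only on response tokens, so the prompt
    # length is recorded to mask out prompt positions.
    return {"chosen": chosen_ids,
            "rejected": rejected_ids,
            "prompt_len": len(prompt_ids)}

pair = build_pair("Explain DPO .",
                  "It optimizes preferences directly .",
                  "I do not know .")
```

Keeping the chosen and rejected sequences in the same example (rather than separate rows) is what makes the paired formatting flag necessary.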
Step 4: Set up optimizer and scheduler
Configure the optimizer with a learning rate tuned for DPO (typically lower than SFT, e.g., 5e-7). Set up cosine learning rate scheduling with warmup.
Key considerations:
- DPO typically uses lower learning rates than SFT
- The beta parameter controls the strength of the KL constraint (default 0.1)
- Label smoothing can improve robustness to noisy preference labels
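The cosine-with-warmup schedule can be written down directly. This is a generic sketch of the schedule shape, not OpenRLHF's exact scheduler; the warmup ratio and minimum LR are assumed values.

```python
import math

def lr_at(step, total_steps, max_lr=5e-7, warmup_ratio=0.03, min_lr=0.0):
    """Linear warmup followed by cosine decay. max_lr=5e-7 reflects the
    lower learning rates typical for DPO relative to SFT."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 toward max_lr over the warmup window.
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Cosine decay from max_lr down to min_lr.
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The beta parameter from the considerations above does not appear in the schedule; it enters the loss itself (Step 5) as the scale on the implicit reward margin.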
Step 5: Train with DPO objective
Execute the DPO training loop. For each batch, compute log-probabilities of chosen and rejected responses under both the policy and reference models. Calculate the DPO loss that pushes the policy to prefer chosen over rejected responses relative to the reference baseline. Optionally add an NLL auxiliary loss on chosen responses to prevent degeneration.
Key considerations:
- The DPO loss implicitly defines a reward through the log-probability ratio
- IPO variant replaces the sigmoid log-loss with a squared loss on the implicit reward margin, yielding more conservative updates
- The NLL loss coefficient prevents the model from degrading on general generation quality
- Monitor both loss convergence and chosen/rejected accuracy
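The per-example loss described above can be sketched from the four sequence log-probabilities. This is the standard DPO/cDPO/IPO formulation in pure Python; the function name and signature are illustrative, not OpenRLHF's API.

```python
import math

def preference_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                    beta=0.1, label_smoothing=0.0, ipo=False):
    """Loss from sequence log-probs under the policy (pi_*) and the frozen
    reference (ref_*). h is the implicit reward margin: the log-prob gap
    between chosen and rejected, measured relative to the reference."""
    h = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)

    if ipo:
        # IPO: squared loss pulling the margin toward 1/(2*beta).
        return (h - 1.0 / (2.0 * beta)) ** 2

    def log_sigmoid(x):  # numerically stable log(sigmoid(x))
        return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

    # DPO: -log sigmoid(beta * h). cDPO mixes in the flipped-label term
    # with weight label_smoothing to tolerate noisy preference labels.
    return (-(1 - label_smoothing) * log_sigmoid(beta * h)
            - label_smoothing * log_sigmoid(-beta * h))
```

An optional NLL auxiliary term on the chosen response would be added to this loss with a small coefficient; it is omitted here for brevity.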
Step 6: Save aligned model
Save the trained policy model weights and tokenizer. For LoRA training, save only the adapter weights.
Key considerations:
- Only the policy model is saved (reference model is not modified)
- The aligned model can be used directly for inference or as the starting point for iterative DPO
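The save step amounts to persisting only the policy's weights. A minimal sketch with a toy module and a temporary directory (a real run saves the full model plus tokenizer via the framework's save utilities; the filename here is illustrative):

```python
import os
import tempfile
import torch
import torch.nn as nn

# Toy stand-in for the trained DPO policy.
policy = nn.Linear(8, 8)

save_dir = tempfile.mkdtemp()
path = os.path.join(save_dir, "policy.pt")

# Only the policy's weights are persisted; the frozen reference model
# was never modified and is simply discarded.
torch.save(policy.state_dict(), path)

# Reload into a fresh module to verify the round trip.
reloaded = nn.Linear(8, 8)
reloaded.load_state_dict(torch.load(path))
```

For LoRA training, the same pattern applies to the adapter's state dict alone, which is typically a few orders of magnitude smaller than the base model.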