Principle: AllenAI open-instruct DPO Training
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Distributed Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
DPO training is the end-to-end workflow for optimizing a language model's policy to align with human preferences using Direct Preference Optimization, without requiring a separately trained reward model.
Description
The DPO training loop implements a preference-based fine-tuning pipeline that takes a pre-trained language model and a dataset of preference pairs (chosen vs. rejected responses), and produces an aligned model. Unlike RLHF, which trains a reward model and then uses reinforcement learning (e.g., PPO) to optimize the policy, DPO directly optimizes the policy using a closed-form loss derived from the preference data.
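Concretely, the closed-form loss for standard DPO depends on just four log-probabilities per preference pair. A minimal sketch (function and argument names are illustrative, and β defaults to a typical value; this is not the repository's implementation):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log sigmoid(beta * [(log pi - log pi_ref)(chosen)
                                - (log pi - log pi_ref)(rejected)])
    Also returns the implicit rewards that the training loop tracks.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    logits = chosen_reward - rejected_reward
    # -log(sigmoid(x)) == softplus(-x); guard against overflow for large -x
    loss = math.log1p(math.exp(-logits)) if logits > -30 else -logits
    return loss, chosen_reward, rejected_reward
```

When policy and reference agree, the loss equals log 2; it falls as the policy assigns relatively more probability mass to the chosen response than the reference does.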
The training workflow consists of the following stages:
1. Initialization:
- Set up the distributed training environment using HuggingFace Accelerate with optional DeepSpeed ZeRO optimization (stages 0-3).
- Load the pre-trained model and tokenizer.
- Configure LoRA adapters if parameter-efficient fine-tuning is desired.
- Prepare the preference dataset via tokenization and filtering.
2. Reference Logprob Caching (for DPO, DPO-Norm, WPO):
- Before training begins, compute the reference model's log-probabilities for all training examples.
- Cache these values to disk and GPU memory for reuse throughout training.
- This step is skipped for SimPO, which does not require a reference model.
3. Training Loop:
- For each batch, run the policy model forward on both chosen and rejected responses (either concatenated for efficiency or separately for memory savings).
- Compute the DPO loss variant specified by the configuration (standard DPO, DPO-Norm, SimPO, or WPO).
- Backpropagate the loss, clip the gradient norm (if configured), and step the optimizer and learning-rate scheduler.
- Track metrics including loss, implicit rewards, reward accuracy, and reward margin.
4. Checkpointing and Logging:
- Save model checkpoints at configured intervals (step-based or epoch-based).
- Log training metrics to Weights & Biases for experiment tracking.
- Support resumption from checkpoints for fault tolerance.
5. Post-Training:
- Save the final model and tokenizer.
- Optionally push to HuggingFace Hub and launch evaluation jobs.
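Stage 2 above can be sketched independently of any framework: a single pass over the dataloader with the frozen reference model, storing per-example log-probabilities for lookup during training. Here ref_forward stands in for the reference model's forward pass, and the batch layout is an assumption, not the repository's format:

```python
def build_reference_logprobs_cache(ref_forward, dataloader):
    """One-off pass with the frozen reference model.

    ref_forward: callable mapping a batch to parallel lists of per-sequence
    log-probabilities (chosen, rejected). Returns a dict keyed by example
    index, so reference values are looked up, not recomputed, each epoch.
    """
    cache = {}
    for batch in dataloader:
        chosen_logps, rejected_logps = ref_forward(batch)
        for i, idx in enumerate(batch["index"]):
            cache[idx] = (chosen_logps[i], rejected_logps[i])
    return cache
```

Because the cache replaces live reference-model forwards, the reference model can be freed after this stage, releasing its GPU memory for the rest of training.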
Usage
Use the DPO training loop when:
- You want to align a language model using preference data.
- You prefer a simpler training pipeline compared to RLHF (no separate reward model training).
- You need distributed training across multiple GPUs or nodes.
- You want to experiment with different DPO loss variants (standard, normalized, SimPO, WPO).
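The variants listed above differ mainly in the margin fed into −log σ(·). A hedged sketch of those margins (formulas follow the DPO and SimPO papers; dpo_norm is assumed here to denote length-normalized log-ratios, and WPO, which reweights the standard DPO loss by policy probabilities, is omitted):

```python
def preference_margin(variant, pi_w, pi_l, ref_w=0.0, ref_l=0.0,
                      len_w=1, len_l=1, beta=0.1, gamma=0.5):
    """Margin fed into -log(sigmoid(margin)) for each loss variant.

    pi_*/ref_*: policy/reference log-probs of chosen (w) and rejected (l)
    responses; len_*: response lengths in tokens. Names and default
    hyperparameters are illustrative, not the repository's.
    """
    if variant == "dpo":
        return beta * ((pi_w - ref_w) - (pi_l - ref_l))
    if variant == "dpo_norm":
        # assumption: "norm" means length-normalized log-ratios
        return beta * ((pi_w - ref_w) / len_w - (pi_l - ref_l) / len_l)
    if variant == "simpo":
        # reference-free: average policy log-prob minus a target margin gamma
        return beta * (pi_w / len_w - pi_l / len_l) - gamma
    raise ValueError(f"unknown variant: {variant}")
```

Note that only simpo drops the reference terms entirely, which is why the reference-caching stage is skipped for it.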
Theoretical Basis
The DPO training procedure minimizes the expected pairwise preference loss over a dataset D of prompts x with chosen responses y_w and rejected responses y_l:

L(θ) = E_{(x, y_w, y_l) ~ D} [ ℓ(x, y_w, y_l; θ) ]

where ℓ is the chosen loss variant; for standard DPO,

ℓ_DPO = −log σ( β [ (log π_θ(y_w|x) − log π_ref(y_w|x)) − (log π_θ(y_l|x) − log π_ref(y_l|x)) ] ).

The training loop implements stochastic gradient descent on this objective:
# Pre-training: cache reference logprobs (if needed)
reference_cache = None
if loss_type in ("dpo", "dpo_norm", "wpo"):  # SimPO is reference-free
    reference_cache = build_reference_logprobs_cache(reference_model, dataloader)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward pass: compute policy logprobs for chosen and rejected responses
        chosen_logps, rejected_logps = forward(model, batch)

        # Compute loss (dispatches to the configured variant)
        loss, chosen_rewards, rejected_rewards = compute_loss(
            args, batch, chosen_logps, rejected_logps, reference_cache)

        # Backward pass and optimization
        loss.backward()
        if max_grad_norm > 0:
            clip_grad_norm(model.parameters(), max_grad_norm)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        # Log metrics
        log(loss, chosen_rewards, rejected_rewards)

    # Checkpoint
    save_checkpoint(model, optimizer, lr_scheduler)
The forward pass can be done in two modes:
- Concatenated forward: Chosen and rejected inputs are concatenated into a single batch and processed in one model forward pass. This is more efficient for FSDP-based distributed training.
- Separate forward: Chosen and rejected inputs are processed in two separate forward passes. This uses less peak memory but is slower.
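The two modes can be sketched with a toy forward function. Both return identical per-sequence log-probabilities; only the number of model calls and the peak activation memory differ. Here model_fn and the batch keys are placeholders, not the repository's API:

```python
import torch
import torch.nn.functional as F

def sequence_logprobs(logits, input_ids, mask):
    """Sum of log p(token_t | tokens_<t) over unmasked positions."""
    logp = F.log_softmax(logits[:, :-1, :], dim=-1)
    tok_logp = torch.gather(logp, 2, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (tok_logp * mask[:, 1:]).sum(dim=-1)

def concatenated_forward(model_fn, batch):
    """One forward over [chosen; rejected], then split: fewer, larger calls."""
    n = batch["chosen_ids"].shape[0]
    ids = torch.cat([batch["chosen_ids"], batch["rejected_ids"]], dim=0)
    mask = torch.cat([batch["chosen_mask"], batch["rejected_mask"]], dim=0)
    logps = sequence_logprobs(model_fn(ids, mask), ids, mask)
    return logps[:n], logps[n:]

def separate_forward(model_fn, batch):
    """Two smaller forwards: lower peak memory, same numerical result."""
    ids_c, m_c = batch["chosen_ids"], batch["chosen_mask"]
    ids_r, m_r = batch["rejected_ids"], batch["rejected_mask"]
    return (sequence_logprobs(model_fn(ids_c, m_c), ids_c, m_c),
            sequence_logprobs(model_fn(ids_r, m_r), ids_r, m_r))
```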