Principle:NVIDIA NeMo Aligner DPO Training
| Principle: DPO Training | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | NLP, Alignment |
| Related | Implementation:NVIDIA_NeMo_Aligner_DPOTrainer_Fit |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A Direct Preference Optimization (DPO) training loop that aligns language models using preference pairs, without training a separate reward model.
Description
DPO reformulates the RLHF objective as a supervised classification problem on preference pairs. Instead of training a separate reward model and running online RL, DPO directly optimizes the policy using the implicit reward defined by the log-probability ratio between the current policy and a reference policy.
The training loop:
- Maintains a frozen reference policy (either as CPU-stored weights or the initial PEFT adapter state).
- Computes log probabilities for both chosen and rejected responses under both the current and reference policies.
- Optimizes the DPO loss (or one of its variants) to increase the relative probability of chosen over rejected responses.
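The log-probability step above can be sketched as follows. This is an illustrative, pure-Python version (not NeMo Aligner's API): per-token log-probabilities are read off a log-softmax over the logits, prompt positions are masked out, and the response positions are summed into a sequence log-probability.

```python
import math

def sequence_logprob(logits, target_ids, response_mask):
    """Sum of per-token log-probabilities of the target tokens,
    counting only response positions (prompt tokens are masked out).
    Illustrative sketch; real implementations operate on batched tensors."""
    total = 0.0
    for step_logits, tok, is_response in zip(logits, target_ids, response_mask):
        # Stable log-softmax normalizer over the vocabulary at this position
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        if is_response:
            total += step_logits[tok] - log_z
    return total

# Toy example: 3 positions over a vocabulary of 4; the last two
# positions belong to the response, the first to the prompt.
logits = [[2.0, 0.5, 0.1, -1.0],
          [0.2, 1.5, 0.3, 0.0],
          [1.0, 1.0, 3.0, 0.5]]
target_ids = [0, 1, 2]
mask = [False, True, True]
lp = sequence_logprob(logits, target_ids, mask)
```

The same function is applied four times per batch: chosen and rejected responses, each under the current policy and the frozen reference policy.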
NeMo Aligner supports multiple DPO loss variants:
- Standard DPO -- binary cross-entropy on the reward margin
- IPO (Identity Preference Optimization) -- squared loss variant
- RPO (Relative Preference Optimization) -- forward KL, backward KL, and squared variants
- Auxiliary SFT loss -- optional supervised fine-tuning loss on chosen responses, weighted by a configurable coefficient
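A minimal scalar sketch of the DPO and IPO variants and the auxiliary SFT term, assuming the inputs are already the policy-vs-reference log-ratios log(pi/pi_ref); the function name and signature are illustrative, not NeMo Aligner's:

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def preference_loss(chosen_reward, rejected_reward, beta=0.1,
                    loss_type="dpo", sft_loss=0.0, sft_weight=0.0):
    """chosen_reward / rejected_reward: log(pi_theta/pi_ref) for the
    chosen and rejected responses. Illustrative sketch."""
    margin = chosen_reward - rejected_reward
    if loss_type == "dpo":
        # Binary cross-entropy on the reward margin
        loss = -log_sigmoid(beta * margin)
    elif loss_type == "ipo":
        # Squared loss pulling the margin toward 1/(2*beta)
        loss = (margin - 1.0 / (2.0 * beta)) ** 2
    else:
        raise ValueError(f"unknown loss_type: {loss_type}")
    # Optional auxiliary SFT term on the chosen response
    return loss + sft_weight * sft_loss
```

With a zero margin the standard DPO loss is log 2, the usual cross-entropy value for a 50/50 prediction; the IPO loss reaches zero exactly when the margin equals 1/(2*beta).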
Usage
Use for preference-based alignment when you want a simpler alternative to PPO that does not require online generation or a reward model server.
- DPO training is typically more stable and computationally cheaper than PPO.
- Requires pre-collected preference data (no online generation).
- Supports LoRA/PEFT for parameter-efficient training.
- No need for critic server, reward model server, or online rollout infrastructure.
- Trade-off: cannot adapt to distribution shift during training (offline method).
Theoretical Basis
DPO loss (standard):
L_DPO = -log sigma( beta * ( log( pi_theta(y_w|x) / pi_ref(y_w|x) )
                           - log( pi_theta(y_l|x) / pi_ref(y_l|x) ) ) )
where:
y_w = chosen (winning) response
y_l = rejected (losing) response
beta = temperature parameter controlling deviation from reference
sigma = sigmoid function
IPO loss (Identity Preference Optimization):
L_IPO = ( log( pi_theta(y_w|x) / pi_ref(y_w|x) )
        - log( pi_theta(y_l|x) / pi_ref(y_l|x) )
        - 1 / (2 * beta) )^2
The reference policy pi_ref is the initial model before training. The implicit reward is:
r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ) + beta * log Z(x)
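The partition term beta * log Z(x) depends only on the prompt, so it cancels when rewards for two responses to the same prompt are compared; this is why the loss never needs Z(x). A small numeric check (all values illustrative):

```python
beta = 0.1
log_z = 2.3  # arbitrary log-partition value for one prompt

def implicit_reward(logp_policy, logp_ref):
    # r(x, y) = beta * log(pi_theta/pi_ref) + beta * log Z(x)
    return beta * (logp_policy - logp_ref) + beta * log_z

r_chosen = implicit_reward(-12.0, -14.0)
r_rejected = implicit_reward(-15.0, -13.0)
margin = r_chosen - r_rejected

# The same margin computed without Z(x): the beta*log Z(x) terms cancel.
margin_no_z = beta * ((-12.0 - (-14.0)) - (-15.0 - (-13.0)))
```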
Pseudo-code
FUNCTION dpo_training_loop(model, ref_policy, dataloader, config):
    FOR each batch in dataloader:
        chosen_input, rejected_input = batch

        # Compute log probs under current policy
        chosen_logprobs = model.forward(chosen_input)
        rejected_logprobs = model.forward(rejected_input)

        # Compute log probs under frozen reference policy
        ref_chosen_logprobs = ref_policy.forward(chosen_input)
        ref_rejected_logprobs = ref_policy.forward(rejected_input)

        # Implicit rewards: log-probability ratios vs. the reference
        chosen_reward = chosen_logprobs - ref_chosen_logprobs
        rejected_reward = rejected_logprobs - ref_rejected_logprobs

        # Preference loss on the reward margin
        IF config.loss_type == "dpo":
            loss = -log_sigmoid(config.beta * (chosen_reward - rejected_reward))
        ELSE IF config.loss_type == "ipo":
            loss = (chosen_reward - rejected_reward - 1 / (2 * config.beta))^2

        # Optional auxiliary SFT loss on the chosen response
        IF config.sft_loss_weight > 0:
            loss = loss + config.sft_loss_weight * sft_loss(chosen_input)

        update_model(model, loss)
    RETURN model
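To see why the gradient of the DPO loss pushes the margin in the right direction, here is a tiny runnable analogue of the loop above, assuming (for illustration only) a single scalar parameter delta that shifts the chosen log-ratio up and the rejected one down:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

beta, lr = 0.1, 1.0
delta = 0.0                       # toy "policy" parameter
chosen_ratio, rejected_ratio = 0.0, 0.0  # initial log(pi/pi_ref) values

for _ in range(200):
    margin = (chosen_ratio + delta) - (rejected_ratio - delta)
    # For L = -log(sigmoid(beta * margin)): dL/dmargin = -beta * sigmoid(-beta * margin),
    # and dmargin/ddelta = 2, so:
    grad = -sigmoid(-beta * margin) * beta * 2
    delta -= lr * grad            # gradient descent step
```

The gradient magnitude is proportional to sigmoid(-beta * margin): updates are large while the model still confuses chosen and rejected, and decay smoothly as the margin grows, which is one source of DPO's training stability.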
Related Pages
- Implementation:NVIDIA_NeMo_Aligner_DPOTrainer_Fit
- Heuristic:NVIDIA_NeMo_Aligner_Higher_Stability_Log_Probs
- Heuristic:NVIDIA_NeMo_Aligner_DPO_Sequence_Packing_Tips