Principle:NVIDIA NeMo Aligner DPO Reference Policy Management
| Principle: DPO Reference Policy Management | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | NLP, Alignment |
| Related | Implementation:NVIDIA_NeMo_Aligner_Retrieve_Model_State_Dict |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for maintaining a frozen copy of the initial model weights as the reference policy for KL-constrained preference optimization.
Description
DPO and related algorithms require a reference policy pi_ref to compute the KL divergence term that prevents the trained policy from deviating too far from the pretrained model.
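To make the role of pi_ref concrete, here is a minimal sketch of the per-example DPO loss from the original DPO formulation, where the reference log probabilities enter as the baselines of the two log-ratios. The function name and inputs are illustrative, not NeMo Aligner's API:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    logp_w, logp_l         : policy log probs of the chosen (w) / rejected (l) response
    ref_logp_w, ref_logp_l : reference-policy log probs of the same responses
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy prefers the chosen response more strongly than the reference
# does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
```

Note that every loss evaluation needs reference log probabilities, which is why the two strategies below exist.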
In NeMo Aligner, the reference policy is implemented through two strategies depending on the training mode:
Full-parameter training:
- The model's full state dict is copied to CPU memory before training begins.
- During each forward pass, the reference weights are temporarily swapped into the model to compute reference log probabilities.
- After reference computation, the training weights are restored.
- This approach avoids maintaining two GPU copies of the model at the cost of CPU memory and swap overhead.
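The snapshot-and-swap cycle above can be sketched with plain Python dicts standing in for GPU state dicts (in the real implementation these are tensors and the copy would use `.clone().to("cpu")`; all helper names here are illustrative):

```python
import copy

def snapshot_to_cpu(state_dict):
    """Deep-copy the initial weights so later training steps cannot mutate them."""
    return {name: copy.deepcopy(w) for name, w in state_dict.items()}

def forward_with_reference(model_state, ref_state, forward_fn, inputs):
    """Swap in frozen reference weights, run the forward pass, then restore."""
    live = snapshot_to_cpu(model_state)   # save current training weights
    model_state.update(ref_state)         # load reference weights
    out = forward_fn(model_state, inputs)
    model_state.update(live)              # restore training weights
    return out

model = {"w": [1.0, 2.0]}
ref = snapshot_to_cpu(model)              # frozen at "initialization"
model["w"] = [5.0, 6.0]                   # simulate training updates

dot = lambda state, x: sum(a * b for a, b in zip(state["w"], x))
ref_out = forward_with_reference(model, ref, dot, [1.0, 1.0])  # uses the frozen [1, 2]
```

The key invariant is that after the reference forward pass, the live training weights are back in place and the snapshot is untouched.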
PEFT/LoRA training:
- The reference policy is implicitly the base model (adapter weights set to zero).
- Reference log probabilities are computed by disabling the adapter weights during the forward pass.
- This eliminates the need for a CPU copy entirely, as the frozen base model weights are the reference.
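Why disabling the adapter recovers the reference exactly: a LoRA layer computes with an effective weight of base plus a low-rank delta, so zeroing out the delta is mathematically identical to running the frozen base model. A toy scalar version (class and field names are hypothetical, not NeMo's):

```python
# Toy LoRA layer on scalars: effective weight = base + scale * (b * a).
# Turning the adapter off recovers the base model exactly, so the
# reference forward pass needs no weight copy at all.

class LoraScalar:
    def __init__(self, base, a=0.0, b=0.0, scale=1.0):
        self.base, self.a, self.b, self.scale = base, a, b, scale
        self.adapter_enabled = True

    def forward(self, x):
        w = self.base
        if self.adapter_enabled:
            w += self.scale * self.b * self.a  # low-rank LoRA delta
        return w * x

layer = LoraScalar(base=2.0, a=0.5, b=4.0, scale=1.0)
policy_out = layer.forward(3.0)      # (base + delta) * x = (2 + 2) * 3

layer.adapter_enabled = False        # reference = frozen base model
reference_out = layer.forward(3.0)   # base * x = 2 * 3
layer.adapter_enabled = True         # back to training behavior
```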
Usage
Use when training with DPO, IPO, or RPO, all of which require reference log probabilities at every step.
- For full-parameter training, the CPU state dict copy is created automatically at initialization.
- For PEFT training, reference log probs are computed with adapter weights disabled -- no extra memory required.
- Memory requirement: full-parameter DPO effectively doubles the model memory footprint due to the CPU reference copy.
- The weight swap operation adds latency to each training step proportional to the model size.
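A back-of-the-envelope estimate of the CPU-side cost, assuming 2 bytes per parameter (bf16/fp16) and a 7B-parameter model as an illustrative size:

```python
# Host-RAM cost of the frozen CPU reference snapshot (assumption: bf16
# weights at 2 bytes/parameter; 7B parameters chosen as an example).
params = 7e9
bytes_per_param = 2
cpu_copy_gib = params * bytes_per_param / 1024**3
# roughly 13 GiB of host RAM for the reference copy alone
```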
Theoretical Basis
The reference policy ensures optimization stays close to the pretrained model. The DPO loss implicitly defines a reward function:
r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ) + beta * log Z(x)
Without the reference policy, the model could collapse to degenerate solutions -- for example, assigning all probability mass to a single response regardless of the prompt, or exploiting spurious patterns in the preference data.
The KL constraint ensures:
KL(pi_theta || pi_ref) remains bounded
Equivalent to constraining the implicit reward:
|r(x, y)| = |beta * log( pi_theta(y|x) / pi_ref(y|x) )| stays moderate
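To see numerically what "bounded KL" buys, compare a policy that drifts slightly from pi_ref against a collapsed one that dumps nearly all mass on a single response; the distributions below are made-up categorical examples:

```python
import math

def kl(p, q):
    """KL(p || q) for two categorical distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pi_ref = [0.5, 0.3, 0.2]
pi_close = [0.45, 0.35, 0.2]       # small drift from the reference
pi_collapsed = [0.98, 0.01, 0.01]  # degenerate: near-all mass on one response

kl_close = kl(pi_close, pi_ref)
kl_collapsed = kl(pi_collapsed, pi_ref)
```

The collapsed policy incurs a far larger KL penalty, which is exactly the degenerate behavior the reference policy penalizes.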
For PEFT training, the reference is exact (base model weights are literally unchanged), while for full-parameter training the reference is a frozen snapshot from initialization.
Pseudo-code
FUNCTION initialize_reference_policy(model, is_peft):
    IF is_peft:
        # No copy needed; the frozen base model IS the reference
        RETURN None
    ELSE:
        # Copy the full state dict to CPU memory
        ref_state_dict = {}
        FOR each (name, param) in model.state_dict().items():
            ref_state_dict[name] = param.detach().clone().to("cpu")
        RETURN ref_state_dict

FUNCTION compute_reference_log_probs(model, ref_state_dict, is_peft, inputs):
    IF is_peft:
        # Disable adapters to recover base-model behavior
        disable_adapter_weights(model)
        ref_logprobs = model.forward(inputs)
        enable_adapter_weights(model)
    ELSE:
        # Swap in the frozen reference weights
        current_state = save_current_weights(model)
        load_state_dict(model, ref_state_dict)
        ref_logprobs = model.forward(inputs)
        load_state_dict(model, current_state)
    RETURN ref_logprobs
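The pseudo-code can be exercised end to end as a toy sketch, with a dict-of-floats "model" in place of a real network; every helper and field name below is illustrative, not NeMo Aligner's actual API:

```python
import copy

def initialize_reference_policy(model, is_peft):
    if is_peft:
        return None  # the frozen base model IS the reference
    return {k: copy.deepcopy(v) for k, v in model["weights"].items()}

def forward(model, x):
    """Toy forward pass: one scalar weight plus an optional adapter delta."""
    w = model["weights"]["w"]
    if model.get("adapter_on"):
        w += model["weights"].get("lora", 0.0)
    return w * x

def compute_reference_log_probs(model, ref_state_dict, is_peft, x):
    if is_peft:
        model["adapter_on"] = False            # disable adapters -> base behavior
        out = forward(model, x)
        model["adapter_on"] = True
    else:
        current = copy.deepcopy(model["weights"])
        model["weights"] = copy.deepcopy(ref_state_dict)  # swap in reference
        out = forward(model, x)
        model["weights"] = current                         # restore training weights
    return out

# Full-parameter path: the reference sees the initial weight (2.0), not the trained one.
model = {"weights": {"w": 2.0}, "adapter_on": False}
ref = initialize_reference_policy(model, is_peft=False)
model["weights"]["w"] = 5.0                    # simulate training updates
ref_out = compute_reference_log_probs(model, ref, False, 3.0)
```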