
Principle:NVIDIA NeMo Aligner DPO Reference Policy Management

From Leeroopedia


Principle: DPO Reference Policy Management
Type Principle
Project NVIDIA NeMo Aligner
Domains NLP, Alignment
Related Implementation:NVIDIA_NeMo_Aligner_Retrieve_Model_State_Dict
Last Updated 2026-02-07 00:00 GMT

Overview

Mechanism for maintaining a frozen copy of the initial model weights as the reference policy for KL-constrained preference optimization.

Description

DPO and related algorithms require a reference policy pi_ref to compute the KL-divergence term that prevents the trained policy pi_theta from drifting too far from the pretrained model.
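The role of pi_ref can be made precise. DPO is the closed-form solution to the standard KL-constrained reward-maximization objective, and its loss over preference pairs (y_w preferred over y_l) contains pi_ref in both policy ratios:

```latex
% KL-constrained objective that DPO solves in closed form
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot\mid x)}\big[r(x, y)\big]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big]

% DPO loss over preference pairs (y_w preferred, y_l dispreferred)
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
```

Every loss evaluation therefore needs log probabilities under pi_ref, which is why the trainer must keep some form of the frozen initial weights available throughout training.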

In NeMo Aligner, the reference policy is implemented through two strategies depending on the training mode:

Full-parameter training:

  • The model's full state dict is copied to CPU memory before training begins.
  • During each forward pass, the reference weights are temporarily swapped into the model to compute reference log probabilities.
  • After reference computation, the training weights are restored.
  • This approach avoids maintaining two GPU copies of the model at the cost of CPU memory and swap overhead.
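The swap strategy above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed helper names (`snapshot_to_cpu`, `forward_with_reference` are hypothetical), not NeMo Aligner's actual implementation:

```python
import torch
import torch.nn as nn

def snapshot_to_cpu(model: nn.Module) -> dict:
    """Clone every weight of the model's current state dict onto the CPU."""
    return {k: v.detach().clone().cpu() for k, v in model.state_dict().items()}

@torch.no_grad()
def forward_with_reference(model: nn.Module, ref_state: dict, inputs):
    """Run a forward pass under the frozen reference weights, then restore."""
    train_state = snapshot_to_cpu(model)    # save current training weights
    model.load_state_dict(ref_state)        # swap in the frozen reference
    try:
        return model(inputs)
    finally:
        model.load_state_dict(train_state)  # always restore training weights

# Toy model standing in for the policy network
model = nn.Linear(4, 2)
ref_state = snapshot_to_cpu(model)          # taken once, before training starts

with torch.no_grad():                       # simulate a training update
    for p in model.parameters():
        p.add_(1.0)

x = torch.ones(1, 4)
ref_out = forward_with_reference(model, ref_state, x)  # pi_ref log-prob pass
```

Note that only one set of weights ever occupies the GPU; the cost is the host-to-device copy on every reference pass.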

PEFT/LoRA training:

  • The reference policy is implicitly the base model (adapter weights set to zero).
  • Reference log probabilities are computed by disabling the adapter weights during the forward pass.
  • This eliminates the need for a CPU copy entirely, as the frozen base model weights are the reference.
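Why the PEFT case needs no copy can be shown with a toy LoRA layer (a hand-rolled sketch, not NeMo Aligner's adapter implementation): with the adapter disabled, the layer reproduces the frozen base model exactly, so the base model itself serves as pi_ref.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: frozen base weights plus a low-rank trainable delta."""

    def __init__(self, in_f, out_f, rank=2):
        super().__init__()
        self.base = nn.Linear(in_f, out_f)
        self.base.weight.requires_grad_(False)  # base weights stay frozen
        self.base.bias.requires_grad_(False)
        # Nonzero adapter weights simulate a partially trained adapter
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.1)
        self.lora_b = nn.Parameter(torch.randn(out_f, rank) * 0.1)
        self.adapter_enabled = True

    def forward(self, x):
        out = self.base(x)
        if self.adapter_enabled:
            out = out + x @ self.lora_a.T @ self.lora_b.T
        return out

layer = LoRALinear(4, 2)
x = torch.ones(1, 4)

policy_out = layer(x)              # trained policy: base + adapter delta
layer.adapter_enabled = False
ref_out = layer(x)                 # reference policy: frozen base only
layer.adapter_enabled = True
```

Toggling a flag replaces the entire CPU snapshot and weight-swap machinery of the full-parameter path.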

Usage

The reference policy mechanism is used in DPO, IPO, and RPO training:

  • For full-parameter training, the CPU state dict copy is created automatically at initialization.
  • For PEFT training, reference log probs are computed with adapter weights disabled -- no extra memory required.
  • Memory requirement: full-parameter DPO effectively doubles the model memory footprint due to the CPU reference copy.
  • The weight swap operation adds latency to each training step proportional to the model size.
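A back-of-envelope calculation makes the memory point above concrete. The 7B parameter count and bf16 precision here are illustrative assumptions, not NeMo Aligner defaults:

```python
# Weight memory for full-parameter DPO with a CPU reference copy
params = 7e9                       # 7B-parameter model (assumed)
bytes_per_param = 2                # bf16 weights (assumed)

gpu_weights_gb = params * bytes_per_param / 1e9
cpu_reference_gb = gpu_weights_gb  # full CPU snapshot of the initial weights

print(f"GPU weight memory:      {gpu_weights_gb:.0f} GB")
print(f"CPU reference copy:     {cpu_reference_gb:.0f} GB")
print(f"Total weight footprint: {gpu_weights_gb + cpu_reference_gb:.0f} GB")
```

Optimizer states and activations come on top of this; the point is only that the reference snapshot doubles the weight footprint, shifted to host RAM.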

Theoretical Basis

The reference policy ensures optimization stays close to the pretrained model. The DPO loss implicitly defines a reward function:

r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ) + beta * log Z(x)

Without the reference policy, the model could collapse to degenerate solutions -- for example, assigning all probability mass to a single response regardless of the prompt, or exploiting spurious patterns in the preference data.

The KL constraint ensures:

KL(pi_theta || pi_ref) remains bounded

Equivalent to constraining the implicit reward:
  |r(x, y)| = |beta * log( pi_theta(y|x) / pi_ref(y|x) )| stays moderate

For PEFT training, the reference is exact (base model weights are literally unchanged), while for full-parameter training the reference is a frozen snapshot from initialization.

Pseudo-code

FUNCTION initialize_reference_policy(model, is_peft):
    IF is_peft:
        # No copy needed; base model IS the reference
        RETURN None
    ELSE:
        # Copy full state dict to CPU
        ref_state_dict = {}
        FOR each (name, param) in model.state_dict():
            ref_state_dict[name] = param.clone().to("cpu")
        RETURN ref_state_dict

FUNCTION compute_reference_log_probs(model, ref_state_dict, is_peft, inputs):
    IF is_peft:
        # Disable adapters to get base model behavior
        disable_adapter_weights(model)
        ref_logprobs = model.forward(inputs)
        enable_adapter_weights(model)
    ELSE:
        # Swap in reference weights
        current_state = save_current_weights(model)
        load_state_dict(model, ref_state_dict)
        ref_logprobs = model.forward(inputs)
        load_state_dict(model, current_state)

    RETURN ref_logprobs

Related Pages

Knowledge Sources
