Principle:NVIDIA NeMo Aligner DPO Training
| Principle: DPO Training | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | NLP, Alignment |
| Related | Implementation:NVIDIA_NeMo_Aligner_DPOTrainer_Fit |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A Direct Preference Optimization (DPO) training loop that aligns language models using preference pairs, without training a separate reward model.
Description
DPO reformulates the RLHF objective as a supervised classification problem on preference pairs. Instead of training a separate reward model and running online RL, DPO directly optimizes the policy using the implicit reward defined by the log-probability ratio between the current policy and a reference policy.
The training loop:
- Maintains a frozen reference policy (either as CPU-stored weights or the initial PEFT adapter state).
- Computes log probabilities for both chosen and rejected responses under both the current and reference policies.
- Optimizes the DPO loss (or one of its variants) to increase the relative probability of chosen over rejected responses.
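The log-probability step above can be sketched as follows. This is an illustrative, pure-Python version (not NeMo Aligner's API): per-token log-probabilities are read off a log-softmax over the logits, prompt positions are masked out, and the response positions are summed into a sequence log-probability.

```python
import math

def sequence_logprob(logits, target_ids, response_mask):
    """Sum of per-token log-probabilities of the target tokens,
    counting only response positions (prompt tokens are masked out).
    Illustrative sketch; real implementations operate on batched tensors."""
    total = 0.0
    for step_logits, tok, is_response in zip(logits, target_ids, response_mask):
        # Stable log-softmax normalizer over the vocabulary at this position
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        if is_response:
            total += step_logits[tok] - log_z
    return total

# Toy example: 3 positions over a vocabulary of 4; the last two
# positions belong to the response, the first to the prompt.
logits = [[2.0, 0.5, 0.1, -1.0],
          [0.2, 1.5, 0.3, 0.0],
          [1.0, 1.0, 3.0, 0.5]]
target_ids = [0, 1, 2]
mask = [False, True, True]
lp = sequence_logprob(logits, target_ids, mask)
```

The same function is applied four times per batch: chosen and rejected responses, each under the current policy and the frozen reference policy.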
NeMo Aligner supports multiple DPO loss variants:
- Standard DPO -- binary cross-entropy on the reward margin
- IPO (Identity Preference Optimization) -- squared loss variant
- RPO (Relative Preference Optimization) -- forward KL, backward KL, and squared variants
- Auxiliary SFT loss -- optional supervised fine-tuning loss on chosen responses, weighted by a configurable coefficient
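A minimal scalar sketch of the DPO and IPO variants and the auxiliary SFT term, assuming the inputs are already the policy-vs-reference log-ratios log(pi/pi_ref); the function name and signature are illustrative, not NeMo Aligner's:

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def preference_loss(chosen_reward, rejected_reward, beta=0.1,
                    loss_type="dpo", sft_loss=0.0, sft_weight=0.0):
    """chosen_reward / rejected_reward: log(pi_theta/pi_ref) for the
    chosen and rejected responses. Illustrative sketch."""
    margin = chosen_reward - rejected_reward
    if loss_type == "dpo":
        # Binary cross-entropy on the reward margin
        loss = -log_sigmoid(beta * margin)
    elif loss_type == "ipo":
        # Squared loss pulling the margin toward 1/(2*beta)
        loss = (margin - 1.0 / (2.0 * beta)) ** 2
    else:
        raise ValueError(f"unknown loss_type: {loss_type}")
    # Optional auxiliary SFT term on the chosen response
    return loss + sft_weight * sft_loss
```

With a zero margin the standard DPO loss is log 2, the usual cross-entropy value for a 50/50 prediction; the IPO loss reaches zero exactly when the margin equals 1/(2*beta).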
Usage
Use for preference-based alignment when you want a simpler alternative to PPO that does not require online generation or a reward model server.
- DPO training is typically more stable and computationally cheaper than PPO.
- Requires pre-collected preference data (no online generation).
- Supports LoRA/PEFT for parameter-efficient training.
- No need for critic server, reward model server, or online rollout infrastructure.
- Trade-off: cannot adapt to distribution shift during training (offline method).
Theoretical Basis
DPO loss (standard):
L_DPO = -log sigma( beta * ( log( pi_theta(y_w|x) / pi_ref(y_w|x) )
                           - log( pi_theta(y_l|x) / pi_ref(y_l|x) ) ) )
where:
y_w = chosen (winning) response
y_l = rejected (losing) response
beta = temperature parameter controlling deviation from reference
sigma = sigmoid function
IPO loss (Identity Preference Optimization):
L_IPO = ( log( pi_theta(y_w|x) / pi_ref(y_w|x) )
        - log( pi_theta(y_l|x) / pi_ref(y_l|x) )
        - 1 / (2 * beta) )^2
The reference policy pi_ref is the initial model before training. The implicit reward is:
r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ) + beta * log Z(x)
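The partition term beta * log Z(x) depends only on the prompt, so it cancels when rewards for two responses to the same prompt are compared; this is why the loss never needs Z(x). A small numeric check (all values illustrative):

```python
beta = 0.1
log_z = 2.3  # arbitrary log-partition value for one prompt

def implicit_reward(logp_policy, logp_ref):
    # r(x, y) = beta * log(pi_theta/pi_ref) + beta * log Z(x)
    return beta * (logp_policy - logp_ref) + beta * log_z

r_chosen = implicit_reward(-12.0, -14.0)
r_rejected = implicit_reward(-15.0, -13.0)
margin = r_chosen - r_rejected

# The same margin computed without Z(x): the beta*log Z(x) terms cancel.
margin_no_z = beta * ((-12.0 - (-14.0)) - (-15.0 - (-13.0)))
```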
Pseudo-code
FUNCTION dpo_training_loop(model, ref_policy, dataloader, config):
    FOR each batch in dataloader:
        chosen_input, rejected_input = batch

        # Compute log probs under current policy
        chosen_logprobs = model.forward(chosen_input)
        rejected_logprobs = model.forward(rejected_input)

        # Compute log probs under frozen reference policy
        ref_chosen_logprobs = ref_policy.forward(chosen_input)
        ref_rejected_logprobs = ref_policy.forward(rejected_input)

        # Implicit rewards: log-probability ratios vs. the reference
        chosen_reward = chosen_logprobs - ref_chosen_logprobs
        rejected_reward = rejected_logprobs - ref_rejected_logprobs

        # Preference loss on the reward margin
        IF config.loss_type == "dpo":
            loss = -log_sigmoid(config.beta * (chosen_reward - rejected_reward))
        ELSE IF config.loss_type == "ipo":
            loss = (chosen_reward - rejected_reward - 1 / (2 * config.beta))^2

        # Optional auxiliary SFT loss on the chosen response
        IF config.sft_loss_weight > 0:
            loss = loss + config.sft_loss_weight * sft_loss(chosen_input)

        update_model(model, loss)
    RETURN model
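To see why the gradient of the DPO loss pushes the margin in the right direction, here is a tiny runnable analogue of the loop above, assuming (for illustration only) a single scalar parameter delta that shifts the chosen log-ratio up and the rejected one down:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

beta, lr = 0.1, 1.0
delta = 0.0                       # toy "policy" parameter
chosen_ratio, rejected_ratio = 0.0, 0.0  # initial log(pi/pi_ref) values

for _ in range(200):
    margin = (chosen_ratio + delta) - (rejected_ratio - delta)
    # For L = -log(sigmoid(beta * margin)): dL/dmargin = -beta * sigmoid(-beta * margin),
    # and dmargin/ddelta = 2, so:
    grad = -sigmoid(-beta * margin) * beta * 2
    delta -= lr * grad            # gradient descent step
```

The gradient magnitude is proportional to sigmoid(-beta * margin): updates are large while the model still confuses chosen and rejected, and decay smoothly as the margin grows, which is one source of DPO's training stability.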
Related Pages
- Implementation:NVIDIA_NeMo_Aligner_DPOTrainer_Fit
- Heuristic:NVIDIA_NeMo_Aligner_Higher_Stability_Log_Probs
- Heuristic:NVIDIA_NeMo_Aligner_DPO_Sequence_Packing_Tips