
Principle:NVIDIA NeMo Aligner DPO Training

From Leeroopedia


Principle: DPO Training
Type Principle
Project NVIDIA NeMo Aligner
Domains NLP, Alignment
Related Implementation:NVIDIA_NeMo_Aligner_DPOTrainer_Fit
Last Updated 2026-02-07 00:00 GMT

Overview

A Direct Preference Optimization (DPO) training loop that aligns language models on preference pairs, without training a separate reward model.

Description

DPO reformulates the RLHF objective as a supervised classification problem on preference pairs. Instead of training a separate reward model and running online RL, DPO directly optimizes the policy using the implicit reward defined by the log-probability ratio between the current policy and a reference policy.

The training loop:

  • Maintains a frozen reference policy (either as CPU-stored weights or the initial PEFT adapter state).
  • Computes log probabilities for both chosen and rejected responses under both the current and reference policies.
  • Optimizes the DPO loss (or one of its variants) to increase the relative probability of chosen over rejected responses.
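Each of the log probabilities above is a per-sequence quantity: the sum of per-token log-probs over the response tokens only, with prompt tokens masked out. A minimal sketch with made-up numbers (the helper name and values are illustrative, not NeMo Aligner API):

```python
def sequence_logprob(token_logprobs, response_mask):
    """Sum per-token log-probs over response positions only.

    token_logprobs: one float per token in the full (prompt + response) sequence.
    response_mask:  1 for response tokens, 0 for prompt tokens.
    Stands in for a model forward pass; values below are illustrative.
    """
    return sum(lp for lp, m in zip(token_logprobs, response_mask) if m)

# Toy 5-token sequence; the last 3 tokens are the response.
policy_lp = [-0.5, -0.2, -1.0, -0.3, -0.7]
mask = [0, 0, 1, 1, 1]
print(sequence_logprob(policy_lp, mask))  # -2.0
```

The same masking is applied to the reference-policy log-probs, so prompt tokens never contribute to the reward margin.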

NeMo Aligner supports multiple DPO loss variants:

  • Standard DPO -- binary cross-entropy on the reward margin
  • IPO (Identity Preference Optimization) -- squared loss variant
  • RPO (Relative Preference Optimization) -- forward KL, backward KL, and squared variants
  • Auxiliary SFT loss -- optional supervised fine-tuning loss on chosen responses, weighted by a configurable coefficient
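The first two variants, plus the auxiliary SFT term, can be sketched as plain functions of the reward margin (the chosen-minus-rejected log-probability-ratio difference). This is an illustrative sketch under those definitions, not NeMo Aligner's implementation; names and numbers are made up:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(margin, beta):
    """Standard DPO: binary cross-entropy on the scaled reward margin."""
    return -math.log(sigmoid(beta * margin))

def ipo_loss(margin, beta):
    """IPO: squared distance of the margin from the target 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2

def total_loss(margin, beta, sft_nll=0.0, sft_weight=0.0, loss_type="dpo"):
    """Preference loss plus optional weighted SFT loss on the chosen response."""
    pref = dpo_loss(margin, beta) if loss_type == "dpo" else ipo_loss(margin, beta)
    return pref + sft_weight * sft_nll

# margin = (chosen log-ratio) - (rejected log-ratio); made-up value below
print(round(dpo_loss(2.0, beta=0.1), 4))  # ≈ 0.5981
```

Note the qualitative difference: the DPO loss keeps pushing the margin up (with diminishing gradient), while IPO regresses it toward a fixed target, which limits over-optimization.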

Usage

Use for preference-based alignment when you want a simpler alternative to PPO that does not require online generation or a reward model server.

  • DPO training is typically more stable and less computationally expensive than PPO, since it avoids online generation and reward-model inference.
  • Requires pre-collected preference data (no online generation).
  • Supports LoRA/PEFT for parameter-efficient training.
  • No need for critic server, reward model server, or online rollout infrastructure.
  • Trade-off: cannot adapt to distribution shift during training (offline method).

Theoretical Basis

DPO loss (standard):

L_DPO = -log sigma( beta * ( log( pi_theta(y_w|x) / pi_ref(y_w|x) )
                           - log( pi_theta(y_l|x) / pi_ref(y_l|x) ) ) )

where:
  y_w = chosen (winning) response
  y_l = rejected (losing) response
  beta = temperature parameter controlling deviation from reference
  sigma = sigmoid function

IPO loss (Identity Preference Optimization):

L_IPO = ( log( pi_theta(y_w|x) / pi_ref(y_w|x) )
        - log( pi_theta(y_l|x) / pi_ref(y_l|x) )
        - 1 / (2 * beta) )^2

The reference policy pi_ref is the initial model before training. The implicit reward recovered by DPO is:

r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) ) + beta * log Z(x)

where Z(x) is a prompt-dependent partition function.
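Because log Z(x) depends only on the prompt, it cancels exactly in the chosen-minus-rejected reward margin, which is what lets DPO optimize preferences without ever estimating Z(x). A toy check with made-up log-probs:

```python
def implicit_reward(logp_policy, logp_ref, beta, log_z):
    # r(x, y) = beta * log(pi_theta / pi_ref) + beta * log Z(x)
    return beta * (logp_policy - logp_ref) + beta * log_z

beta = 0.1
# Any value of log Z(x) yields the same chosen-minus-rejected margin:
for log_z in (0.0, 3.7, -12.0):
    r_w = implicit_reward(-2.0, -2.5, beta, log_z)   # chosen response
    r_l = implicit_reward(-4.0, -3.0, beta, log_z)   # rejected response
    print(round(r_w - r_l, 6))  # 0.15 every iteration
```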

Pseudo-code

FUNCTION dpo_training_loop(model, ref_policy, dataloader, config):
    FOR each batch in dataloader:
        chosen_input, rejected_input = batch

        # Compute log probs under current policy
        chosen_logprobs = model.forward(chosen_input)
        rejected_logprobs = model.forward(rejected_input)

        # Compute log probs under reference policy
        ref_chosen_logprobs = ref_policy.forward(chosen_input)
        ref_rejected_logprobs = ref_policy.forward(rejected_input)

        # Compute reward margins
        chosen_reward = chosen_logprobs - ref_chosen_logprobs
        rejected_reward = rejected_logprobs - ref_rejected_logprobs

        # Compute DPO loss
        IF config.loss_type == "dpo":
            loss = -log_sigmoid(config.beta * (chosen_reward - rejected_reward))
        ELSE IF config.loss_type == "ipo":
            loss = (chosen_reward - rejected_reward - 1 / (2 * config.beta))^2

        # Optional auxiliary SFT loss
        IF config.sft_loss_weight > 0:
            loss = loss + config.sft_loss_weight * sft_loss(chosen_input)

        update_model(model, loss)

    RETURN model
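Assuming precomputed per-sequence log-probs, the loop body above reduces to the following loss computation (scalar stand-ins for the four forward passes; all numbers are made up, and this is a sketch rather than NeMo Aligner's code):

```python
import math

def dpo_step(batch, beta=0.1, loss_type="dpo", sft_loss_weight=0.0):
    """One loss computation from the pseudo-code, on precomputed log-probs."""
    chosen_reward = batch["chosen_logprobs"] - batch["ref_chosen_logprobs"]
    rejected_reward = batch["rejected_logprobs"] - batch["ref_rejected_logprobs"]
    margin = chosen_reward - rejected_reward
    if loss_type == "dpo":
        # -log sigmoid(beta * margin), in the numerically stable softplus form
        loss = math.log(1.0 + math.exp(-beta * margin))
    else:  # "ipo"
        loss = (margin - 1.0 / (2.0 * beta)) ** 2
    if sft_loss_weight > 0:
        loss += sft_loss_weight * (-batch["chosen_logprobs"])  # NLL of chosen
    return loss

batch = {"chosen_logprobs": -2.0, "ref_chosen_logprobs": -2.5,
         "rejected_logprobs": -4.0, "ref_rejected_logprobs": -3.0}
print(round(dpo_step(batch), 4))  # ≈ 0.621 (margin = 1.5, beta = 0.1)
```

A real training step would backpropagate through `chosen_logprobs` and `rejected_logprobs` only; the reference-policy terms are constants produced by the frozen model.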
