Workflow:NVIDIA NeMo Aligner DPO Training

Knowledge Sources	NeMo-Aligner, NeMo Aligner DPO Guide, NeMo-Aligner Paper, DPO Paper
Domains	LLMs, Preference_Optimization, Model_Alignment
Last Updated	2026-02-07 22:00 GMT

Overview

End-to-end Direct Preference Optimization (DPO) training process that aligns language models using preference pairs without requiring a separate reward model or reinforcement learning.

Description

This workflow implements DPO alignment, a simpler alternative to RLHF that optimizes the policy model directly on preference data. DPO reparameterizes the RLHF objective so that the reward is expressed implicitly through the policy, which allows direct optimization from preference pairs (chosen vs. rejected responses) without fitting an explicit reward model. Training maintains a reference policy, either as a frozen copy of the full model or implicitly through LoRA, and optimizes the actor to increase the likelihood of chosen responses relative to rejected ones. The beta parameter (ref_policy_kl_penalty) controls how strongly the actor is penalized for diverging from the reference policy. The workflow also supports the IPO (Identity Preference Optimization) and RPO (Reward-aware Preference Optimization) variants, as well as sequence packing for improved GPU utilization.

Key outputs:

A DPO-aligned model checkpoint
Training metrics including accuracy, chosen/rejected rewards, and loss

Scope:

From a pretrained/SFT .nemo checkpoint and preference pair data to an aligned model

Usage

Use this workflow when you have a preference dataset of chosen and rejected response pairs and want to align a model without training a separate reward model or running multi-process RLHF. DPO is recommended when you have high-quality preference data and want a simpler training pipeline than PPO. It is commonly applied after SFT.

Execution Steps

Step 1: Prepare preference dataset

Format the preference data into JSONL files where each line contains a prompt, a chosen (preferred) response, and a rejected (dispreferred) response. The prompt and responses must follow the exact template format used during SFT training (e.g., extra_id template for NeMo models). For RPO variants, include chosen_reward and rejected_reward fields.

Key considerations:

Each line must have prompt, chosen_response, and rejected_response fields
The prompt template must exactly match the format used during SFT
The dataset must contain at least as many samples as the global batch size
Create separate train and validation JSONL files
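
The data format described above can be sketched in a few lines of Python. This is a minimal illustration, not part of NeMo-Aligner: the helper names, the placeholder prompt strings, the file names, and the choice to hold out the last record for validation are all assumptions made for the example. Only the three field names come from the workflow description.

```python
import json
import os
import tempfile

# Field names taken from the workflow description above.
REQUIRED_FIELDS = ("prompt", "chosen_response", "rejected_response")


def write_preference_jsonl(records, path):
    """Validate every record, then write one JSON object per line."""
    for i, rec in enumerate(records):
        missing = [k for k in REQUIRED_FIELDS if k not in rec]
        if missing:
            raise ValueError(f"record {i} is missing fields: {missing}")
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_preference_jsonl(path):
    """Read the file back as a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder records: the prompt strings are stand-ins and must be replaced
# by prompts rendered with the exact template used during SFT.
records = [
    {"prompt": "<prompt 1 in SFT template>",
     "chosen_response": "Preferred answer 1",
     "rejected_response": "Dispreferred answer 1"},
    {"prompt": "<prompt 2 in SFT template>",
     "chosen_response": "Preferred answer 2",
     "rejected_response": "Dispreferred answer 2"},
    {"prompt": "<prompt 3 in SFT template>",
     "chosen_response": "Preferred answer 3",
     "rejected_response": "Dispreferred answer 3"},
]

# Hypothetical split: hold out the last record for validation.
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.jsonl")
val_path = os.path.join(out_dir, "val.jsonl")
write_preference_jsonl(records[:-1], train_path)
write_preference_jsonl(records[-1:], val_path)
```

For RPO variants, each record would additionally carry the chosen_reward and rejected_reward fields mentioned above.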

Step 2: Optionally pack sequences

For improved GPU utilization, optionally pack multiple preference pairs into longer sequences using the provided packing script. The script tokenizes the sequences, concatenates the chosen and rejected responses, and applies a bin-packing algorithm to minimize padding waste. Two packing strategies are available: first_fit_decreasing, which places sequences in order of decreasing length and usually yields tighter packs (it is a heuristic, not a guaranteed optimum), and first_fit_shuffle, which places them in randomized order.

Key considerations:

Pack sizes should be at least double model.encoder_seq_length
Packing requires micro_batch_size=1 and TransformerEngine enabled
Adjust global batch size to account for increased examples per pack
Both train and validation datasets should be packed
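
To make the first_fit_decreasing idea concrete, the sketch below implements the generic first-fit-decreasing heuristic on a list of sequence lengths. It is an illustration of the technique only and is not the packing script itself: the function name, the use of plain integer lengths (standing in for the tokenized length of each concatenated chosen and rejected pair), and the example numbers are assumptions.

```python
def first_fit_decreasing(lengths, pack_size):
    """Group sequence indices into packs whose total length is <= pack_size.

    Sequences are visited from longest to shortest, and each one goes into
    the first existing pack that still has room; otherwise a new pack opens.
    """
    packs = []      # list of packs, each a list of sequence indices
    free = []       # remaining capacity of each pack
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > pack_size:
            raise ValueError(f"sequence {i} (length {n}) exceeds pack size {pack_size}")
        for p, room in enumerate(free):
            if n <= room:
                packs[p].append(i)
                free[p] -= n
                break
        else:
            # No existing pack had room, so open a new one.
            packs.append([i])
            free.append(pack_size - n)
    return packs


# Hypothetical lengths of five concatenated preference pairs.
lengths = [5, 3, 4, 2, 6]
packs = first_fit_decreasing(lengths, pack_size=8)
padding = sum(8 - sum(lengths[i] for i in pack) for pack in packs)
```

A shuffled variant would replace the sorted order with a random permutation of the indices and keep the same placement loop.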

Step 3: Configure DPO training

Set up the Hydra configuration specifying the pretrained checkpoint, data paths, batch sizes, learning rate, and DPO-specific hyperparameters. The key DPO parameter is ref_policy_kl_penalty (beta), which controls the strength of the KL divergence constraint against the reference policy. Choose between full-parameter training (requires storing a reference model copy) or LoRA-based training (reference model is implicit).

Key considerations:

Beta values of 0.1 to 1.0 generally work well
Lower global batch sizes tend to underperform; 256 or 512 is recommended
Start with 1 epoch and increase if needed (rarely beyond 3)
For LoRA DPO, no separate reference model storage is needed
Set dpo.preference_loss to select DPO, IPO, or RPO variants
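
The guidance above can be turned into a quick pre-launch sanity check. The helper below is a hypothetical convenience function, not part of NeMo-Aligner or Hydra; its name and arguments are assumptions, and the thresholds simply restate the recommendations in this step and the dataset-size requirement from Step 1.

```python
def check_dpo_settings(beta, global_batch_size, num_train_samples, max_epochs):
    """Return a list of notes for settings that fall outside the guidance above.

    An empty list means nothing was flagged. These are rules of thumb, so a
    note is a prompt to double-check the setting, not a hard error.
    """
    notes = []
    if num_train_samples < global_batch_size:
        notes.append("dataset has fewer samples than the global batch size")
    if not 0.1 <= beta <= 1.0:
        notes.append("beta (ref_policy_kl_penalty) is outside the 0.1 to 1.0 range")
    if global_batch_size < 256:
        notes.append("global batch size is below the recommended 256 or 512")
    if max_epochs > 3:
        notes.append("more than 3 epochs is rarely needed")
    return notes


# Hypothetical settings for a run on 10,000 preference pairs.
notes = check_dpo_settings(beta=0.1, global_batch_size=512,
                           num_train_samples=10_000, max_epochs=1)
for note in notes:
    print("check:", note)
```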

Step 4: Launch DPO training

Execute the DPO training script, which loads the SFT model as a MegatronGPTDPOModel, saves the reference policy weights (for full-parameter training), builds the preference dataloaders, and runs the DPOTrainer fit loop. The trainer computes log probabilities for the chosen and rejected responses under both the actor and the reference policy, then optimizes the DPO loss.

What happens:

The SFT model is loaded and the reference policy state dict is saved
For each batch, forward passes compute log probs for chosen and rejected responses
The implicit reward of a response is beta times the log ratio of its probability under the actor to its probability under the reference policy
The DPO loss encourages the model to prefer chosen over rejected responses
Accuracy metric tracks how often chosen rewards exceed rejected rewards
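
The per-batch computation listed above can be sketched in plain Python using the standard DPO loss from the DPO paper. This is a simplified illustration under stated assumptions, not the MegatronGPTDPOModel implementation: the function names are hypothetical, each input is assumed to be the summed log probability of one response's tokens, and the IPO and RPO variants are not covered.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_batch_metrics(actor_chosen, actor_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Compute the DPO loss and monitoring metrics for one batch.

    Each argument is a list with one summed response log probability per
    preference pair (an assumption of this sketch).
    """
    n = len(actor_chosen)
    if n == 0:
        raise ValueError("empty batch")
    # Implicit reward: beta * (log pi_actor - log pi_ref) for each response.
    rewards_chosen = [beta * (a - r) for a, r in zip(actor_chosen, ref_chosen)]
    rewards_rejected = [beta * (a - r) for a, r in zip(actor_rejected, ref_rejected)]
    # DPO loss: negative log-sigmoid of the reward margin, averaged over pairs.
    losses = [-log_sigmoid(c - r) for c, r in zip(rewards_chosen, rewards_rejected)]
    return {
        "loss": sum(losses) / n,
        "accuracy": sum(c > r for c, r in zip(rewards_chosen, rewards_rejected)) / n,
        "rewards_chosen_mean": sum(rewards_chosen) / n,
        "rewards_rejected_mean": sum(rewards_rejected) / n,
    }


# Hypothetical log probabilities for a single pair: the actor has moved
# toward the chosen response and away from the rejected one.
metrics = dpo_batch_metrics(
    actor_chosen=[-10.0], actor_rejected=[-15.0],
    ref_chosen=[-12.0], ref_rejected=[-12.0],
    beta=0.5,
)
```

When the actor is identical to the reference policy, both implicit rewards are zero, so the loss equals log 2 and the accuracy is 0 because no chosen reward strictly exceeds its rejected counterpart. This is one reason early accuracy values can look low.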

Step 5: Evaluate and export

Monitor the training accuracy (percentage of correctly ranked pairs) and the gap between chosen and rejected reward means. After training, the aligned model checkpoint is saved and can be used for inference or further fine-tuning. Evaluation can be performed using NeMo's inference scripts with the model's prompt template.

Key considerations:

Accuracy should generally increase during training
rewards_chosen_mean should consistently exceed rewards_rejected_mean
Absolute accuracy values may be low but the trend should be upward
The final model inherits the prompt template from the SFT model
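
The monitoring checks above can be applied to logged metrics with a small helper. The sketch below is hypothetical: the function name, the structure of the history list, and the example values are assumptions, and comparing only the first and last logged steps is a crude stand-in for inspecting the full curve. The metric names rewards_chosen_mean and rewards_rejected_mean follow the ones used in this step.

```python
def summarize_run(history):
    """Summarize the trend of DPO monitoring metrics.

    history is a list of dicts, one per logged step, each holding
    'accuracy', 'rewards_chosen_mean', and 'rewards_rejected_mean'.
    """
    if not history:
        raise ValueError("history is empty")
    margins = [h["rewards_chosen_mean"] - h["rewards_rejected_mean"] for h in history]
    return {
        # Positive when accuracy ended higher than it started.
        "accuracy_delta": history[-1]["accuracy"] - history[0]["accuracy"],
        "final_margin": margins[-1],
        # True when chosen rewards exceeded rejected rewards at every logged step.
        "margin_always_positive": all(m > 0 for m in margins),
    }


# Made-up values for illustration only; these are not real training results.
history = [
    {"accuracy": 0.51, "rewards_chosen_mean": 0.02, "rewards_rejected_mean": -0.01},
    {"accuracy": 0.58, "rewards_chosen_mean": 0.10, "rewards_rejected_mean": -0.05},
    {"accuracy": 0.64, "rewards_chosen_mean": 0.21, "rewards_rejected_mean": -0.12},
]
summary = summarize_run(history)
```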
