
Workflow: Eric Mitchell Direct Preference Optimization (DPO) Preference Training

From Leeroopedia


Knowledge Sources
Domains LLMs, Preference_Learning, RLHF
Last Updated 2026-02-08 01:00 GMT

Overview

End-to-end process for training a language model from human preference data using Direct Preference Optimization (DPO), starting from a supervised fine-tuned checkpoint.

Description

This workflow covers the second and core stage of the DPO pipeline: preference-based training. Given an SFT checkpoint, it loads both a trainable policy model and a frozen reference model from the same weights, then trains the policy using the DPO loss function (Equation 7 of the DPO paper). The DPO loss directly optimizes the policy to prefer chosen responses over rejected ones, bypassing the need for an explicit reward model. The workflow supports three loss variants: standard DPO, conservative DPO (with label smoothing for noisy preferences), and IPO (Identity Preference Optimization). It handles paired preference data where chosen and rejected responses are concatenated for efficient single-pass forward computation.
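The three loss variants can be sketched as follows. This is a minimal, illustrative implementation operating on per-sequence log probabilities; function and argument names are assumptions, not the repository's exact API:

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps,
                    beta=0.1, label_smoothing=0.0, ipo=False):
    """DPO-family losses over per-sequence log probabilities (shape: [batch])."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios  # the preference "margin"

    if ipo:
        # IPO: regress the margin toward 1/(2*beta) with a squared error
        losses = (logits - 1 / (2 * beta)) ** 2
    else:
        # Conservative DPO; label_smoothing=0 recovers the original DPO loss
        losses = (-F.logsigmoid(beta * logits) * (1 - label_smoothing)
                  - F.logsigmoid(-beta * logits) * label_smoothing)

    # Implicit rewards, detached for logging accuracy and margins
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps).detach()
    return losses, chosen_rewards, rejected_rewards
```

Setting `label_smoothing` to a small positive value (e.g. 0.1) discounts the confidence placed on each preference label, which is the conservative-DPO behavior described above for noisy data.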

Usage

Execute this workflow after completing SFT training and obtaining a checkpoint (policy.pt). You need the same preference dataset used for SFT and a beta parameter controlling the strength of the KL divergence constraint from the reference model (typically 0.1 to 0.5). Use this when you want to align a language model with human preferences without training a separate reward model.

Execution Steps

Step 1: SFT_Checkpoint_Preparation

Locate the SFT checkpoint from the previous training stage. The checkpoint path follows the pattern {run_dir}/LATEST/policy.pt from a completed SFT run. This checkpoint contains the policy state dict that will initialize both the trainable policy and the frozen reference model.

Key considerations:

  • The checkpoint must contain a valid state dict under the 'state' key
  • Both policy and reference model will be initialized from the same weights
  • The checkpoint step and metrics are logged for traceability

Step 2: Configuration_For_DPO

Configure the training run with DPO-specific settings via Hydra. Set loss=dpo (or loss=ipo for IPO variant), specify loss.beta for the KL divergence temperature, and provide the SFT checkpoint path via model.archive. Optionally configure conservative DPO via loss.label_smoothing or reference-free mode via loss.reference_free.

Key considerations:

  • loss.beta controls the deviation allowed from the reference model (0.1-0.5 typical range)
  • loss.label_smoothing enables conservative DPO for noisy preference data (0 gives original DPO)
  • loss.reference_free=true uses a uniform reference distribution instead of the loaded reference model
  • The same model config and trainer class from SFT should generally be reused
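A representative Hydra launch under these conventions might look like the following; the script name, paths, and values are placeholders, not guaranteed flags:

```shell
# Standard DPO from an SFT checkpoint (illustrative values)
python -u train.py loss=dpo loss.beta=0.1 \
    model.archive=/path/to/sft_run/LATEST/policy.pt

# Conservative DPO for noisy preference labels
python -u train.py loss=dpo loss.beta=0.1 loss.label_smoothing=0.1 \
    model.archive=/path/to/sft_run/LATEST/policy.pt

# IPO variant
python -u train.py loss=ipo loss.beta=0.1 \
    model.archive=/path/to/sft_run/LATEST/policy.pt
```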

Step 3: Dual_Model_Loading

Load two instances of the causal language model: a trainable policy model and a frozen reference model. Both are initialized from the base HuggingFace model weights, then the SFT checkpoint state dict is loaded into both. Dropout is disabled on both models. The reference model remains frozen throughout training to provide the baseline log probabilities for the DPO loss.

Key considerations:

  • The reference model is never updated during training (used with torch.no_grad)
  • Both models must fit in GPU memory simultaneously, doubling the memory requirement compared to SFT
  • For FSDP, both models are sharded independently across all GPUs
  • The reference model dtype can differ from the policy dtype to save memory
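A framework-agnostic sketch of the dual-model setup, assuming the base model and SFT state dict are already in hand (the helper name is illustrative):

```python
import copy
import torch

def make_policy_and_reference(base_model, sft_state_dict):
    """Initialize a trainable policy and a frozen reference from the same weights."""
    base_model.load_state_dict(sft_state_dict)
    policy = base_model
    reference = copy.deepcopy(base_model)

    # Disable dropout on both models, as the workflow requires
    for model in (policy, reference):
        for module in model.modules():
            if isinstance(module, torch.nn.Dropout):
                module.p = 0.0

    # Freeze the reference; its log probs are computed under torch.no_grad()
    reference.eval()
    for param in reference.parameters():
        param.requires_grad_(False)
    return policy, reference
```

In practice the reference copy can also be cast to a lower-precision dtype (e.g. `reference.half()`) to offset the doubled memory footprint noted above.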

Step 4: Preference_Data_Loading

Load the preference dataset(s) in DPO mode, where each example contains a prompt with both a chosen and a rejected response. The tokenizer processes prompt-chosen and prompt-rejected pairs, creating labels that mask prompt tokens. The batch iterator yields paired examples with all necessary input IDs, attention masks, and labels for both chosen and rejected sequences.

Key considerations:

  • Unlike SFT mode, both chosen and rejected responses are tokenized per example
  • Preference pairs specify which response is preferred via index tuples
  • Multiple datasets can be combined with dataset-specific truncation modes
  • The data format must include responses, pairs, and sft_target keys per prompt
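An illustrative per-prompt entry following the `responses`/`pairs`/`sft_target` keys described above (the prompt and responses are made up for demonstration):

```python
# One preference-dataset entry; each pair indexes into `responses` as
# (preferred_index, dispreferred_index)
example = {
    "prompt": "Explain why the sky is blue.",
    "responses": [
        "The sky is blue because of Rayleigh scattering ...",  # index 0
        "Because it reflects the ocean.",                      # index 1
    ],
    "pairs": [(0, 1)],
    # The response used as the target during the earlier SFT stage
    "sft_target": "The sky is blue because of Rayleigh scattering ...",
}

# Expanding a pair into a (chosen, rejected) training example
chosen, rejected = (example["responses"][i] for i in example["pairs"][0])
```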

Step 5: DPO_Training_Loop

Execute the DPO training loop. For each batch, run a concatenated forward pass through both the policy and reference models to compute log probabilities for chosen and rejected responses. Compute the DPO loss as the negative log-sigmoid of the scaled difference between policy and reference log-ratios. Track reward accuracy (fraction where chosen reward exceeds rejected reward) as the primary evaluation metric. Apply gradient accumulation, clipping, and optimizer updates as in SFT.

What happens:

  • Chosen and rejected inputs are concatenated into a single batch for a single forward pass (efficiency optimization for FSDP)
  • The policy model produces log probabilities that are compared against the frozen reference model
  • The DPO loss encourages the policy to increase the probability gap between chosen and rejected responses
  • Reward accuracy measures alignment quality and should increase during training

Step 6: Evaluation_And_Checkpointing

Periodically evaluate the model on a held-out set, computing DPO loss, reward accuracy, and reward margins. Optionally generate text samples from both the policy and reference models for qualitative comparison. Save checkpoints containing the policy state dict, optimizer state, and scheduler state at each evaluation point and at the end of training.

Key considerations:

  • Reward accuracy above 50% indicates the model prefers chosen over rejected responses
  • Reward margins (chosen_reward - rejected_reward) should increase during training
  • Sample generation is slow with FSDP and TensorParallel trainers; disable with sample_during_eval=false
  • The final checkpoint at LATEST/policy.pt contains the fully trained DPO model
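The two evaluation metrics above reduce to a few lines; metric names are illustrative:

```python
import torch

def preference_metrics(chosen_rewards, rejected_rewards):
    """Reward accuracy and mean margin from per-example implicit rewards."""
    accuracy = (chosen_rewards > rejected_rewards).float().mean().item()
    margin = (chosen_rewards - rejected_rewards).mean().item()
    return {"rewards/accuracy": accuracy, "rewards/margin": margin}
```

An accuracy near 0.5 at initialization is expected, since policy and reference start from identical weights; both metrics should rise as training progresses.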

Execution Diagram

GitHub URL

Workflow Repository