Workflow: OpenRLHF DPO Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RLHF, DPO, Alignment |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
End-to-end process for aligning a language model using Direct Preference Optimization (DPO) on human preference pairs without training a separate reward model.
Description
This workflow implements offline preference alignment using DPO, which directly optimizes the policy model on preference pairs without requiring a separate reward model or RL training loop. It loads both a trainable policy model and a frozen reference model, then trains the policy to increase the likelihood gap between chosen and rejected responses relative to the reference model. Variants include IPO (Identity Preference Optimization) and cDPO (with label smoothing). The approach is simpler and more stable than PPO but operates on a fixed offline dataset.
Usage
Execute this workflow when you have a preference dataset (chosen/rejected pairs) and want to align your model without the complexity of training a reward model and running PPO. DPO is suitable when you have high-quality static preference data and do not need online data generation. It is simpler to implement and tune than PPO, though it may be less effective when iterating on data.
Execution Steps
Step 1: Configure distributed strategy
Initialize the DeepSpeed training strategy with ZeRO-3 parallelism. Configure precision settings and gradient accumulation for handling the memory requirements of loading two models simultaneously (policy and reference).
Key considerations:
- Two full models must fit in memory (policy + reference)
- Reference model can be offloaded to CPU to reduce GPU memory pressure
- ZeRO-3 shards both models across GPUs
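The strategy configuration above can be sketched as a DeepSpeed-style config dict. This is a minimal illustration, not the exact config OpenRLHF generates; the key names follow DeepSpeed's documented JSON schema, and the values (micro-batch size, accumulation steps, persistence threshold) are assumptions for the example.

```python
# Sketch of a DeepSpeed ZeRO-3 config for DPO training, where two full
# models (policy + reference) must fit in memory. Values are illustrative.
def make_zero3_config(micro_batch: int = 1, accum_steps: int = 8,
                      offload_params: bool = True) -> dict:
    cfg = {
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": accum_steps,
        "bf16": {"enabled": True},          # mixed precision
        "gradient_clipping": 1.0,
        "zero_optimization": {
            "stage": 3,                     # shard params, grads, optimizer states
            "stage3_param_persistence_threshold": 10_000,
        },
    }
    if offload_params:
        # Offloading parameters to CPU trades GPU memory for PCIe traffic;
        # useful when the frozen reference model pushes memory past the limit.
        cfg["zero_optimization"]["offload_param"] = {"device": "cpu",
                                                     "pin_memory": True}
    return cfg

config = make_zero3_config()
```

The effective batch size per GPU is `train_micro_batch_size_per_gpu * gradient_accumulation_steps`, which is the main lever for fitting both models alongside activations.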
Step 2: Load policy and reference models
Load the policy model (the model being trained) and the reference model (a frozen copy for computing the implicit reward). Both are typically initialized from the same SFT checkpoint. The reference model is set to evaluation mode and its gradients are disabled.
Key considerations:
- Both models start from the same SFT checkpoint
- The reference model remains frozen throughout training
- CPU offloading of the reference model is recommended for memory efficiency
- LoRA can be applied to the policy model for parameter-efficient training
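The policy/reference setup reduces to: load the same weights twice, then freeze one copy. A minimal sketch using a toy module as a stand-in for the SFT checkpoint (a real run loads the pretrained model through the framework's loader):

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in for the SFT checkpoint; in practice both models are
# initialized from the same pretrained weights.
policy = nn.Linear(16, 16)

# Reference model: identical weights, frozen, eval mode (no dropout,
# no gradient tracking, never updated during training).
ref = copy.deepcopy(policy)
ref.eval()
for p in ref.parameters():
    p.requires_grad_(False)
```

Freezing via `requires_grad_(False)` means the reference model contributes no gradients or optimizer state, which is why CPU-offloading it is cheap.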
Step 3: Prepare preference dataset
Load the preference dataset containing chosen and rejected response pairs. Tokenize both responses using the model tokenizer with appropriate chat templates. The dataset is created with the DPO flag to handle paired input formatting.
Key considerations:
- The dataset must contain matched pairs of chosen and rejected responses per prompt
- Chat templates must match the model family
- Maximum sequence length should accommodate the longest responses
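One preference example can be sketched as below. The whitespace `tokenize` function and the field names are stand-ins for illustration; real code uses the model tokenizer with its chat template applied to the prompt.

```python
def build_pair(prompt, chosen, rejected, tokenize=str.split, max_len=512):
    """Tokenize one preference pair. Both responses share the same prompt;
    `tokenize` here is a whitespace stand-in for a real tokenizer."""
    prompt_ids = tokenize(prompt)
    chosen_ids = (prompt_ids + tokenize(chosen))[:max_len]
    rejected_ids = (prompt_ids + tokenize(rejected))[:max_len]
    # The DPO loss is computed only on response tokens, so the prompt
    # length is recorded to mask out prompt positions.
    return {"chosen": chosen_ids,
            "rejected": rejected_ids,
            "prompt_len": len(prompt_ids)}

pair = build_pair("Explain DPO .",
                  "It optimizes preferences directly .",
                  "I do not know .")
```

Keeping the chosen and rejected sequences in the same example (rather than separate rows) is what makes the paired formatting flag necessary.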
Step 4: Set up optimizer and scheduler
Configure the optimizer with a learning rate tuned for DPO (typically lower than SFT, e.g., 5e-7). Set up cosine learning rate scheduling with warmup.
Key considerations:
- DPO typically uses lower learning rates than SFT
- The beta parameter controls the strength of the KL constraint (default 0.1)
- Label smoothing can improve robustness to noisy preference labels
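The cosine-with-warmup schedule can be written down directly. This is a generic sketch of the schedule shape, not OpenRLHF's exact scheduler; the warmup ratio and minimum LR are assumed values.

```python
import math

def lr_at(step, total_steps, max_lr=5e-7, warmup_ratio=0.03, min_lr=0.0):
    """Linear warmup followed by cosine decay. max_lr=5e-7 reflects the
    lower learning rates typical for DPO relative to SFT."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from 0 toward max_lr over the warmup window.
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Cosine decay from max_lr down to min_lr.
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The beta parameter from the considerations above does not appear in the schedule; it enters the loss itself (Step 5) as the scale on the implicit reward margin.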
Step 5: Train with DPO objective
Execute the DPO training loop. For each batch, compute log-probabilities of chosen and rejected responses under both the policy and reference models. Calculate the DPO loss that pushes the policy to prefer chosen over rejected responses relative to the reference baseline. Optionally add an NLL auxiliary loss on chosen responses to prevent degeneration.
Key considerations:
- The DPO loss implicitly defines a reward through the log-probability ratio
- IPO variant replaces the sigmoid log-loss with a squared loss on the implicit reward margin, yielding more conservative updates
- The NLL loss coefficient prevents the model from degrading on general generation quality
- Monitor both loss convergence and chosen/rejected accuracy
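The per-example loss described above can be sketched from the four sequence log-probabilities. This is the standard DPO/cDPO/IPO formulation in pure Python; the function name and signature are illustrative, not OpenRLHF's API.

```python
import math

def preference_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                    beta=0.1, label_smoothing=0.0, ipo=False):
    """Loss from sequence log-probs under the policy (pi_*) and the frozen
    reference (ref_*). h is the implicit reward margin: the log-prob gap
    between chosen and rejected, measured relative to the reference."""
    h = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)

    if ipo:
        # IPO: squared loss pulling the margin toward 1/(2*beta).
        return (h - 1.0 / (2.0 * beta)) ** 2

    def log_sigmoid(x):  # numerically stable log(sigmoid(x))
        return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

    # DPO: -log sigmoid(beta * h). cDPO mixes in the flipped-label term
    # with weight label_smoothing to tolerate noisy preference labels.
    return (-(1 - label_smoothing) * log_sigmoid(beta * h)
            - label_smoothing * log_sigmoid(-beta * h))
```

An optional NLL auxiliary term on the chosen response would be added to this loss with a small coefficient; it is omitted here for brevity.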
Step 6: Save aligned model
Save the trained policy model weights and tokenizer. For LoRA training, save only the adapter weights.
Key considerations:
- Only the policy model is saved (reference model is not modified)
- The aligned model can be used directly for inference or as the starting point for iterative DPO
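The save step amounts to persisting only the policy's weights. A minimal sketch with a toy module and a temporary directory (a real run saves the full model plus tokenizer via the framework's save utilities; the filename here is illustrative):

```python
import os
import tempfile
import torch
import torch.nn as nn

# Toy stand-in for the trained DPO policy.
policy = nn.Linear(8, 8)

save_dir = tempfile.mkdtemp()
path = os.path.join(save_dir, "policy.pt")

# Only the policy's weights are persisted; the frozen reference model
# was never modified and is simply discarded.
torch.save(policy.state_dict(), path)

# Reload into a fresh module to verify the round trip.
reloaded = nn.Linear(8, 8)
reloaded.load_state_dict(torch.load(path))
```

For LoRA training, the same pattern applies to the adapter's state dict alone, which is typically a few orders of magnitude smaller than the base model.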