Workflow: allenai/open-instruct DPO Preference Tuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Preference_Optimization, Post_Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
End-to-end process for aligning language models with human preferences using Direct Preference Optimization (DPO) on chosen/rejected response pairs.
Description
This workflow trains a language model to prefer human-chosen responses over rejected alternatives without requiring a separate reward model. It uses the DPO algorithm (specifically the length-normalized dpo_norm variant, based on SimPO) to directly optimize the model policy from preference data. The implementation caches reference model logprobs and then removes the reference model from memory to reduce GPU memory usage during training. Training uses Accelerate with DeepSpeed ZeRO Stage 3 for multi-node distribution.
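Schematically, the objective being optimized is the standard DPO loss over a preference pair (chosen response y_w, rejected response y_l); the exact normalization used by the dpo_norm variant may differ in detail, but the core form is:

```latex
% Standard DPO loss for one preference pair (y_w chosen, y_l rejected):
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta\left[
  \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right]\right)
```

The dpo_norm variant additionally divides each log-ratio by the length of the corresponding response, following SimPO's length normalization, so that longer responses are not favored merely for accumulating more log-probability mass.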
The primary entry point is dpo_tune_cache.py for the Accelerate/DeepSpeed backend. An alternative dpo.py implementation uses the OLMo-core backend for supported models.
Usage
Execute this workflow when you have an SFT-trained model and a preference dataset containing chosen/rejected response pairs. This is typically the second stage of the Tulu post-training pipeline, taking an SFT checkpoint as input and producing a DPO-aligned model that feeds into the RLVR stage.
Execution Steps
Step 1: Environment_Setup
Prepare the training environment; the setup is identical to that of the SFT workflow. Ensure Accelerate, DeepSpeed, and all dependencies are available. For Beaker-based runs, build and register a Docker image from the current commit.
Key considerations:
- Same infrastructure requirements as SFT training
- DPO typically uses fewer nodes than SFT (e.g., 4 nodes vs 8 for 8B models)
Step 2: Preference_Data_Loading
Load and prepare the preference dataset containing chosen and rejected response pairs. The dataset mixer supports combining multiple preference sources with specified proportions. Data is tokenized and filtered by maximum sequence length.
Key considerations:
- Preference data must contain paired chosen/rejected responses
- The mixer_list argument specifies datasets and proportions
- Sequence length is typically shorter than SFT (2048 vs 4096 tokens)
- Both the Accelerate backend (dpo_tune_cache.py) and OLMo-core backend (dpo.py) accept the same data format
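The mixing-and-filtering logic above can be sketched as follows. This is a minimal illustration, not the open-instruct implementation: the function name, the `chosen_ids`/`rejected_ids` schema, and the proportion semantics are all hypothetical stand-ins for the real mixer.

```python
import random

def mix_preference_datasets(datasets, proportions, max_seq_len, seed=42):
    """Subsample each source by its proportion, concatenate, and drop
    pairs whose chosen or rejected sequence exceeds max_seq_len.

    `datasets` maps a source name to a list of dicts holding tokenized
    'chosen_ids' and 'rejected_ids' fields (hypothetical schema).
    """
    rng = random.Random(seed)
    mixed = []
    for name, frac in proportions.items():
        examples = datasets[name]
        n = int(len(examples) * frac)
        mixed.extend(rng.sample(examples, n))
    # Length filter: both sides of the pair must fit in the context window.
    return [ex for ex in mixed
            if len(ex["chosen_ids"]) <= max_seq_len
            and len(ex["rejected_ids"]) <= max_seq_len]
```

For example, mixing two sources at full proportion with `max_seq_len=2048` drops any pair whose chosen or rejected response is longer than 2048 tokens.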
Step 3: Reference_Logprob_Caching
Compute and cache the log probabilities from the reference model (the initial SFT checkpoint) on the entire preference dataset. After caching, the reference model is removed from GPU memory to free up resources for training. This is a key memory optimization unique to this DPO implementation.
Key considerations:
- Initial training output will be delayed while logprobs are being computed
- The reference model is the same as the model being trained (the SFT checkpoint)
- Caching avoids keeping two full model copies in memory simultaneously
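The caching step can be sketched as a single pass over the dataset that stores, for each pair, the reference model's summed log-probabilities. This is an illustrative sketch: `ref_logprob_fn` is a hypothetical stand-in for a forward pass of the reference model, and in the real implementation this loop is batched and distributed, with the reference model deleted (and its GPU memory released) once the pass completes.

```python
def cache_reference_logprobs(ref_logprob_fn, dataset):
    """Precompute reference log-probabilities for every preference pair,
    keyed by example index, so the reference model can be freed afterwards.

    `ref_logprob_fn(ids)` returns the summed token log-probability of a
    response under the reference model (hypothetical signature).
    """
    cache = {}
    for idx, ex in enumerate(dataset):
        cache[idx] = (ref_logprob_fn(ex["chosen_ids"]),
                      ref_logprob_fn(ex["rejected_ids"]))
    return cache
```

After this pass, only the cached scalars are needed at training time, so a single trainable model copy occupies GPU memory rather than two.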
Step 4: DPO_Training
Train the model using the DPO loss function, which maximizes the margin between chosen and rejected response probabilities relative to the cached reference logprobs. The default loss type is dpo_norm (length-normalized DPO, based on SimPO). Training uses gradient checkpointing and Accelerate with DeepSpeed for memory efficiency.
Key considerations:
- The dpo_beta parameter controls deviation from the reference policy (default: 5)
- Loss types include standard dpo, length-normalized dpo_norm, and SimPO variants
- Training metrics include chosen/rejected rewards, accuracy, and reward margin
- Gradient checkpointing is recommended for memory efficiency
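For a single preference pair, the length-normalized loss and the reward metrics listed above can be written out in plain Python. This is a sketch of the dpo_norm idea under the assumption that each log-ratio is divided by its response length; the exact normalization and batching in open-instruct may differ.

```python
import math

def dpo_norm_loss(policy_chosen_lp, policy_rejected_lp,
                  ref_chosen_lp, ref_rejected_lp,
                  chosen_len, rejected_len, beta=5.0):
    """Length-normalized DPO loss for one pair (illustrative sketch).

    Each *_lp argument is the summed token log-probability of the full
    response under the policy or the (cached) reference model.
    """
    # Implicit rewards: length-normalized, beta-scaled log-ratios.
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp) / chosen_len
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp) / rejected_len
    margin = chosen_reward - rejected_reward
    # Numerically stable -log(sigmoid(margin)).
    if margin > 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, chosen_reward, rejected_reward
```

The returned rewards feed the logged metrics directly: accuracy is the fraction of pairs with `chosen_reward > rejected_reward`, and the reward margin is their mean difference. When the policy still matches the reference exactly, both rewards are zero and the loss is log 2.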
Step 5: Checkpoint_Saving
Save the DPO-aligned model checkpoint at specified intervals and at training completion. The checkpoint can be uploaded to HuggingFace Hub and is directly usable as the starting point for the RLVR stage.
Key considerations:
- The saved model is compatible with downstream GRPO/RLVR training
- Checkpoint saving interval can be set per epoch or per fixed steps
- The DPO model is also usable as a standalone instruction-following model
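The two saving policies above (per epoch or per fixed number of steps) amount to a small predicate like the following. The helper name and arguments are hypothetical, not the actual open-instruct flags.

```python
def should_save(step, steps_per_epoch, save_interval=None, save_per_epoch=False):
    """Decide whether to save a checkpoint at this (1-indexed) step.

    Mirrors the two policies described above: every `save_interval`
    steps, or once at the end of each epoch (hypothetical helper).
    """
    if save_interval is not None and step % save_interval == 0:
        return True
    if save_per_epoch and step % steps_per_epoch == 0:
        return True
    return False
```

A final save at training completion happens unconditionally, regardless of which interval policy is configured.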