
Workflow:Allenai Open instruct DPO Preference Tuning

From Leeroopedia


Knowledge Sources
Domains LLMs, Preference_Optimization, Post_Training
Last Updated 2026-02-07 00:00 GMT

Overview

End-to-end process for aligning language models with human preferences using Direct Preference Optimization (DPO) on chosen/rejected response pairs.

Description

This workflow trains a language model to prefer human-chosen responses over rejected alternatives without requiring a separate reward model. It uses the DPO algorithm (by default the length-normalized dpo_norm variant, based on SimPO) to optimize the model policy directly from preference data. The implementation caches reference model logprobs up front and then removes the reference model from memory, reducing GPU memory usage during training. Training uses Accelerate with DeepSpeed ZeRO Stage 3 for multi-node distribution.

The primary entry point is dpo_tune_cache.py for the Accelerate/DeepSpeed backend. An alternative dpo.py implementation uses the OLMo-core backend for supported models.

Usage

Execute this workflow when you have an SFT-trained model and a preference dataset containing chosen/rejected response pairs. This is typically the second stage of the Tulu post-training pipeline, taking an SFT checkpoint as input and producing a DPO-aligned model that feeds into the RLVR stage.

Execution Steps

Step 1: Environment_Setup

Prepare the training environment; the setup is identical to the SFT workflow. Ensure Accelerate, DeepSpeed, and all dependencies are available. For Beaker-based runs, build and register a Docker image from the current commit.

Key considerations:

  • Same infrastructure requirements as SFT training
  • DPO typically uses fewer nodes than SFT (e.g., 4 nodes vs 8 for 8B models)
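Before launching a run, it can help to confirm that the core training dependencies are importable. The sketch below is illustrative and not part of the open-instruct codebase; the package list is an assumption based on the stack described above.

```python
import importlib.util

def check_dpo_environment(required=("torch", "accelerate", "deepspeed", "transformers")):
    """Report which training dependencies are importable in this environment.

    The default package tuple reflects the Accelerate/DeepSpeed stack this
    workflow assumes; adjust it to match your actual requirements file.
    """
    return {name: importlib.util.find_spec(name) is not None for name in required}
```

Running this on the training node before submitting a multi-node job catches missing packages early, when they are cheap to fix.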

Step 2: Preference_Data_Loading

Load and prepare the preference dataset containing chosen and rejected response pairs. The dataset mixer supports combining multiple preference sources with specified proportions. Data is tokenized and filtered by maximum sequence length.

Key considerations:

  • Preference data must contain paired chosen/rejected responses
  • The mixer_list argument specifies datasets and proportions
  • Sequence length is typically shorter than SFT (2048 vs 4096 tokens)
  • Both the Accelerate backend (dpo_tune_cache.py) and OLMo-core backend (dpo.py) accept the same data format
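The mixing and length-filtering logic described above can be sketched as follows. This is a simplified stand-in, not the open-instruct dataset mixer: the `sources` mapping, example schema, and `tokenize` callback are illustrative assumptions.

```python
import random

def mix_preference_datasets(sources, seed=0):
    """Combine preference datasets according to per-source proportions.

    `sources` maps a dataset name to a (examples, fraction) pair, where each
    example is a dict with 'chosen' and 'rejected' responses. This mirrors the
    idea of the mixer_list argument, not its exact schema.
    """
    rng = random.Random(seed)
    mixed = []
    for name, (examples, fraction) in sources.items():
        k = int(len(examples) * fraction)       # take the requested proportion
        mixed.extend(rng.sample(examples, k))
    rng.shuffle(mixed)
    return mixed

def filter_by_length(examples, tokenize, max_seq_length=2048):
    """Drop pairs where either response exceeds the token budget."""
    keep = []
    for ex in examples:
        if (len(tokenize(ex["chosen"])) <= max_seq_length
                and len(tokenize(ex["rejected"])) <= max_seq_length):
            keep.append(ex)
    return keep
```

Filtering both responses jointly keeps each chosen/rejected pair intact, which the DPO loss requires.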

Step 3: Reference_Logprob_Caching

Compute and cache the log probabilities from the reference model (the initial SFT checkpoint) on the entire preference dataset. After caching, the reference model is removed from GPU memory to free up resources for training. This is a key memory optimization unique to this DPO implementation.

Key considerations:

  • Training output is delayed at startup while reference logprobs are computed and cached
  • The reference model is the same as the model being trained (the SFT checkpoint)
  • Caching avoids keeping two full model copies in memory simultaneously
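The caching step can be sketched as a single pass over the dataset followed by releasing the reference model. The `ref_model` and `free_model` interfaces here are hypothetical placeholders for the real forward pass and GPU-memory release (e.g. `del` plus `torch.cuda.empty_cache()`).

```python
def cache_reference_logprobs(ref_model, dataset, free_model):
    """Compute reference log-probs once, then release the reference model.

    `ref_model(example)` is assumed to return a (chosen_logprob,
    rejected_logprob) pair; `free_model` releases the model's GPU memory.
    Both are stand-ins for the actual open-instruct implementation.
    """
    cache = {}
    for idx, example in enumerate(dataset):
        cache[idx] = ref_model(example)  # one forward pass per pair
    free_model(ref_model)  # training then proceeds with only one model resident
    return cache
```

Because the cached values are plain scalars per pair, the memory cost of the cache is negligible compared to holding a second full model in GPU memory.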

Step 4: DPO_Training

Train the model using the DPO loss function, which maximizes the margin between chosen and rejected response probabilities relative to the cached reference logprobs. The default loss type is dpo_norm (length-normalized DPO, based on SimPO). Training uses gradient checkpointing and Accelerate with DeepSpeed for memory efficiency.

Key considerations:

  • The dpo_beta parameter controls how strongly the policy is penalized for deviating from the reference policy (default: 5)
  • Loss types include standard dpo, length-normalized dpo_norm, and SimPO variants
  • Training metrics include chosen/rejected rewards, accuracy, and reward margin
  • Gradient checkpointing is recommended for memory efficiency
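For a single preference pair, the DPO loss described above can be written down directly. This is a minimal sketch of the math, not the repo's batched implementation; in particular, the length-normalization path here is an assumption about how the dpo_norm variant averages log-probs per token before taking the margin.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp,
             beta=5.0, chosen_len=1, rejected_len=1, length_normalize=True):
    """Pairwise DPO loss for one chosen/rejected pair (illustrative sketch).

    With length_normalize=True, sequence log-probs are averaged per token
    before the margin is computed, in the spirit of the SimPO-based dpo_norm
    variant; with False, this is standard DPO.
    """
    if length_normalize:
        policy_chosen_lp /= chosen_len
        policy_rejected_lp /= rejected_len
        ref_chosen_lp /= chosen_len
        ref_rejected_lp /= rejected_len
    # Implicit rewards: scaled log-ratio of policy vs cached reference logprobs
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): shrinks as the chosen response is preferred more
    return math.log(1.0 + math.exp(-margin))
```

The `chosen_reward`, `rejected_reward`, and `margin` values are exactly the quantities reported as training metrics; accuracy is the fraction of pairs with a positive margin.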

Step 5: Checkpoint_Saving

Save the DPO-aligned model checkpoint at specified intervals and at training completion. The checkpoint can be uploaded to HuggingFace Hub and is directly usable as the starting point for the RLVR stage.

Key considerations:

  • The saved model is compatible with downstream GRPO/RLVR training
  • Checkpoint saving interval can be set per epoch or per fixed steps
  • The DPO model is also usable as a standalone instruction-following model
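The two saving schedules mentioned above (per epoch or per fixed number of steps) reduce to a simple predicate. This is illustrative scheduling logic, not the open-instruct checkpointing code; the parameter names are assumptions.

```python
def should_save(step, total_steps, save_every_steps=None, steps_per_epoch=None):
    """Decide whether to write a checkpoint at this training step.

    Supports fixed-step intervals, per-epoch saving, or both; the final step
    always saves so the completed model is never lost.
    """
    if step == total_steps:
        return True  # always save at training completion
    if save_every_steps and step % save_every_steps == 0:
        return True  # fixed-interval checkpointing
    if steps_per_epoch and step % steps_per_epoch == 0:
        return True  # per-epoch checkpointing
    return False
```

Uploading the saved checkpoint to the HuggingFace Hub then makes it directly loadable as the starting policy for the RLVR stage.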

Execution Diagram

GitHub URL

Workflow Repository