Workflow:Lucidrains X transformers DPO Preference Alignment
| Knowledge Sources | |
|---|---|
| Domains | RLHF, Preference_Optimization, LLM_Alignment |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
End-to-end process for aligning a pretrained autoregressive language model with human preferences using Direct Preference Optimization (DPO) via x-transformers.
Description
This workflow covers the process of fine-tuning a pretrained x-transformers decoder model using Direct Preference Optimization (DPO), a method for aligning language models with human preferences without requiring a separate reward model. The DPO class wraps a pretrained TransformerWrapper policy model and automatically creates a frozen reference copy. Given pairs of preferred and unpreferred completions for the same prompt, DPO optimizes the policy model to increase the likelihood ratio of preferred over unpreferred responses relative to the reference model. This implements the algorithm from Rafailov et al. (2023), providing a simpler alternative to PPO-based RLHF.
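For reference, the objective from Rafailov et al. (2023) can be written as below. The symbols follow the paper's notation, not names used by x-transformers: $\pi_\theta$ is the policy, $\pi_{\mathrm{ref}}$ the frozen reference, $x$ the prompt, $y_w$ and $y_l$ the preferred and unpreferred completions, $\sigma$ the logistic sigmoid, and $\mathcal{D}$ the preference dataset.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

Regrouping the terms inside the sigmoid gives $\beta\,[(\log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x)) - (\log \pi_{\mathrm{ref}}(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x))]$, which is the "policy log ratio minus reference log ratio" form used in Step 4. In the paper the sequence log-probability is a sum over tokens, whereas Step 4 of this workflow describes a mean over completion tokens.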
Usage
Execute this workflow after you have a pretrained autoregressive language model built with x-transformers that you want to align with human preferences. You need a dataset of preference pairs: for each prompt, a preferred completion (chosen by human annotators) and an unpreferred completion. This workflow is appropriate when you want to improve the quality, safety, or helpfulness of model outputs without the complexity of training a separate reward model and running PPO.
Execution Steps
Step 1: Pretrain Base Model
Start with a pretrained TransformerWrapper decoder model. This can be trained using the Autoregressive Language Modeling workflow or loaded from a checkpoint. The model should have reasonable language modeling capabilities before preference alignment.
Key considerations:
- The base model should already generate coherent text
- The model must be a TransformerWrapper instance (required by the DPO class)
- Save a checkpoint before alignment so you can compare before/after quality
Step 2: Prepare Preference Dataset
Construct a dataset of preference pairs. Each sample consists of a prompt, a preferred completion, and an unpreferred completion. Append each completion to the prompt, tokenize the result, and pad both sequences to the same length. A prompt mask marks which tokens belong to the prompt so that they are excluded from the DPO loss.
Key considerations:
- Preferred and unpreferred sequences must have the same shape (batch, seq_len)
- The prompt_mask is a boolean tensor where True indicates prompt tokens (these are excluded from the preference loss)
- An optional pad_id can be set so that sequence masks for variable-length sequences are derived automatically
- Both sequences should start with the same prompt tokens
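The layout described above can be sketched in plain Python. This is a minimal illustration, not part of x-transformers: `build_pair`, `PAD_ID`, and the token ids are hypothetical, and the token lists stand in for the output of whatever tokenizer you use.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding id; substitute your tokenizer's value


def build_pair(
    prompt: List[int],
    preferred: List[int],
    unpreferred: List[int],
    seq_len: int,
    pad_id: int = PAD_ID,
) -> Tuple[List[int], List[int], List[bool]]:
    """Return (preferred_seq, unpreferred_seq, prompt_mask), each of length seq_len."""

    def pad(tokens: List[int]) -> List[int]:
        if len(tokens) > seq_len:
            raise ValueError("sequence longer than seq_len")
        return tokens + [pad_id] * (seq_len - len(tokens))

    # Both sequences start with the same prompt tokens.
    preferred_seq = pad(prompt + preferred)
    unpreferred_seq = pad(prompt + unpreferred)

    # True marks prompt positions, which are excluded from the loss.
    prompt_mask = [i < len(prompt) for i in range(seq_len)]
    return preferred_seq, unpreferred_seq, prompt_mask


prompt = [11, 12, 13]
pref_seq, unpref_seq, prompt_mask = build_pair(
    prompt, preferred=[21, 22, 23, 24], unpreferred=[31, 32], seq_len=8
)
```

Stacking these per-sample lists into integer and boolean tensors gives the (batch, seq_len) inputs described above. One prompt mask serves both sequences because they share the same prompt prefix. The padding id should not collide with a real token id if it is also used to derive sequence masks.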
Step 3: Initialize DPO Wrapper
Wrap the pretrained TransformerWrapper in the DPO class. This automatically creates a frozen deep copy of the model as the reference model. Configure the beta parameter that controls the strength of the KL divergence constraint against the reference model.
What happens:
- The policy model (trainable) is the original model
- A reference model (frozen, no gradients) is created as a deep copy
- The beta parameter (default 0.1) controls how much the policy can deviate from the reference
- Only the policy model's parameters are exposed for optimization
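The policy/reference split can be illustrated without any framework. The sketch below is a toy stand-in, not the x-transformers DPO class: `ToyModel` and `ToyDPOWrapper` are hypothetical names, and a plain list of floats plays the role of model weights. It only shows that the reference is an independent snapshot while the policy is the original object.

```python
from copy import deepcopy


class ToyModel:
    """Stand-in for a pretrained model: holds a single list of weights."""

    def __init__(self, weights):
        self.weights = list(weights)


class ToyDPOWrapper:
    """Illustrative only; mirrors the policy/reference split described above."""

    def __init__(self, model, beta=0.1):
        self.policy_model = model         # the original object, trained in place
        self.ref_model = deepcopy(model)  # independent snapshot, never updated
        self.beta = beta

    def parameters(self):
        # Only the policy's weights are handed to the optimizer.
        return self.policy_model.weights


model = ToyModel([0.5, -1.0])
wrapper = ToyDPOWrapper(model, beta=0.1)

# Simulate one optimizer step on the exposed parameters.
params = wrapper.parameters()
params[0] += 0.25
```

After the simulated update, `model.weights` has changed while `wrapper.ref_model.weights` still holds the original values. Because the policy is the same object that was passed in, any other handle to that model sees the aligned weights after training, which is one reason to save a checkpoint beforehand.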
Step 4: Train with DPO
Run the DPO training loop. Each step feeds preferred sequences, unpreferred sequences, and the prompt mask into the DPO wrapper. The wrapper computes log probabilities under both the policy and reference models, then optimizes the DPO loss to increase the relative likelihood of preferred completions.
What happens:
- Reference model computes log probabilities for both preferred and unpreferred sequences (no gradient)
- Policy model computes log probabilities for both sequences (with gradient)
- The DPO loss is: -log_sigmoid(beta * (policy_log_ratio - ref_log_ratio))
- Each log ratio is the mean per-token log probability of the preferred sequence minus that of the unpreferred sequence
- Prompt tokens are masked out so only completion tokens affect the loss
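The arithmetic in the list above can be worked through for a single example in plain Python. This is a sketch under stated assumptions, not the library's code: it assumes the per-token log probabilities have already been computed and aligned with the prompt mask, uses one shared mask for both sequences (no padding), and the function names and numbers are made up for illustration.

```python
import math
from typing import List


def masked_mean(values: List[float], keep: List[bool]) -> float:
    """Mean of the entries of `values` whose mask entry is True."""
    kept = [v for v, k in zip(values, keep) if k]
    return sum(kept) / max(len(kept), 1)


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_pref, policy_unpref, ref_pref, ref_unpref, prompt_mask, beta=0.1):
    """-log_sigmoid(beta * (policy_log_ratio - ref_log_ratio)) for one pair."""
    completion = [not m for m in prompt_mask]  # True = completion token
    policy_log_ratio = masked_mean(policy_pref, completion) - masked_mean(policy_unpref, completion)
    ref_log_ratio = masked_mean(ref_pref, completion) - masked_mean(ref_unpref, completion)
    return -log_sigmoid(beta * (policy_log_ratio - ref_log_ratio))


prompt_mask = [True, True, False, False]

# Per-token log probabilities under the frozen reference (illustrative values).
ref_pref = [-0.1, -0.2, -1.0, -2.0]
ref_unpref = [-0.1, -0.2, -1.0, -1.0]

# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(ref_pref, ref_unpref, ref_pref, ref_unpref, prompt_mask)

# A policy that has shifted probability toward the preferred completion.
policy_pref = [-0.1, -0.2, -0.5, -1.5]
policy_unpref = [-0.1, -0.2, -1.5, -2.5]
loss_after = dpo_loss(policy_pref, policy_unpref, ref_pref, ref_unpref, prompt_mask)
```

With identical policy and reference the loss is ln 2, and it falls below that once the policy favors the preferred completion more strongly than the reference does. Changing the log probabilities at prompt positions leaves the result unchanged, since those positions are masked out. With padded batches, each sequence would also need its own padding mask combined with the completion mask.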
Step 5: Evaluate Aligned Model
Compare the aligned policy model's generations against the original pretrained model. The policy model is accessible via dpo_wrapper.policy_model; wrap it in the AutoregressiveWrapper for generation, or call the TransformerWrapper directly to obtain logits.
Key considerations:
- The aligned model should produce outputs more consistent with the preference data
- Monitor the DPO loss during training; it starts near ln 2 ≈ 0.693, because the policy and reference are identical at initialization, and should decrease from there
- Higher beta values produce more conservative updates (closer to reference)
- Lower beta values allow the policy to deviate more from the reference
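One way to read the beta guidance is to ask how large the margin (policy log ratio minus reference log ratio) must be to reach a given loss value. The sketch below is an illustration of the loss formula only, not a statement about training dynamics in practice; the helper names and the target loss of 0.3 are arbitrary choices.

```python
import math


def dpo_loss_from_margin(margin: float, beta: float) -> float:
    """-log(sigmoid(beta * margin)), with margin = policy_log_ratio - ref_log_ratio."""
    return math.log1p(math.exp(-beta * margin))


def margin_for_loss(target_loss: float, beta: float) -> float:
    """Margin at which the loss equals target_loss (target_loss > 0)."""
    # Solve log(1 + exp(-beta * m)) = target_loss for m.
    return -math.log(math.expm1(target_loss)) / beta


# Policy equals reference: zero margin, loss is ln 2 regardless of beta.
initial_loss = dpo_loss_from_margin(0.0, beta=0.1)

# Margin needed to bring the loss down to 0.3 under two beta settings.
margin_low_beta = margin_for_loss(0.3, beta=0.1)
margin_high_beta = margin_for_loss(0.3, beta=0.5)
```

The required margin scales as 1 / beta, so with beta = 0.5 the same loss is reached with one fifth of the shift needed at beta = 0.1. Under this reading, a higher beta lets the loss saturate while the policy is still close to the reference, and a lower beta keeps rewarding larger deviations.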