Principle: Lucidrains x-transformers DPO Wrapper Setup
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Paper (Direct Preference Optimization — Rafailov et al.), Repo (x-transformers) |
| Domains | Deep_Learning, NLP, Alignment |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Initialization pattern for Direct Preference Optimization that creates a trainable policy model and a frozen reference model from a pretrained transformer.
Description
Direct Preference Optimization (DPO) requires two copies of the model: a trainable policy model and a frozen reference model. The DPO wrapper takes a pretrained TransformerWrapper, deep-copies it to create the frozen reference, and exposes only the policy model's parameters for optimization. The beta parameter controls how much the policy is allowed to deviate from the reference distribution.
Unlike RLHF with PPO, this setup requires no separate reward model: DPO reformulates the reward-maximization objective so that the policy can be optimized directly from preference pairs, without any intermediate reward function.
The initialization proceeds as follows:
- The pretrained TransformerWrapper is stored as `self.policy_model`; this is the model that will be trained.
- A deep copy of the model is created and stored as `self.ref_model`; all of its parameters are frozen (`requires_grad=False`).
- The `.parameters()` method is overridden to return only the policy model's parameters, ensuring that optimizers update only the trainable copy.
- An optional `pad_id` parameter enables automatic creation of padding masks during the forward pass.
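The steps above can be sketched as a minimal PyTorch module. This is an illustrative reconstruction of the pattern, not the actual x-transformers implementation; the class name `DPOWrapper` and the constructor signature are assumptions for the sketch, and the real wrapper also implements the DPO forward pass and padding-mask logic.

```python
import copy
from torch import nn

# Hypothetical minimal sketch of the DPO wrapper init pattern described above.
class DPOWrapper(nn.Module):
    def __init__(self, model, beta=0.1, pad_id=None):
        super().__init__()
        self.policy_model = model                  # trainable copy
        self.ref_model = copy.deepcopy(model)      # frozen reference copy
        for p in self.ref_model.parameters():
            p.requires_grad_(False)                # freeze all reference params
        self.beta = beta
        self.pad_id = pad_id

    def parameters(self, recurse=True):
        # Expose only the policy model's parameters so optimizers
        # never update the frozen reference.
        return self.policy_model.parameters(recurse=recurse)

# Usage with a stand-in model in place of a pretrained TransformerWrapper:
base = nn.Linear(4, 4)
dpo = DPOWrapper(base)
```

After construction, `dpo.parameters()` yields only trainable tensors, while every tensor in `dpo.ref_model` has `requires_grad=False`.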
Usage
Use after pretraining a language model, when you want to align it with human preferences without training a reward model. Initialize with a pretrained TransformerWrapper:
- Pretrain a base language model using standard autoregressive training.
- Wrap the pretrained model with the DPO class.
- Create an optimizer over `dpo.parameters()` (which returns only the policy model's parameters).
- Train on preference pairs using the DPO forward pass.
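A single preference-pair training step might look like the following toy sketch. It stands in a tiny model for the pretrained transformer and a hand-rolled `seq_logprob` helper for the wrapper's forward pass; these names and the policy/reference pair construction are illustrative assumptions, not the repo's API.

```python
import copy
import math
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, beta = 16, 8, 0.1

# Toy stand-in for a pretrained model; the wrapper would deep-copy and freeze it.
policy = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
ref = copy.deepcopy(policy)
for p in ref.parameters():
    p.requires_grad_(False)

def seq_logprob(model, tokens):
    # Sum of per-token log-probabilities of a completion under the model
    # (toy version: scores each token against the same positions).
    logp = model(tokens).log_softmax(dim=-1)
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).sum(dim=-1)

chosen = torch.randint(0, vocab, (2, 5))    # preferred completions y_w
rejected = torch.randint(0, vocab, (2, 5))  # unpreferred completions y_l

# Optimizer sees only the policy's parameters, mirroring dpo.parameters().
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

logits = beta * ((seq_logprob(policy, chosen) - seq_logprob(ref, chosen))
                 - (seq_logprob(policy, rejected) - seq_logprob(ref, rejected)))
loss = -F.logsigmoid(logits).mean()
loss.backward()
opt.step()
```

Because policy and reference start identical, the log-ratio difference is zero on the first step and the loss equals log 2; training then pushes the chosen log-ratio above the rejected one.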
Theoretical Basis
DPO as Reward-Free RLHF
DPO reformulates the RLHF objective to directly optimize the policy using preference pairs, without an explicit reward model. The key insight from Rafailov et al. is that the optimal policy under a KL-constrained reward maximization has a closed-form relationship to the reward function:
r(x, y) = β · log(π(y|x) / π_ref(y|x)) + β · log Z(x)
Where:
- `r(x, y)` is the implicit reward for response `y` given prompt `x`.
- `π(y|x)` is the policy model's probability of generating `y` given `x`.
- `π_ref(y|x)` is the reference model's probability.
- `β` is the temperature parameter controlling deviation from the reference.
- `Z(x)` is the partition function (which cancels out in the DPO loss).
The reference model π_ref provides the baseline distribution; the policy π is trained to increase the probability of preferred completions relative to unpreferred ones. Because the partition function cancels in the preference comparison, DPO can train directly from preference data without fitting a reward model.
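Concretely, because Z(x) is identical for both completions of the same prompt, it drops out of the pairwise comparison, leaving the DPO loss from Rafailov et al.:

L_DPO = -log σ(β · log(π(y_w|x) / π_ref(y_w|x)) - β · log(π(y_l|x) / π_ref(y_l|x)))

where y_w is the preferred and y_l the unpreferred completion. A minimal numeric sketch of this loss from sequence log-probabilities (the argument names are illustrative, not the repo's):

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log π/π_ref for y_w
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log π/π_ref for y_l
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log σ(logits), written stably as log(1 + exp(-logits))
    return np.logaddexp(0.0, -logits)

# When the policy still matches the reference, both ratios are zero
# and the loss is log 2 (a coin-flip preference).
loss = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```

Note that no reward values appear anywhere: only log-probabilities under the two models are needed, which is exactly why the wrapper keeps a frozen reference copy alongside the trainable policy.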