Principle: Lucidrains x-transformers DPO Wrapper Setup
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Paper (Direct Preference Optimization — Rafailov et al.), Repo (x-transformers) |
| Domains | Deep_Learning, NLP, Alignment |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Initialization pattern for Direct Preference Optimization that creates a trainable policy model and a frozen reference model from a pretrained transformer.
Description
Direct Preference Optimization (DPO) requires two copies of the model: a trainable policy model and a frozen reference model. The DPO wrapper takes a pretrained TransformerWrapper, deep-copies it to create the frozen reference, and exposes only the policy model's parameters for optimization. The beta parameter controls how much the policy is allowed to deviate from the reference distribution.
Unlike RLHF with PPO, this setup requires no separate reward model: DPO reformulates the reward-maximization objective so that the policy can be optimized directly from preference pairs, without any intermediate reward function.
The initialization proceeds as follows:
- The pretrained TransformerWrapper is stored as `self.policy_model`; this is the model that will be trained.
- A deep copy of the model is created and stored as `self.ref_model`; all of its parameters are frozen (`requires_grad=False`).
- The `.parameters()` method is overridden to return only the policy model's parameters, ensuring that optimizers update only the trainable copy.
- An optional `pad_id` parameter enables automatic creation of padding masks during the forward pass.
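The steps above can be sketched as a minimal PyTorch module. This is an illustrative reconstruction of the pattern, not the actual x-transformers implementation; the class name `DPOWrapper` and the constructor signature are assumptions for the sketch, and the real wrapper also implements the DPO forward pass and padding-mask logic.

```python
import copy
from torch import nn

# Hypothetical minimal sketch of the DPO wrapper init pattern described above.
class DPOWrapper(nn.Module):
    def __init__(self, model, beta=0.1, pad_id=None):
        super().__init__()
        self.policy_model = model                  # trainable copy
        self.ref_model = copy.deepcopy(model)      # frozen reference copy
        for p in self.ref_model.parameters():
            p.requires_grad_(False)                # freeze all reference params
        self.beta = beta
        self.pad_id = pad_id

    def parameters(self, recurse=True):
        # Expose only the policy model's parameters so optimizers
        # never update the frozen reference.
        return self.policy_model.parameters(recurse=recurse)

# Usage with a stand-in model in place of a pretrained TransformerWrapper:
base = nn.Linear(4, 4)
dpo = DPOWrapper(base)
```

After construction, `dpo.parameters()` yields only trainable tensors, while every tensor in `dpo.ref_model` has `requires_grad=False`.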
Usage
Use after pretraining a language model, when you want to align it with human preferences without training a reward model. Initialize with a pretrained TransformerWrapper:
- Pretrain a base language model using standard autoregressive training.
- Wrap the pretrained model with the DPO class.
- Create an optimizer over `dpo.parameters()` (which returns only the policy model's parameters).
- Train on preference pairs using the DPO forward pass.
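A single preference-pair training step might look like the following toy sketch. It stands in a tiny model for the pretrained transformer and a hand-rolled `seq_logprob` helper for the wrapper's forward pass; these names and the policy/reference pair construction are illustrative assumptions, not the repo's API.

```python
import copy
import math
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, beta = 16, 8, 0.1

# Toy stand-in for a pretrained model; the wrapper would deep-copy and freeze it.
policy = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
ref = copy.deepcopy(policy)
for p in ref.parameters():
    p.requires_grad_(False)

def seq_logprob(model, tokens):
    # Sum of per-token log-probabilities of a completion under the model
    # (toy version: scores each token against the same positions).
    logp = model(tokens).log_softmax(dim=-1)
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).sum(dim=-1)

chosen = torch.randint(0, vocab, (2, 5))    # preferred completions y_w
rejected = torch.randint(0, vocab, (2, 5))  # unpreferred completions y_l

# Optimizer sees only the policy's parameters, mirroring dpo.parameters().
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

logits = beta * ((seq_logprob(policy, chosen) - seq_logprob(ref, chosen))
                 - (seq_logprob(policy, rejected) - seq_logprob(ref, rejected)))
loss = -F.logsigmoid(logits).mean()
loss.backward()
opt.step()
```

Because policy and reference start identical, the log-ratio difference is zero on the first step and the loss equals log 2; training then pushes the chosen log-ratio above the rejected one.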
Theoretical Basis
DPO as Reward-Free RLHF
DPO reformulates the RLHF objective to directly optimize the policy using preference pairs, without an explicit reward model. The key insight from Rafailov et al. is that the optimal policy under a KL-constrained reward maximization has a closed-form relationship to the reward function:
r(x, y) = β · log(π(y|x) / π_ref(y|x)) + β · log Z(x)
Where:
- `r(x, y)` is the implicit reward for response `y` given prompt `x`.
- `π(y|x)` is the policy model's probability of generating `y` given `x`.
- `π_ref(y|x)` is the reference model's probability.
- `β` is the temperature parameter controlling deviation from the reference.
- `Z(x)` is the partition function (which cancels out in the DPO loss).
The reference model π_ref provides the baseline distribution; the policy π is trained to increase the probability of preferred completions relative to unpreferred ones. Because the partition function cancels in the preference comparison, DPO can train directly from preference data without fitting a reward model.
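Concretely, because Z(x) is identical for both completions of the same prompt, it drops out of the pairwise comparison, leaving the DPO loss from Rafailov et al.:

L_DPO = -log σ(β · log(π(y_w|x) / π_ref(y_w|x)) - β · log(π(y_l|x) / π_ref(y_l|x)))

where y_w is the preferred and y_l the unpreferred completion. A minimal numeric sketch of this loss from sequence log-probabilities (the argument names are illustrative, not the repo's):

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log π/π_ref for y_w
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log π/π_ref for y_l
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log σ(logits), written stably as log(1 + exp(-logits))
    return np.logaddexp(0.0, -logits)

# When the policy still matches the reference, both ratios are zero
# and the loss is log 2 (a coin-flip preference).
loss = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```

Note that no reward values appear anywhere: only log-probabilities under the two models are needed, which is exactly why the wrapper keeps a frozen reference copy alongside the trainable policy.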