Workflow:Lucidrains X transformers DPO Preference Alignment

From Leeroopedia


Knowledge Sources
Domains: RLHF, Preference_Optimization, LLM_Alignment
Last Updated: 2026-02-08 18:00 GMT

Overview

End-to-end process for aligning a pretrained autoregressive language model with human preferences using Direct Preference Optimization (DPO) via x-transformers.

Description

This workflow fine-tunes a pretrained x-transformers decoder model with Direct Preference Optimization (DPO), a method for aligning language models with human preferences without training a separate reward model. The DPO class wraps a pretrained TransformerWrapper policy model and automatically creates a frozen reference copy of it. Given pairs of preferred and unpreferred completions for the same prompt, DPO trains the policy to raise the likelihood of the preferred completion relative to the unpreferred one, measured against the same ratio under the reference model. It implements the algorithm from Rafailov et al. (2023) and is a simpler alternative to PPO-based RLHF.
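
For reference, the objective from Rafailov et al. (2023) can be written as below. The symbols follow the paper's notation and are not identifiers from x-transformers: x is the prompt, y_w the preferred completion, y_l the unpreferred completion, \pi_\theta the policy, \pi_ref the frozen reference, \sigma the logistic sigmoid, and \beta the same quantity as the wrapper's beta parameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

Regrouping the terms inside the sigmoid gives \beta times the policy's log ratio (preferred minus unpreferred) minus the reference's log ratio, which is the form of the loss listed under Step 4. Note that the paper uses whole-sequence log probabilities, whereas Step 4 describes mean log probabilities.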

Usage

Use this workflow once you have a pretrained autoregressive language model, built with x-transformers, that you want to align with human preferences. You need a dataset of preference pairs: for each prompt, a preferred completion (chosen by human annotators) and an unpreferred completion. The workflow is appropriate when you want to improve the quality, safety, or helpfulness of model outputs without the complexity of training a separate reward model and running PPO.

Execution Steps

Step 1: Pretrain Base Model

Start with a pretrained TransformerWrapper decoder model. This can be trained using the Autoregressive Language Modeling workflow or loaded from a checkpoint. The model should have reasonable language modeling capabilities before preference alignment.

Key considerations:

  • The base model should already generate coherent text
  • The model must be a TransformerWrapper instance (required by the DPO class)
  • Save a checkpoint before alignment so you can compare before/after quality

Step 2: Prepare Preference Dataset

Construct a dataset of preference pairs. Each sample consists of a prompt, a preferred completion, and an unpreferred completion. Append each completion to the prompt, tokenize both resulting sequences, and bring them to the same sequence length. Alongside them, provide a prompt mask that marks which tokens belong to the prompt; those tokens are excluded from the DPO loss.

Key considerations:

  • Preferred and unpreferred sequences must have the same shape (batch, seq_len)
  • The prompt_mask is a boolean tensor where True indicates prompt tokens (these are excluded from the preference loss)
  • An optional pad_id can be set so that sequence masks are derived automatically for padded, variable-length sequences
  • Both sequences should start with the same prompt tokens
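
The considerations above can be sketched with plain Python lists. This is a minimal illustration, not code from x-transformers: the helper name build_preference_pair, the token ids, and the choice of 0 as the padding id are all hypothetical.

```python
def build_preference_pair(prompt_ids, chosen_ids, rejected_ids, pad_id=0):
    """Return (preferred_seq, unpreferred_seq, prompt_mask) as equal-length lists."""
    # Both sequences start with the same prompt tokens.
    preferred = prompt_ids + chosen_ids
    unpreferred = prompt_ids + rejected_ids

    # Right-pad the shorter sequence so both have the same length.
    seq_len = max(len(preferred), len(unpreferred))
    preferred = preferred + [pad_id] * (seq_len - len(preferred))
    unpreferred = unpreferred + [pad_id] * (seq_len - len(unpreferred))

    # True marks prompt positions, which are excluded from the loss.
    prompt_mask = [i < len(prompt_ids) for i in range(seq_len)]
    return preferred, unpreferred, prompt_mask


# Hypothetical token ids; 0 is assumed to be the padding id.
preferred_seq, unpreferred_seq, prompt_mask = build_preference_pair(
    prompt_ids=[5, 6, 7],
    chosen_ids=[11, 12, 13, 14],
    rejected_ids=[21, 22],
)
```

In a real pipeline these lists would be converted to tensors (integer ids, boolean mask) and stacked into batches of shape (batch, seq_len). The padding id must not collide with a real token id if masks are to be derived from it.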

Step 3: Initialize DPO Wrapper

Wrap the pretrained TransformerWrapper in the DPO class. This automatically creates a frozen deep copy of the model as the reference model. Configure the beta parameter that controls the strength of the KL divergence constraint against the reference model.

What happens:

  • The policy model (trainable) is the original model
  • A reference model (frozen, no gradients) is created as a deep copy
  • The beta parameter (default 0.1) controls how much the policy can deviate from the reference
  • Only the policy model's parameters are exposed for optimization
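
The relationship between the policy and the reference can be illustrated with a toy stand-in. The classes below are hypothetical and dependency-free; they mirror only the structure described above (policy is the original object, reference is a deep copy that is never updated) and are not the library's implementation.

```python
from copy import deepcopy


class ToyModel:
    """Hypothetical stand-in for a pretrained model: just a dict of weights."""

    def __init__(self, weights):
        self.weights = dict(weights)


class ToyDPOWrapper:
    """Illustrative only; mirrors the structure described above."""

    def __init__(self, model, beta=0.1):
        self.policy_model = model          # trainable: the original object
        self.ref_model = deepcopy(model)   # frozen: never updated after this point
        self.beta = beta

    def parameters(self):
        # Only the policy's weights are handed to the optimizer.
        return self.policy_model.weights


model = ToyModel({"w": 1.0})
wrapper = ToyDPOWrapper(model, beta=0.1)

# Simulate one optimizer step on the exposed parameters.
wrapper.parameters()["w"] -= 0.25
```

After the simulated step, the object originally passed in has changed while the reference copy has not. This is why saving a checkpoint before alignment (Step 1) matters: the policy is updated in place.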

Step 4: Train with DPO

Run the DPO training loop. Each step feeds preferred sequences, unpreferred sequences, and the prompt mask into the DPO wrapper. The wrapper computes log probabilities under both the policy and reference models, then optimizes the DPO loss to increase the relative likelihood of preferred completions.

What happens:

  • Reference model computes log probabilities for both preferred and unpreferred sequences (no gradient)
  • Policy model computes log probabilities for both sequences (with gradient)
  • The DPO loss is: -log_sigmoid(beta * (policy_log_ratio - ref_log_ratio))
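  • Each log ratio is the mean log probability of the preferred sequence minus that of the unpreferred sequence, computed once under the policy (policy_log_ratio) and once under the reference (ref_log_ratio)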
  • Prompt tokens are masked out so only completion tokens affect the loss
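
The arithmetic in the list above can be traced for a single preference pair using only the standard library. This is a sketch under stated assumptions, not the wrapper's code: the function names are hypothetical, the per-token log probabilities are made up, and one shared completion mask is used for both sequences (padding is ignored for brevity).

```python
import math


def masked_mean(values, mask):
    """Mean of values at positions where mask is True."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / max(len(kept), 1)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_pref, policy_unpref, ref_pref, ref_unpref, completion_mask, beta=0.1):
    """DPO loss for one pair, given per-token log probabilities."""
    policy_log_ratio = (masked_mean(policy_pref, completion_mask)
                        - masked_mean(policy_unpref, completion_mask))
    ref_log_ratio = (masked_mean(ref_pref, completion_mask)
                     - masked_mean(ref_unpref, completion_mask))
    return -log_sigmoid(beta * (policy_log_ratio - ref_log_ratio))


# Prompt positions (True) are excluded; only completion tokens count.
prompt_mask = [True, True, False, False]
completion_mask = [not m for m in prompt_mask]

# Made-up per-token log probabilities under the frozen reference.
ref_pref = [-1.0, -1.0, -2.0, -2.0]
ref_unpref = [-1.0, -1.0, -2.0, -2.0]

# Before training the policy equals the reference, so the margin is zero.
loss_init = dpo_loss(ref_pref, ref_unpref, ref_pref, ref_unpref, completion_mask)

# A policy that now favors the preferred completion.
policy_pref = [-9.0, -9.0, -1.0, -1.0]    # prompt values are ignored by the mask
policy_unpref = [-1.0, -1.0, -3.0, -3.0]
loss_trained = dpo_loss(policy_pref, policy_unpref, ref_pref, ref_unpref, completion_mask)
```

With a zero margin the loss equals ln 2; once the policy assigns higher mean log probability to the preferred completion than the reference does, the loss drops below that value.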

Step 5: Evaluate Aligned Model

Compare the aligned policy model's generations against the original pretrained model. Use the policy model (accessible via dpo_wrapper.policy_model) for generation with the AutoregressiveWrapper or directly via the TransformerWrapper.

Key considerations:

  • The aligned model should produce outputs more consistent with the preference data
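  • Monitor the DPO loss during training; it starts near ln 2 ≈ 0.693, because the policy and reference are initially identical, and should trend downward from there
  • Higher beta values constrain the policy more strongly, keeping it closer to the reference (more conservative updates)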
  • Lower beta values allow the policy to deviate more from the reference
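
One way to read the two beta bullets is through the loss formula from Step 4: beta scales the margin (policy_log_ratio - ref_log_ratio) inside the sigmoid, so with a larger beta the loss reaches a given value with a smaller departure from the reference. The sketch below works this out numerically; the function names and the target loss of 0.1 are illustrative assumptions.

```python
import math


def dpo_loss_from_margin(margin, beta):
    """-log(sigmoid(beta * margin)), where margin = policy_log_ratio - ref_log_ratio."""
    return math.log1p(math.exp(-beta * margin))


def margin_for_target_loss(target_loss, beta):
    """Margin the policy must open up over the reference to reach target_loss."""
    # Solve log(1 + exp(-beta * margin)) = target_loss for margin.
    return -math.log(math.expm1(target_loss)) / beta


for beta in (0.05, 0.1, 0.5):
    needed = margin_for_target_loss(0.1, beta)
    print(f"beta={beta}: margin needed for loss 0.1 = {needed:.2f}")
```

The required margin is inversely proportional to beta, so a run with a small beta has to move the policy much further from the reference before the loss flattens out.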
