Principle:Lucidrains X transformers Preference Data Preparation

Field	Value
Paper	Direct Preference Optimization
Repo	x-transformers
Domains	Data_Engineering, NLP, Alignment
Last Updated	2026-02-08 18:00 GMT

Overview

A data preparation pattern for building preference-pair datasets, together with prompt masks, in the form expected by DPO alignment training.

Description

DPO (Direct Preference Optimization) training requires triplets of (preferred_seq, unpreferred_seq, prompt_mask). Both sequences must have the same shape. The prompt_mask is a boolean tensor where True indicates prompt tokens that are excluded from the DPO loss computation.

The key requirements of this pattern are:

Preferred sequence: A complete sequence (prompt + preferred completion) as integer token IDs.
Unpreferred sequence: A complete sequence (prompt + unpreferred completion) as integer token IDs, with the same shape as the preferred sequence.
Prompt mask: A boolean tensor where True marks prompt positions. The DPO loss is computed only on completion tokens (where prompt_mask is False).
Both preferred and unpreferred sequences share the same prompt but differ in the completion portion.

No reference training script exists in the repository; the interface is derived from DPO.forward() at dpo.py:L71-117. The forward method asserts preferred_seq.shape == unpreferred_seq.shape.

Usage

Use this pattern when preparing preference data for DPO alignment training. Specifically:

Collect human preference pairs: for each prompt, obtain a preferred and an unpreferred completion.
Tokenize both completions and concatenate each with the prompt.
Pad or truncate both sequences to the same length.
Create a boolean prompt mask indicating which positions correspond to the shared prompt.
Yield (preferred, unpreferred, prompt_mask) tuples from a Dataset or generator.
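
The steps above can be sketched as follows. This is a minimal illustration, not code from the repository: the helper names `pad_or_truncate` and `build_preference_example`, the `max_len` parameter, and the choice of `pad_id=0` are all assumptions made for the example, and plain Python lists stand in for tensors so that the sketch has no dependencies. Tokenization is assumed to have happened already.

```python
from typing import List, Tuple


def pad_or_truncate(ids: List[int], length: int, pad_id: int) -> List[int]:
    """Right-pad with pad_id, or truncate, so the result has exactly `length` items."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))


def build_preference_example(
    prompt_ids: List[int],
    preferred_ids: List[int],
    unpreferred_ids: List[int],
    max_len: int,
    pad_id: int = 0,  # assumed padding ID; it must not collide with a real token ID
) -> Tuple[List[int], List[int], List[bool]]:
    """Return (preferred, unpreferred, prompt_mask), all of length max_len."""
    # Concatenate the shared prompt with each completion, then equalize lengths.
    preferred = pad_or_truncate(prompt_ids + preferred_ids, max_len, pad_id)
    unpreferred = pad_or_truncate(prompt_ids + unpreferred_ids, max_len, pad_id)

    # True marks prompt positions; completion and padding positions are False.
    prompt_len = min(len(prompt_ids), max_len)
    prompt_mask = [i < prompt_len for i in range(max_len)]
    return preferred, unpreferred, prompt_mask


preferred, unpreferred, prompt_mask = build_preference_example(
    prompt_ids=[5, 6, 7], preferred_ids=[8, 9], unpreferred_ids=[10], max_len=6
)
print(preferred)    # [5, 6, 7, 8, 9, 0]
print(unpreferred)  # [5, 6, 7, 10, 0, 0]
print(prompt_mask)  # [True, True, True, False, False, False]
```

Because the prompt is identical in both sequences, a single mask serves the pair. Note that if `max_len` is not larger than the prompt length, truncation removes the whole completion and no position is left for the loss, so such examples are probably best filtered out before training.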

Theoretical Basis

DPO (Rafailov et al., 2023) optimizes a policy model to prefer human-preferred completions over unpreferred ones, without needing an explicit reward model. The loss function compares log-probability ratios between the policy and a frozen reference model:

loss = -log_sigmoid(beta * ((pi_preferred - pi_unpreferred) - (ref_preferred - ref_unpreferred)))

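where pi and ref denote the policy and reference model log-probabilities of the corresponding sequence. The beta parameter scales the difference of log-ratios and controls the strength of the implicit KL divergence constraint: a larger beta keeps the policy closer to the reference model.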
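
As a worked instance of the formula, the sketch below evaluates the loss for a single preference pair using scalar sequence-level log-probabilities. The function names and the default `beta=0.1` are assumptions for illustration; how per-token log-probabilities are aggregated into these scalars is outside the scope of this sketch.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    pi_preferred: float,
    pi_unpreferred: float,
    ref_preferred: float,
    ref_unpreferred: float,
    beta: float = 0.1,  # assumed value, chosen only for the example
) -> float:
    """DPO loss for one pair, given sequence-level log-probabilities."""
    pi_logratio = pi_preferred - pi_unpreferred
    ref_logratio = ref_preferred - ref_unpreferred
    return -log_sigmoid(beta * (pi_logratio - ref_logratio))


# Policy identical to the reference: the argument is 0, so the loss is log(2).
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Policy separates the pair more strongly than the reference: the loss drops.
improved = dpo_loss(-1.0, -3.0, -1.0, -2.0)
print(baseline > improved)  # True
```

When the policy and the reference agree, the loss equals log(2), roughly 0.693, regardless of beta. It decreases as the policy widens the margin between preferred and unpreferred completions relative to the reference.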

The prompt mask is critical because the DPO loss should be computed only on the completion tokens: the prompt is shared between both sequences and provides no preference signal. In the x-transformers implementation, the prompt mask is combined with optional padding masks via maybe_and_mask(~prompt_mask, seq_mask) to exclude both prompt and padding positions from the loss.
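
The effect of that combination can be illustrated with a plain-Python analogue. This is not the library's `maybe_and_mask`; the function name `loss_positions` is hypothetical, and the sketch assumes that `seq_mask` uses True for real (non-padding) tokens and may be absent.

```python
from typing import List, Optional


def loss_positions(
    prompt_mask: List[bool],
    seq_mask: Optional[List[bool]] = None,
) -> List[bool]:
    """Return True where a position contributes to the loss.

    prompt_mask: True at prompt positions (excluded from the loss).
    seq_mask:    assumed True at real tokens, False at padding; optional.
    """
    # Invert the prompt mask so that True marks non-prompt positions.
    completion = [not is_prompt for is_prompt in prompt_mask]
    if seq_mask is None:
        return completion
    # Logical AND: keep positions that are both non-prompt and non-padding.
    return [c and s for c, s in zip(completion, seq_mask)]


prompt_mask = [True, True, True, False, False, False]
seq_mask = [True, True, True, True, False, False]  # last two positions are padding
print(loss_positions(prompt_mask, seq_mask))
# [False, False, False, True, False, False]
```

In this example only the fourth position, the single completion token, contributes to the loss. Without a padding mask, the padded positions would be treated as completion tokens.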

Related Pages

Implementation:Lucidrains_X_transformers_Preference_Dataset_Pattern
