Principle:Lucidrains X transformers Preference Data Preparation
| Field | Value |
|---|---|
| Paper | Direct Preference Optimization |
| Repo | x-transformers |
| Domains | Data_Engineering, NLP, Alignment |
| Last Updated | 2026-02-08 18:00 GMT |
Overview
Data preparation pattern for creating preference pair datasets with prompt masks suitable for DPO alignment training.
Description
DPO (Direct Preference Optimization) training requires triplets of (preferred_seq, unpreferred_seq, prompt_mask). Both sequences must have the same shape. The prompt_mask is a boolean tensor where True indicates prompt tokens that are excluded from the DPO loss computation.
The key requirements of this pattern are:
- Preferred sequence: A complete sequence (prompt + preferred completion) as integer token IDs.
- Unpreferred sequence: A complete sequence (prompt + unpreferred completion) as integer token IDs, with the same shape as the preferred sequence.
- Prompt mask: A boolean tensor where True marks prompt positions. The DPO loss is computed only on completion tokens (where prompt_mask is False).
- Both preferred and unpreferred sequences share the same prompt but differ in the completion portion.
No reference training script exists in the repository; the interface is derived from DPO.forward() at dpo.py:L71-117. The forward method asserts preferred_seq.shape == unpreferred_seq.shape.
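The requirements above can be checked before a triplet reaches the trainer. The following is a minimal sketch using plain Python lists in place of tensors. The function name check_preference_triplet is hypothetical and not part of x-transformers, and the check that prompt_mask has the same length as the sequences is an assumption based on the mask marking positions within them.

```python
def check_preference_triplet(preferred, unpreferred, prompt_mask):
    """Raise ValueError if a triplet violates the requirements described above.

    Hypothetical helper for illustration; it is not part of x-transformers.
    """
    if len(preferred) != len(unpreferred):
        raise ValueError("preferred and unpreferred must have the same length")
    if len(prompt_mask) != len(preferred):
        # Assumption: the mask marks one position per token of the sequences.
        raise ValueError("prompt_mask must have the same length as the sequences")
    if not all(isinstance(m, bool) for m in prompt_mask):
        raise ValueError("prompt_mask must contain only booleans")
    # Wherever the mask is True, both sequences should hold the same prompt token.
    for pos, is_prompt in enumerate(prompt_mask):
        if is_prompt and preferred[pos] != unpreferred[pos]:
            raise ValueError(f"prompt tokens differ at position {pos}")
    if all(prompt_mask):
        raise ValueError("no completion positions are left for the loss")


# Prompt [5, 6], preferred completion [7, 8], unpreferred completion [9, 1].
check_preference_triplet([5, 6, 7, 8], [5, 6, 9, 1], [True, True, False, False])
```

A check like this catches length mismatches at data-loading time instead of at the shape assertion inside the forward pass.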
Usage
Use this pattern when preparing preference data for DPO alignment training. Specifically:
- Collect human preference pairs: for each prompt, obtain a preferred and an unpreferred completion.
- Tokenize the prompt and both completions, then concatenate the prompt with each completion.
- Pad or truncate both sequences to the same length.
- Create a boolean prompt mask indicating which positions correspond to the shared prompt.
- Yield (preferred, unpreferred, prompt_mask) tuples from a Dataset or generator.
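The steps above can be sketched as follows, assuming tokenization has already produced lists of integer token IDs. The names build_preference_example, iter_preference_examples, PAD_ID and max_len are hypothetical, and the padding ID of 0 is only a placeholder. Plain lists are used so the sketch has no dependencies.

```python
PAD_ID = 0  # hypothetical padding id; use whatever the tokenizer and model expect


def build_preference_example(prompt_ids, preferred_ids, unpreferred_ids, max_len, pad_id=PAD_ID):
    """Return (preferred, unpreferred, prompt_mask) as equal-length lists."""
    if len(prompt_ids) >= max_len:
        raise ValueError("prompt leaves no room for a completion")

    def pad_or_truncate(ids):
        ids = ids[:max_len]
        return ids + [pad_id] * (max_len - len(ids))

    # Concatenate the shared prompt with each completion, then equalize lengths.
    preferred = pad_or_truncate(prompt_ids + preferred_ids)
    unpreferred = pad_or_truncate(prompt_ids + unpreferred_ids)
    # True marks the shared prompt positions, False everything after them.
    prompt_mask = [pos < len(prompt_ids) for pos in range(max_len)]
    return preferred, unpreferred, prompt_mask


def iter_preference_examples(records, max_len):
    """Yield one triplet per (prompt_ids, preferred_ids, unpreferred_ids) record."""
    for prompt_ids, preferred_ids, unpreferred_ids in records:
        yield build_preference_example(prompt_ids, preferred_ids, unpreferred_ids, max_len)


records = [([5, 6], [7, 8, 9], [10])]
preferred, unpreferred, prompt_mask = next(iter_preference_examples(records, max_len=6))
# preferred   == [5, 6, 7, 8, 9, 0]
# unpreferred == [5, 6, 10, 0, 0, 0]
# prompt_mask == [True, True, False, False, False, False]
```

Before training, each list would be converted to a tensor, for example with torch.tensor(preferred, dtype=torch.long) for the sequences and torch.tensor(prompt_mask, dtype=torch.bool) for the mask.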
Theoretical Basis
DPO (Rafailov et al., 2023) optimizes a policy model to prefer human-preferred completions over unpreferred ones, without needing an explicit reward model. The loss function compares log-probability ratios between the policy and a frozen reference model:
loss = -log_sigmoid(beta * ((pi_preferred - pi_unpreferred) - (ref_preferred - ref_unpreferred)))
where pi and ref denote the policy and reference model log-probabilities, respectively. The beta parameter controls the strength of the KL divergence constraint.
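As a worked illustration, the formula can be evaluated on scalar log-probabilities. This is a sketch of the arithmetic only, not the library's code: the function names are hypothetical, and both the default beta of 0.1 and the log-probability values are chosen purely for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) for a scalar."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_preferred, pi_unpreferred, ref_preferred, ref_unpreferred, beta=0.1):
    """DPO loss for one pair, given sequence log-probabilities (illustrative beta)."""
    policy_margin = pi_preferred - pi_unpreferred
    reference_margin = ref_preferred - ref_unpreferred
    return -log_sigmoid(beta * (policy_margin - reference_margin))


# When the policy equals the reference, the margins cancel and the loss is log(2).
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# A policy that widens the gap in favor of the preferred sequence gets a lower loss.
loss_improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

Because only the difference between the policy margin and the reference margin enters the loss, a policy is rewarded for separating the two completions more than the reference does, not for raising the absolute likelihood of the preferred one.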
The prompt mask is critical because the DPO loss should only be computed on the completion tokens — the prompt is shared between both sequences and provides no preference signal. In the x-transformers implementation, the prompt mask is combined with optional padding masks via maybe_and_mask(~prompt_mask, seq_mask) to exclude both prompt and padding positions from the loss.
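The effect of combining the two masks can be sketched with plain lists. The helper names below are hypothetical and only mimic the behavior described above. Reducing per-token log-probabilities with a masked mean is an assumption made for illustration, as are the padding ID and the log-probability values.

```python
PAD_ID = 0  # hypothetical padding id used only in this example


def completion_mask(prompt_mask, seq_mask=None):
    """Positions that count toward the loss: not prompt and, if given, not padding."""
    keep = [not is_prompt for is_prompt in prompt_mask]
    if seq_mask is None:
        return keep
    return [k and s for k, s in zip(keep, seq_mask)]


def masked_mean(values, mask):
    """Average only the values whose mask entry is True (assumed reduction)."""
    kept = [v for v, m in zip(values, mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    return sum(kept) / len(kept)


seq = [5, 6, 10, 0, 0, 0]
prompt_mask = [True, True, False, False, False, False]
seq_mask = [token != PAD_ID for token in seq]  # True for real tokens

mask = completion_mask(prompt_mask, seq_mask)
# mask == [False, False, True, False, False, False]

token_logprobs = [-0.5, -0.25, -2.0, -9.0, -9.0, -9.0]
sequence_logprob = masked_mean(token_logprobs, mask)
# sequence_logprob == -2.0, from the single completion token
```

Without the padding mask, the three padded positions would be counted as completion tokens and would dominate the sequence score.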