
Principle:Lucidrains X transformers Preference Data Preparation

From Leeroopedia


Field         Value
Paper         Direct Preference Optimization
Repo          x-transformers
Domains       Data_Engineering, NLP, Alignment
Last Updated  2026-02-08 18:00 GMT

Overview

Data preparation pattern for creating preference pair datasets with prompt masks suitable for DPO alignment training.

Description

DPO (Direct Preference Optimization) training requires triples of (preferred_seq, unpreferred_seq, prompt_mask). Both sequences must have the same shape, which in practice means padding or truncating them to a common length. The prompt_mask is a boolean tensor in which True marks prompt tokens; those positions are excluded from the DPO loss computation.

The key requirements of this pattern are:

  • Preferred sequence: A complete sequence (prompt + preferred completion) as integer token IDs.
  • Unpreferred sequence: A complete sequence (prompt + unpreferred completion) as integer token IDs, with the same shape as the preferred sequence.
  • Prompt mask: A boolean tensor where True marks prompt positions. The DPO loss is computed only on completion tokens (where prompt_mask is False).
  • Both preferred and unpreferred sequences share the same prompt but differ in the completion portion.
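
To make these requirements concrete, here is a minimal sketch that uses plain Python lists in place of tensors. The token IDs, the padding ID of 0, and the helper names right_pad and check_triple are illustrative assumptions; none of them come from x-transformers.

```python
# Toy token IDs; 0 is an assumed padding ID (hypothetical, not a library default).
PAD_ID = 0

prompt = [11, 12, 13]
preferred_completion = [21, 22, 23, 24]
unpreferred_completion = [31, 32]

# Both sequences are padded to the length of the longer one.
seq_len = len(prompt) + max(len(preferred_completion), len(unpreferred_completion))


def right_pad(ids, length, pad_id=PAD_ID):
    """Pad a list of token IDs on the right up to `length`."""
    return ids + [pad_id] * (length - len(ids))


preferred_seq = right_pad(prompt + preferred_completion, seq_len)
unpreferred_seq = right_pad(prompt + unpreferred_completion, seq_len)

# True on the shared prompt positions, False everywhere else.
prompt_mask = [i < len(prompt) for i in range(seq_len)]


def check_triple(preferred, unpreferred, mask):
    """Hypothetical sanity check mirroring the requirements listed above."""
    assert len(preferred) == len(unpreferred) == len(mask), "shapes must match"
    assert all(isinstance(m, bool) for m in mask), "mask must be boolean"
    # Wherever the mask is True, both sequences must hold the same prompt token.
    assert all(p == u for p, u, m in zip(preferred, unpreferred, mask) if m)
    return True


check_triple(preferred_seq, unpreferred_seq, prompt_mask)
```

In a real pipeline the three lists would be converted to integer and boolean tensors and stacked into a batch before being handed to the trainer.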

No reference training script exists in the repository; the interface is derived from DPO.forward() at dpo.py:L71-117. The forward method asserts preferred_seq.shape == unpreferred_seq.shape.

Usage

Use this pattern when preparing preference data for DPO alignment training. Specifically:

  • Collect human preference pairs: for each prompt, obtain a preferred and an unpreferred completion.
  • Tokenize the prompt and both completions, then append each completion to the prompt tokens.
  • Pad or truncate both sequences to the same length.
  • Create a boolean prompt mask indicating which positions correspond to the shared prompt.
  • Yield (preferred, unpreferred, prompt_mask) tuples from a Dataset or generator.
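
The steps above can be sketched as a small generator. This is only an illustration under stated assumptions: toy_tokenize is a stand-in whitespace tokenizer, and PAD_ID, MAX_LEN, fit_to_length, and preference_triples are hypothetical names chosen for this example. A real pipeline would use the tokenizer that matches the model and would return tensors rather than lists.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

PAD_ID = 0   # assumed padding ID
MAX_LEN = 8  # assumed fixed sequence length

Triple = Tuple[List[int], List[int], List[bool]]


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Stand-in tokenizer: one ID per whitespace-separated word (IDs start at 1)."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


def fit_to_length(ids: List[int], length: int, pad_id: int) -> List[int]:
    """Truncate on the right, then right-pad, so the result has exactly `length` items."""
    ids = ids[:length]
    return ids + [pad_id] * (length - len(ids))


def preference_triples(
    records: Iterable[Tuple[str, str, str]],
    max_len: int = MAX_LEN,
    pad_id: int = PAD_ID,
) -> Iterator[Triple]:
    """Yield (preferred, unpreferred, prompt_mask) for each (prompt, good, bad) record."""
    vocab: Dict[str, int] = {}
    for prompt, preferred, unpreferred in records:
        prompt_ids = toy_tokenize(prompt, vocab)
        preferred_seq = fit_to_length(
            prompt_ids + toy_tokenize(preferred, vocab), max_len, pad_id
        )
        unpreferred_seq = fit_to_length(
            prompt_ids + toy_tokenize(unpreferred, vocab), max_len, pad_id
        )
        # The prompt may itself have been truncated, so cap its length at max_len.
        prompt_len = min(len(prompt_ids), max_len)
        prompt_mask = [i < prompt_len for i in range(max_len)]
        yield preferred_seq, unpreferred_seq, prompt_mask


records = [("translate cat to french", "le chat", "the cat is french")]
triples = list(preference_triples(records))
```

Note that right truncation can remove the whole completion when the prompt alone reaches the length limit; filtering out or left-truncating such records is a design choice this sketch leaves open.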

Theoretical Basis

DPO (Rafailov et al., 2023) optimizes a policy model to prefer human-preferred completions over unpreferred ones, without needing an explicit reward model. The loss function compares log-probability ratios between the policy and a frozen reference model:

loss = -log_sigmoid(beta * ((pi_preferred - pi_unpreferred) - (ref_preferred - ref_unpreferred)))

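where the pi_* and ref_* terms denote the log-probabilities that the policy and the frozen reference model, respectively, assign to the corresponding sequence. The beta parameter controls the strength of the implicit KL divergence constraint: a larger beta keeps the policy closer to the reference model.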
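
As a worked illustration of the formula, the following sketch evaluates the loss for a single preference pair from four scalar sequence log-probabilities. The function names and the default beta of 0.1 are assumptions made for this example, and how per-token log-probabilities are reduced to one number per sequence is deliberately left out.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so that exp() never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    pi_preferred: float,
    pi_unpreferred: float,
    ref_preferred: float,
    ref_unpreferred: float,
    beta: float = 0.1,  # assumed value, chosen for illustration
) -> float:
    """DPO loss for one pair, given sequence-level log-probabilities."""
    policy_margin = pi_preferred - pi_unpreferred
    reference_margin = ref_preferred - ref_unpreferred
    return -log_sigmoid(beta * (policy_margin - reference_margin))


# Policy identical to the reference: the margins cancel and the loss is log(2).
loss_at_start = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Policy favors the preferred completion more than the reference does: lower loss.
loss_improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy starts as a copy of the reference model, every pair begins at a loss of log(2), and the loss falls as the policy widens its margin on the preferred completion relative to the reference.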

The prompt mask is critical because the DPO loss should be computed only on the completion tokens: the prompt is shared between both sequences and provides no preference signal. In the x-transformers implementation, the inverted prompt mask is combined with an optional padding mask via maybe_and_mask(~prompt_mask, seq_mask), so that both prompt and padding positions are excluded from the loss.
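
The mask combination can be illustrated with plain Python lists. The helper below, and_optional_masks, is a hypothetical stand-in written for this example; it assumes the library helper simply skips masks that were not supplied, and it is not the x-transformers code.

```python
from typing import List, Optional

Mask = List[bool]


def and_optional_masks(*masks: Optional[Mask]) -> Optional[Mask]:
    """Element-wise AND over the masks that are not None; None if all are missing."""
    present = [m for m in masks if m is not None]
    if not present:
        return None
    return [all(column) for column in zip(*present)]


# Three prompt tokens, two completion tokens, two padding tokens.
prompt_mask = [True, True, True, False, False, False, False]
seq_mask = [True, True, True, True, True, False, False]  # True = real token

# Equivalent of ~prompt_mask for a list of booleans.
not_prompt = [not m for m in prompt_mask]

# Positions that contribute to the loss: completion tokens that are not padding.
loss_mask = and_optional_masks(not_prompt, seq_mask)

# Without a padding mask, every non-prompt position contributes.
loss_mask_no_padding = and_optional_masks(not_prompt, None)
```

Here only positions 3 and 4 remain True in loss_mask, which are exactly the two completion tokens.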
