Principle: Allenai Open-Instruct Preference Data Processing
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Preference Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Preference data processing transforms raw human preference datasets -- consisting of chosen and rejected response pairs for a given prompt -- into tokenized, filtered tensors suitable for training DPO and reward models.
Description
In preference-based alignment methods such as Direct Preference Optimization (DPO) and Reward Model (RM) training, the training data takes the form of triplets: a prompt, a chosen (preferred) response, and a rejected (dispreferred) response. Before these triplets can be consumed by a model, they must undergo several processing stages:
Tokenization: Each chosen and rejected response is converted to token IDs using a chat template. The prompt portion is extracted from the chosen response (all messages except the final one) and tokenized separately with a generation prompt appended. This yields three sequences per example:
- input_ids_prompt -- the tokenized prompt with generation prompt marker
- input_ids_chosen -- the full tokenized chosen conversation (prompt + response)
- input_ids_rejected -- the full tokenized rejected conversation (prompt + response)
Corresponding attention masks are constructed as all-ones vectors matching each sequence length.
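The three-sequence layout above can be sketched as follows. This is a minimal, self-contained illustration: `toy_tokenize` is a hypothetical stand-in for a real tokenizer's chat-template method (e.g. Hugging Face's `apply_chat_template`), and the field names mirror those listed above.

```python
def toy_tokenize(messages, add_generation_prompt=False):
    """Toy stand-in for a chat-template tokenizer: one deterministic token id
    per word, plus a reserved marker id when a generation prompt is appended."""
    GEN_PROMPT_ID = 0
    ids = []
    for message in messages:
        ids.extend(sum(map(ord, word)) % 1000 + 1 for word in message["content"].split())
    if add_generation_prompt:
        ids.append(GEN_PROMPT_ID)
    return ids

def tokenize_preference_example(example):
    """Turn one {chosen, rejected} conversation pair into the three token
    sequences described above, plus all-ones attention masks."""
    chosen, rejected = example["chosen"], example["rejected"]
    # The prompt is every message of the chosen conversation except the final
    # (assistant) one; it is tokenized with a generation prompt appended.
    prompt = chosen[:-1]
    row = {
        "input_ids_prompt": toy_tokenize(prompt, add_generation_prompt=True),
        "input_ids_chosen": toy_tokenize(chosen),
        "input_ids_rejected": toy_tokenize(rejected),
    }
    # Attention masks are all-ones vectors matching each sequence length.
    for key in list(row):
        row[key.replace("input_ids", "attention_mask")] = [1] * len(row[key])
    return row
```

In the real pipeline the token ids come from the model's tokenizer; only the shape of the output dictionary is meant to be representative here.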
Prompt Masking: By tokenizing the prompt separately, the training pipeline can later mask out the prompt tokens from the loss computation. Only response tokens contribute to the DPO or reward model loss, ensuring the model learns to distinguish response quality rather than memorize prompts.
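Masking is typically implemented by setting prompt positions in the label sequence to an ignore index (PyTorch's cross-entropy loss skips positions labeled -100). A minimal sketch, assuming the prompt length is taken from the separately tokenized prompt:

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_prompt(input_ids, prompt_len):
    """Build labels from input ids, replacing the first prompt_len positions
    with IGNORE_INDEX so only response tokens contribute to the loss."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])
```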
Filtering: After tokenization, examples that exceed configured maximum lengths (for the prompt or full sequence) are discarded. This ensures uniform batch construction and prevents out-of-memory errors during training. The filter reports the percentage of discarded examples for monitoring data quality.
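A length filter of this kind can be sketched as below; the function name and default limits are illustrative, not the pipeline's actual configuration.

```python
def filter_by_length(rows, max_prompt_len=512, max_seq_len=2048):
    """Drop examples whose tokenized prompt or full sequence exceeds the
    configured maxima, and report the discard rate for monitoring."""
    kept = [
        row for row in rows
        if len(row["input_ids_prompt"]) <= max_prompt_len
        and len(row["input_ids_chosen"]) <= max_seq_len
        and len(row["input_ids_rejected"]) <= max_seq_len
    ]
    dropped = len(rows) - len(kept)
    pct = 100.0 * dropped / max(len(rows), 1)
    print(f"filtered {dropped}/{len(rows)} examples ({pct:.1f}%)")
    return kept
```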
Usage
Use preference data processing whenever preparing datasets for:
- DPO training: The primary consumer, requiring chosen/rejected pairs with prompt masks.
- Reward model training: Consumes the same chosen/rejected format; the model is trained to score chosen above rejected responses.
- Offline preference evaluation: Pre-tokenized data enables consistent evaluation across experiments.
Theoretical Basis
Preference data encodes a Bradley-Terry model of human preferences:

$$P(y_w \succ y_l \mid x) = \sigma\bigl(r^*(x, y_w) - r^*(x, y_l)\bigr)$$

where $y_w$ is the chosen response, $y_l$ is the rejected response, $x$ is the prompt, and $r^*$ is the latent reward function. Correct tokenization ensures that the model's log-probabilities are computed over the response tokens only, by masking the prompt portion (identified via separate prompt tokenization). This masking is essential because the DPO loss:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

requires the log-probabilities $\log \pi_\theta(y \mid x)$ conditioned on the prompt $x$, computed only over the response tokens of $y_w$ and $y_l$.
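Both quantities reduce to simple scalar formulas once per-example rewards or summed response-token log-probabilities are in hand. A sketch (standard Bradley-Terry and per-example DPO loss; function names are illustrative):

```python
import math

def bradley_terry_prob(r_chosen, r_rejected):
    """P(chosen beats rejected) under the Bradley-Terry model:
    sigma(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss, where each argument is a log-probability summed
    over response tokens only (prompt tokens masked out), under the policy
    (logp_*) and the frozen reference model (ref_logp_*)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that if prompt tokens were not masked, the summed log-probabilities would include identical prompt terms for both models, inflating magnitudes without changing the margin's meaning; masking keeps the loss a function of response quality alone.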