Principle: Allenai Open-Instruct Preference Data Processing
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Preference Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Preference data processing transforms raw human preference datasets -- consisting of chosen and rejected response pairs for a given prompt -- into tokenized, filtered tensors suitable for training DPO and reward models.
Description
In preference-based alignment methods such as Direct Preference Optimization (DPO) and Reward Model (RM) training, the training data takes the form of triplets: a prompt, a chosen (preferred) response, and a rejected (dispreferred) response. Before these triplets can be consumed by a model, they must undergo several processing stages:
Tokenization: Each chosen and rejected response is converted to token IDs using a chat template. The prompt portion is extracted from the chosen response (all messages except the final one) and tokenized separately with a generation prompt appended. This yields three sequences per example:
- input_ids_prompt -- the tokenized prompt with generation prompt marker
- input_ids_chosen -- the full tokenized chosen conversation (prompt + response)
- input_ids_rejected -- the full tokenized rejected conversation (prompt + response)
Corresponding attention masks are constructed as all-ones vectors matching each sequence length.
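The three-sequence layout above can be sketched as follows. This is a minimal, self-contained illustration: `toy_tokenize` is a hypothetical stand-in for a real tokenizer's chat-template method (e.g. Hugging Face's `apply_chat_template`), and the field names mirror those listed above.

```python
def toy_tokenize(messages, add_generation_prompt=False):
    """Toy stand-in for a chat-template tokenizer: one deterministic token id
    per word, plus a reserved marker id when a generation prompt is appended."""
    GEN_PROMPT_ID = 0
    ids = []
    for message in messages:
        ids.extend(sum(map(ord, word)) % 1000 + 1 for word in message["content"].split())
    if add_generation_prompt:
        ids.append(GEN_PROMPT_ID)
    return ids

def tokenize_preference_example(example):
    """Turn one {chosen, rejected} conversation pair into the three token
    sequences described above, plus all-ones attention masks."""
    chosen, rejected = example["chosen"], example["rejected"]
    # The prompt is every message of the chosen conversation except the final
    # (assistant) one; it is tokenized with a generation prompt appended.
    prompt = chosen[:-1]
    row = {
        "input_ids_prompt": toy_tokenize(prompt, add_generation_prompt=True),
        "input_ids_chosen": toy_tokenize(chosen),
        "input_ids_rejected": toy_tokenize(rejected),
    }
    # Attention masks are all-ones vectors matching each sequence length.
    for key in list(row):
        row[key.replace("input_ids", "attention_mask")] = [1] * len(row[key])
    return row
```

In the real pipeline the token ids come from the model's tokenizer; only the shape of the output dictionary is meant to be representative here.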
Prompt Masking: By tokenizing the prompt separately, the training pipeline can later mask out the prompt tokens from the loss computation. Only response tokens contribute to the DPO or reward model loss, ensuring the model learns to distinguish response quality rather than memorize prompts.
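Masking is typically implemented by setting prompt positions in the label sequence to an ignore index (PyTorch's cross-entropy loss skips positions labeled -100). A minimal sketch, assuming the prompt length is taken from the separately tokenized prompt:

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_prompt(input_ids, prompt_len):
    """Build labels from input ids, replacing the first prompt_len positions
    with IGNORE_INDEX so only response tokens contribute to the loss."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])
```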
Filtering: After tokenization, examples that exceed configured maximum lengths (for the prompt or full sequence) are discarded. This ensures uniform batch construction and prevents out-of-memory errors during training. The filter reports the percentage of discarded examples for monitoring data quality.
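A length filter of this kind can be sketched as below; the function name and default limits are illustrative, not the pipeline's actual configuration.

```python
def filter_by_length(rows, max_prompt_len=512, max_seq_len=2048):
    """Drop examples whose tokenized prompt or full sequence exceeds the
    configured maxima, and report the discard rate for monitoring."""
    kept = [
        row for row in rows
        if len(row["input_ids_prompt"]) <= max_prompt_len
        and len(row["input_ids_chosen"]) <= max_seq_len
        and len(row["input_ids_rejected"]) <= max_seq_len
    ]
    dropped = len(rows) - len(kept)
    pct = 100.0 * dropped / max(len(rows), 1)
    print(f"filtered {dropped}/{len(rows)} examples ({pct:.1f}%)")
    return kept
```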
Usage
Use preference data processing whenever preparing datasets for:
- DPO training: The primary consumer, requiring chosen/rejected pairs with prompt masks.
- Reward model training: Consumes the same chosen/rejected format; the model is trained to score chosen above rejected responses.
- Offline preference evaluation: Pre-tokenized data enables consistent evaluation across experiments.
Theoretical Basis
Preference data encodes a Bradley-Terry model of human preferences:

$$P(y_w \succ y_l \mid x) = \sigma\bigl(r^*(x, y_w) - r^*(x, y_l)\bigr)$$

where $y_w$ is the chosen response, $y_l$ is the rejected response, $x$ is the prompt, and $r^*$ is the latent reward function. Correct tokenization ensures that the model's log-probabilities are computed over the response tokens only, by masking the prompt portion (identified via separate prompt tokenization). This masking is essential because the DPO loss:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

requires the log-probabilities $\log \pi_\theta(y \mid x)$ conditioned on the prompt $x$, computed only over the response tokens of $y_w$ and $y_l$.
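Both quantities reduce to simple scalar formulas once per-example rewards or summed response-token log-probabilities are in hand. A sketch (standard Bradley-Terry and per-example DPO loss; function names are illustrative):

```python
import math

def bradley_terry_prob(r_chosen, r_rejected):
    """P(chosen beats rejected) under the Bradley-Terry model:
    sigma(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss, where each argument is a log-probability summed
    over response tokens only (prompt tokens masked out), under the policy
    (logp_*) and the frozen reference model (ref_logp_*)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that if prompt tokens were not masked, the summed log-probabilities would include identical prompt terms for both models, inflating magnitudes without changing the margin's meaning; masking keeps the loss a function of response quality alone.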