
Principle:Allenai Open instruct Preference Collation

From Leeroopedia


Knowledge Sources
Domains Reinforcement Learning from Human Feedback, Reward Modeling, Data Processing
Last Updated 2026-02-07 00:00 GMT

Overview

Preference collation is the process of batching and padding variable-length chosen/rejected token sequence pairs into uniform-size tensors suitable for efficient batch processing by a reward model during training and evaluation.

Description

Preference datasets for reward model training consist of pairs of tokenized sequences: a "chosen" completion and a "rejected" completion for each prompt. These sequences typically have different lengths both within a pair (the chosen response may be longer or shorter than the rejected response) and across different examples in the dataset.

To process these sequences efficiently in mini-batches on GPUs, they must be collated into uniform-size tensors. Preference collation addresses this by:

  1. Finding the maximum sequence length across all chosen and rejected sequences in the batch, taking the single global maximum.
  2. Padding all sequences (both chosen and rejected) to this maximum length using the tokenizer's padding token.
  3. Converting the padded sequences to PyTorch tensors.
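The three steps above can be sketched as a collate function. A minimal version might look like the following (the key names `chosen_input_ids` / `rejected_input_ids` and the `pad_token_id` argument are illustrative assumptions, not Open Instruct's exact API):

```python
import torch

def preference_collate(batch, pad_token_id):
    """Pad chosen and rejected sequences to one shared max length.

    `batch` is a list of dicts holding variable-length token-id lists
    under the (assumed) keys 'chosen_input_ids' / 'rejected_input_ids'.
    """
    # Step 1: single global maximum over both chosen and rejected lengths.
    max_len = max(
        max(len(ex["chosen_input_ids"]) for ex in batch),
        max(len(ex["rejected_input_ids"]) for ex in batch),
    )

    def pad(seq):
        # Step 2: right-pad with the tokenizer's padding token.
        return seq + [pad_token_id] * (max_len - len(seq))

    # Step 3: convert the padded sequences to (B, max_len) tensors.
    chosen = torch.tensor([pad(ex["chosen_input_ids"]) for ex in batch])
    rejected = torch.tensor([pad(ex["rejected_input_ids"]) for ex in batch])
    return {"chosen_input_ids": chosen, "rejected_input_ids": rejected}
```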

A critical design decision is that both chosen and rejected sequences are padded to the same maximum length. This is important because in the training loop, chosen and rejected sequences are concatenated along the batch dimension before being passed through the reward model. If they had different padding lengths, the concatenation would fail or require additional handling.

Padding Direction

The standard approach in Open Instruct is to use right-padding (post-padding): padding tokens are appended to the end of sequences. This is the natural choice for causal (decoder-only) transformers because:

  • Causal attention: The attention mask naturally prevents padding tokens from attending to or being attended by real tokens on their left.
  • Positional encoding consistency: Real tokens maintain their original positional indices starting from 0.
  • Last-token extraction: The reward is extracted from the last non-padding token, which is easier to locate with right-padding (it is always before the first padding token).

Left-padding would shift the position indices of real tokens, requiring more complex position ID handling and potentially degrading model performance if the model was pre-trained with right-padding.
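With right-padding, the index of the last non-padding token in each row can be recovered by counting real tokens, as in this sketch (it assumes a dedicated padding id that never occurs inside the real sequence):

```python
import torch

def last_token_indices(input_ids, pad_token_id):
    """Index of the last non-padding token in each right-padded row.

    Assumes right-padding with a pad id that does not appear among
    the real tokens, so counting non-pad entries gives the length.
    """
    non_pad = (input_ids != pad_token_id).sum(dim=1)
    return non_pad - 1  # last real token sits at position length - 1
```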

Usage

Use preference collation whenever:

  • Training a reward model on preference data and needing to batch variable-length sequences.
  • Evaluating a reward model on preference pairs.
  • Creating a PyTorch DataLoader that handles preference datasets with the collate_fn parameter.
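Wiring a collate function into a PyTorch DataLoader via `collate_fn` can be sketched as follows; the dataset and collate function here are toy stand-ins for illustration, not Open Instruct's actual code:

```python
from functools import partial

import torch
from torch.utils.data import DataLoader

def preference_collate(batch, pad_token_id):
    """Pad both chosen and rejected sequences to one shared max length."""
    max_len = max(
        max(len(ex["chosen_input_ids"]) for ex in batch),
        max(len(ex["rejected_input_ids"]) for ex in batch),
    )
    pad = lambda s: s + [pad_token_id] * (max_len - len(s))
    return {
        key: torch.tensor([pad(ex[key]) for ex in batch])
        for key in ("chosen_input_ids", "rejected_input_ids")
    }

# A toy in-memory dataset: a plain list of dicts works as a
# map-style dataset for DataLoader.
dataset = [
    {"chosen_input_ids": [1, 2, 3], "rejected_input_ids": [4, 5]},
    {"chosen_input_ids": [6, 7], "rejected_input_ids": [8, 9, 10]},
]

loader = DataLoader(
    dataset,
    batch_size=2,
    collate_fn=partial(preference_collate, pad_token_id=0),
)
```

In practice the pad id would come from the tokenizer (e.g. its padding-token id) rather than a hard-coded 0.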

Theoretical Basis

Batch Construction

Given a batch of $B$ preference examples, each containing chosen tokens $c_i = (c_{i,1}, \ldots, c_{i,T_i^c})$ and rejected tokens $r_i = (r_{i,1}, \ldots, r_{i,T_i^r})$:

Step 1: Determine the maximum length: $T_{\max} = \max\left(\max_{1 \le i \le B} T_i^c,\ \max_{1 \le i \le B} T_i^r\right)$

Step 2: Pad all sequences to $T_{\max}$: $\hat{c}_i = (c_{i,1}, \ldots, c_{i,T_i^c}, \underbrace{\text{PAD}, \ldots, \text{PAD}}_{T_{\max} - T_i^c})$ and $\hat{r}_i = (r_{i,1}, \ldots, r_{i,T_i^r}, \underbrace{\text{PAD}, \ldots, \text{PAD}}_{T_{\max} - T_i^r})$

Step 3: Stack into tensors $C \in \mathbb{Z}^{B \times T_{\max}}$ and $R \in \mathbb{Z}^{B \times T_{\max}}$

Memory Efficiency

The padding overhead depends on the variance of sequence lengths in the batch. For a batch where the longest sequence has length $T_{\max}$ and the average length is $\bar{T}$:

$\text{Padding ratio} = 1 - \frac{\bar{T}}{T_{\max}}$
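As a concrete illustration of the padding-ratio formula, for a batch of sequence lengths 10, 50, and 100:

```python
lengths = [10, 50, 100]
t_max = max(lengths)                  # 100
t_bar = sum(lengths) / len(lengths)  # average length, about 53.3

padding_ratio = 1 - t_bar / t_max
# Roughly 47% of the positions in the padded batch are padding tokens.
```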

Higher variance in sequence lengths leads to more wasted computation on padding tokens. Strategies to mitigate this include:

  • Length-based bucketing: Grouping sequences of similar length into the same batch.
  • Dynamic batching: Adjusting batch size based on total token count rather than fixed example count.
  • Truncation: Enforcing a maximum token length during dataset preparation (done via max_token_length in Open Instruct's dataset transformation step).
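A simple form of length-based bucketing is to sort examples by the longer sequence of each pair (the quantity that drives $T_{\max}$) before slicing into batches, so neighbors in the sorted order land in the same batch. A sketch, not Open Instruct's actual batching code:

```python
def bucket_by_length(examples, batch_size):
    """Group examples of similar length into the same batch."""
    # Sort by the pair's longer sequence, since that drives the
    # batch's padded length.
    key = lambda ex: max(
        len(ex["chosen_input_ids"]), len(ex["rejected_input_ids"])
    )
    ordered = sorted(examples, key=key)
    return [
        ordered[i:i + batch_size]
        for i in range(0, len(ordered), batch_size)
    ]
```

Note that bucketing trades some shuffling randomness for less padding, so implementations often shuffle within buckets or re-shuffle bucket order each epoch.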

Uniform Padding for Chosen/Rejected

Using the same $T_{\max}$ for both chosen and rejected sequences (rather than separate maxima) ensures that:

$\text{shape}(C) = \text{shape}(R) = (B, T_{\max})$

This enables the concatenation operation in the training loop:

$\text{query\_responses} = \text{cat}(C, R, \text{dim}=0) \in \mathbb{Z}^{2B \times T_{\max}}$

which doubles the effective batch size for a single forward pass while maintaining alignment between chosen and rejected indices.
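The concatenation trick can be sketched as follows: because chosen rows occupy indices $0 \ldots B{-}1$ and rejected rows $B \ldots 2B{-}1$, the scores from a single forward pass can be split back into aligned chosen/rejected halves (the mean below is a stand-in for a real reward model):

```python
import torch

B, T = 4, 16
chosen = torch.randint(0, 100, (B, T))
rejected = torch.randint(0, 100, (B, T))

# Stack along the batch dimension for a single forward pass.
query_responses = torch.cat([chosen, rejected], dim=0)  # shape (2B, T)

# Stand-in for the reward model: any map from (2B, T) to (2B,) scores.
rewards = query_responses.float().mean(dim=1)

# Recover aligned per-pair scores: row i of each half refers to
# the same prompt.
chosen_rewards, rejected_rewards = rewards[:B], rewards[B:]
```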

Related Pages

Implemented By

Related Principles
