Principle: Allenai Open Instruct Preference Collation
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Data Processing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Preference collation is the process of batching and padding variable-length chosen/rejected token sequence pairs into uniform-size tensors suitable for efficient batch processing by a reward model during training and evaluation.
Description
Preference datasets for reward model training consist of pairs of tokenized sequences: a "chosen" completion and a "rejected" completion for each prompt. These sequences typically have different lengths both within a pair (the chosen response may be longer or shorter than the rejected response) and across different examples in the dataset.
To process these sequences efficiently in mini-batches on GPUs, they must be collated into uniform-size tensors. Preference collation addresses this by:
- Finding the single global maximum sequence length across all chosen and rejected sequences in the batch.
- Padding all sequences (both chosen and rejected) to this maximum length using the tokenizer's padding token.
- Converting the padded sequences to PyTorch tensors.
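The three steps above can be sketched as a collate function. This is a minimal illustration, assuming each example is a dict with `chosen_input_ids` and `rejected_input_ids` token-id lists; the function and field names are illustrative, not Open Instruct's exact API:

```python
import torch

def preference_collate(batch, pad_token_id):
    # Hypothetical example layout: each item holds token-id lists for the pair.
    chosen = [ex["chosen_input_ids"] for ex in batch]
    rejected = [ex["rejected_input_ids"] for ex in batch]

    # Single global maximum over both chosen and rejected sequences.
    max_len = max(len(seq) for seq in chosen + rejected)

    def pad(seqs):
        # Right-pad every sequence to the shared maximum length.
        return torch.tensor(
            [seq + [pad_token_id] * (max_len - len(seq)) for seq in seqs]
        )

    return {"chosen": pad(chosen), "rejected": pad(rejected)}
```

Because both tensors are padded to the same `max_len`, they can later be concatenated along the batch dimension without any reshaping.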
A critical design decision is that both chosen and rejected sequences are padded to the same maximum length. This is important because in the training loop, chosen and rejected sequences are concatenated along the batch dimension before being passed through the reward model. If they had different padding lengths, the concatenation would fail or require additional handling.
Padding Direction
The standard approach in Open Instruct is to use right-padding (post-padding): padding tokens are appended to the end of sequences. This is the natural choice for causal (decoder-only) transformers because:
- Causal attention: Because each position can only attend to positions at or before it, real tokens never attend to the padding appended after them; padding positions are simply ignored when extracting outputs.
- Positional encoding consistency: Real tokens maintain their original positional indices starting from 0.
- Last-token extraction: The reward is extracted from the last non-padding token, which is easier to locate with right-padding (it is always before the first padding token).
Left-padding would shift the position indices of real tokens, requiring more complex position ID handling and potentially degrading model performance if the model was pre-trained with right-padding.
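To make the last-token-extraction point concrete, here is a small sketch that locates the last non-padding position in a right-padded batch from the attention mask (illustrative code, not the library's exact implementation):

```python
import torch

# Right-padded batch: real tokens first, pad_token_id (here 0) appended.
input_ids = torch.tensor([
    [5, 9, 2, 0, 0],   # length 3
    [7, 3, 8, 6, 0],   # length 4
])
pad_token_id = 0

# With right-padding, the last real token sits at (sequence length - 1).
attention_mask = (input_ids != pad_token_id).long()
last_indices = attention_mask.sum(dim=1) - 1

# A reward model would gather its scalar-head output at these positions;
# here we gather the token ids themselves just to show the indexing.
last_tokens = input_ids.gather(1, last_indices.unsqueeze(1)).squeeze(1)
print(last_tokens)  # the final real token of each sequence
```

With left-padding, the last real token would always be at index `T - 1`, but every real token's position index would shift by the padding amount, which is the position-ID complication described above.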
Usage
Use preference collation whenever:
- Training a reward model on preference data that requires batching variable-length sequences.
- Evaluating a reward model on preference pairs.
- Creating a PyTorch DataLoader that handles preference datasets with the `collate_fn` parameter.
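A usage sketch wiring such a collate function into a `torch.utils.data.DataLoader` via its `collate_fn` parameter; the toy dataset and the collate function's field names are illustrative assumptions, not Open Instruct's exact interfaces:

```python
import torch
from torch.utils.data import DataLoader

PAD_ID = 0  # assumed tokenizer pad token id

def preference_collate(batch, pad_token_id=PAD_ID):
    # Pad chosen/rejected pairs to a single shared maximum length.
    chosen = [ex["chosen_input_ids"] for ex in batch]
    rejected = [ex["rejected_input_ids"] for ex in batch]
    max_len = max(len(s) for s in chosen + rejected)

    def pad(seqs):
        return torch.tensor(
            [s + [pad_token_id] * (max_len - len(s)) for s in seqs]
        )

    return {"chosen": pad(chosen), "rejected": pad(rejected)}

# Toy in-memory preference dataset; a list of dicts works as a
# map-style dataset because it supports __getitem__ and __len__.
dataset = [
    {"chosen_input_ids": [1, 2, 3], "rejected_input_ids": [4, 5]},
    {"chosen_input_ids": [6, 7], "rejected_input_ids": [8, 9, 10, 11]},
]

loader = DataLoader(dataset, batch_size=2, collate_fn=preference_collate)
for batch in loader:
    print(batch["chosen"].shape, batch["rejected"].shape)
```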
Theoretical Basis
Batch Construction
Given a batch of $B$ preference examples, each containing chosen tokens $c^{(i)}$ and rejected tokens $r^{(i)}$, $i = 1, \ldots, B$:
Step 1: Determine maximum length

$$T_{\max} = \max_{1 \le i \le B} \max\left(|c^{(i)}|, |r^{(i)}|\right)$$

Step 2: Pad all sequences to $T_{\max}$

$$\tilde{c}^{(i)} = \left[c^{(i)}_1, \ldots, c^{(i)}_{|c^{(i)}|}, \underbrace{p, \ldots, p}_{T_{\max} - |c^{(i)}|}\right]$$

where $p$ is the tokenizer's padding token id, and $\tilde{r}^{(i)}$ is defined analogously.
Step 3: Stack into tensors

$$C, R \in \mathbb{Z}^{B \times T_{\max}}$$

where row $i$ of $C$ is $\tilde{c}^{(i)}$ and row $i$ of $R$ is $\tilde{r}^{(i)}$.
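As a worked instance of these steps (an illustrative example, not from the source): suppose $B = 2$ with lengths $|c^{(1)}| = 5$, $|r^{(1)}| = 3$, $|c^{(2)}| = 2$, $|r^{(2)}| = 7$. Then $T_{\max} = 7$, every sequence is right-padded to 7 tokens, and the stacked tensors satisfy $C, R \in \mathbb{Z}^{2 \times 7}$.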
Memory Efficiency
The padding overhead depends on the variance of sequence lengths in the batch. For a batch where the longest sequence has length $T_{\max}$ and the average length is $\bar{T}$:

$$\text{padding overhead} = 1 - \frac{\bar{T}}{T_{\max}}$$
Higher variance in sequence lengths leads to more wasted computation on padding tokens. Strategies to mitigate this include:
- Length-based bucketing: Grouping sequences of similar length into the same batch.
- Dynamic batching: Adjusting batch size based on total token count rather than fixed example count.
- Truncation: Enforcing a maximum token length during dataset preparation (done via `max_token_length` in Open Instruct's dataset transformation step).
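The first mitigation can be sketched in a few lines: compute the wasted-slot fraction, then sort example indices by length before chunking into batches so each batch holds sequences of similar length. The helper names are made up for illustration:

```python
def padding_overhead(lengths):
    # Fraction of tensor slots wasted on padding: 1 - mean / max.
    return 1 - sum(lengths) / (len(lengths) * max(lengths))

def length_bucketed_batches(lengths, batch_size):
    # Sort example indices by sequence length, then chunk into batches,
    # so each batch contains sequences of similar length.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

lengths = [12, 100, 15, 96, 11, 98]

# One batch of everything: the short sequences are padded out to 100 tokens.
print(round(padding_overhead(lengths), 2))

# Bucketed batches of 3 keep similar lengths together, cutting the waste.
for batch in length_bucketed_batches(lengths, 3):
    batch_lengths = [lengths[i] for i in batch]
    print(batch, round(padding_overhead(batch_lengths), 2))
```

Note that bucketing trades padding efficiency for some loss of shuffling randomness, which is why it is usually combined with shuffling inside length buckets.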
Uniform Padding for Chosen/Rejected
Using the same $T_{\max}$ for both chosen and rejected sequences (rather than separate maxima) ensures that:

$$C, R \in \mathbb{Z}^{B \times T_{\max}}$$

i.e., both tensors share exactly the same shape. This enables the concatenation operation in the training loop:

$$\text{query\_responses} = \text{cat}(C, R, \text{dim}=0) \in \mathbb{Z}^{2B \times T_{\max}}$$
which doubles the effective batch size for a single forward pass while maintaining alignment between chosen and rejected indices.
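A sketch of that concatenation and the corresponding split of the scores back into aligned chosen/rejected halves. A random score vector stands in for the reward model's scalar outputs, and the Bradley-Terry-style pairwise loss shown is a common choice for reward model training, not necessarily Open Instruct's exact formulation:

```python
import torch

B, T_max = 4, 8
# Padded chosen and rejected token tensors, both shape (B, T_max).
C = torch.randint(0, 1000, (B, T_max))
R = torch.randint(0, 1000, (B, T_max))

# Concatenate along the batch dimension: one forward pass scores both halves.
query_responses = torch.cat((C, R), dim=0)

# Stand-in for the reward model: one scalar score per sequence.
scores = torch.randn(2 * B)

# Row i of the chosen half lines up with row i of the rejected half,
# so splitting at B recovers aligned (chosen, rejected) score pairs.
chosen_scores, rejected_scores = scores[:B], scores[B:]
loss = -torch.nn.functional.logsigmoid(chosen_scores - rejected_scores).mean()
print(query_responses.shape, loss.item())
```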