Principle: AllenAI open-instruct Sequence Packing
| Knowledge Sources | |
|---|---|
| Domains | Training Efficiency, Reinforcement Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Sequence packing is the technique of concatenating multiple variable-length sequences into fixed-size buffers with 3D attention masks to eliminate padding waste during RL training.
Description
In GRPO training, each rollout produces completions of highly variable length. Some responses may be a few tokens (immediate answers), while others may be thousands of tokens (detailed reasoning chains). Without packing, each completion occupies a full pack_length buffer, with the unused positions filled by padding tokens that waste GPU compute and memory during the forward pass.
Sequence packing solves this by concatenating multiple query-response pairs into a single sequence, up to the pack_length limit. This approach:
- Reduces wasted computation by eliminating padding.
- Enables higher effective batch sizes within the same GPU memory budget.
- Requires 3D attention masks (also called intra-document masks) to ensure that tokens in one packed sequence cannot attend to tokens from a different packed sequence.
The packing algorithm is a greedy first-fit approach: sequences are added to the current pack until the next sequence would exceed the pack length, at which point a new pack is started.
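A minimal sketch of this greedy packing loop (function name and signature are illustrative, not the actual open-instruct API):

```python
def pack_sequences(sequences, pack_length):
    """Greedily pack token lists: keep appending sequences to the current
    pack until the next sequence would overflow pack_length, then start
    a new pack. Each input sequence must fit in a single pack."""
    packs = []
    current = []
    for seq in sequences:
        if len(seq) > pack_length:
            raise ValueError("sequence longer than pack_length")
        if len(current) + len(seq) > pack_length:
            packs.append(current)   # current pack is full; start a new one
            current = []
        current.extend(seq)
    if current:
        packs.append(current)       # flush the last partially filled pack
    return packs
```

Note that this scheme only tries the current pack, so packing efficiency depends on the order in which sequences arrive; it trades some density for a simple single-pass loop.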
Usage
Sequence packing is applied after generation and reward computation, before the training step. It transforms the list of variable-length query-response pairs into a smaller list of fixed-size packed sequences that are ready for the training forward/backward pass.
Theoretical Basis
Packing Efficiency
The packing ratio measures how efficiently sequences are packed:
packing_ratio = num_packed_sequences / num_original_sequences
If packing_ratio = 0.5, each pack contains ~2 sequences on average
=> 2x reduction in batch dimension
=> ~2x less padding waste
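As a worked example (the counts here are hypothetical, chosen to match the 0.5 ratio above):

```python
num_original_sequences = 64   # variable-length rollout completions
num_packed_sequences = 32     # packs produced by the greedy packer

packing_ratio = num_packed_sequences / num_original_sequences   # 0.5
avg_seqs_per_pack = 1 / packing_ratio                           # ~2 sequences per pack
```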
3D Attention Masks
Standard causal attention masks are 2D (lower-triangular). Packed sequences require a 3D mask that creates block-diagonal attention patterns:
Pack = [seq_A (len 3) | seq_B (len 4) | padding (len 1)]
Standard 2D causal mask (WRONG - allows cross-attention between sequences):
1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
0 0 0 0 0 0 0 0
3D intra-document mask (CORRECT - isolation between sequences):
1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 1 1 0 0 0
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 0
0 0 0 0 0 0 0 0
Each sequence attends only to itself, preserving the causal structure while allowing multiple sequences to share the same batch position.
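The block-diagonal mask above can be derived from the per-sequence lengths alone. A pure-Python sketch (the function name is hypothetical; real implementations typically build this as a tensor or pass document IDs to a fused attention kernel):

```python
def make_intra_document_mask(seq_lens, pack_length):
    """Build a block-diagonal causal mask for one pack: position i may
    attend to position j only if j <= i and both positions belong to the
    same sub-sequence. Padding positions attend to nothing."""
    # Label each position with the index of its sub-sequence; -1 marks padding.
    doc_ids = [-1] * pack_length
    pos = 0
    for doc, n in enumerate(seq_lens):
        doc_ids[pos:pos + n] = [doc] * n
        pos += n
    return [
        [1 if doc_ids[i] >= 0 and doc_ids[i] == doc_ids[j] and j <= i else 0
         for j in range(pack_length)]
        for i in range(pack_length)
    ]
```

Calling `make_intra_document_mask([3, 4], 8)` reproduces the 8x8 matrix shown above: a 3x3 causal block for seq_A, a 4x4 causal block for seq_B, and an all-zero row for the padding position.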
Position IDs
With packed sequences, position IDs must be reset at each sequence boundary:
Pack = [seq_A (len 3) | seq_B (len 4) | padding]
Position IDs = [0, 1, 2, 0, 1, 2, 3, 0]
This ensures that rotary position embeddings (RoPE) correctly encode the position of each token within its own sequence.
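A small sketch of the position-ID reset, assuming padding positions default to 0 as in the example above (the helper name is illustrative):

```python
def make_position_ids(seq_lens, pack_length):
    """Restart position IDs at 0 at each sequence boundary so RoPE encodes
    within-sequence positions; padding positions are filled with 0."""
    pos_ids = []
    for n in seq_lens:
        pos_ids.extend(range(n))            # 0..n-1 for each sub-sequence
    pos_ids.extend([0] * (pack_length - len(pos_ids)))  # pad with zeros
    return pos_ids
```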
Response Masks
The response mask tracks which tokens correspond to model-generated responses (as opposed to prompt tokens). This is critical because:
- Loss is only computed on response tokens.
- Advantages are only applied to response tokens.
- Log-probabilities are only compared for response tokens.
The response mask also supports tool use masking: when a tool is called, the tool output tokens are injected into the response but should not receive gradient signal (the model did not generate them). These tokens are masked out via the tool mask mechanism.
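A minimal sketch of this masking logic, assuming each token carries a role label (the `token_roles` representation is hypothetical; real implementations track these spans during generation):

```python
def build_response_mask(token_roles):
    """Return 1 for model-generated response tokens (loss, advantages, and
    log-prob comparisons apply) and 0 for prompt tokens and tool-output
    tokens injected into the response (no gradient signal)."""
    return [1 if role == "response" else 0 for role in token_roles]
```

For example, a rollout of prompt tokens, a generated prefix, an injected tool result, and a generated continuation would be masked as `[0, 0, 1, 0, 1]` per the rules above.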