Principle: AllenAI open-instruct Sequence Packing
| Knowledge Sources | |
|---|---|
| Domains | Training Efficiency, Reinforcement Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Sequence packing is the technique of concatenating multiple variable-length sequences into fixed-size buffers with 3D attention masks to eliminate padding waste during RL training.
Description
In GRPO training, each rollout produces completions of highly variable length. Some responses may be a few tokens (immediate answers), while others may be thousands of tokens (detailed reasoning chains). Without packing, each completion occupies a full pack_length buffer, with the unused positions filled by padding tokens that waste GPU compute and memory during the forward pass.
Sequence packing solves this by concatenating multiple query-response pairs into a single sequence, up to the pack_length limit. This approach:
- Reduces wasted computation by eliminating padding.
- Enables higher effective batch sizes within the same GPU memory budget.
- Requires 3D attention masks (also called intra-document masks) to ensure that tokens in one packed sequence cannot attend to tokens from a different packed sequence.
The packing algorithm is a greedy first-fit approach: sequences are added to the current pack until the next sequence would exceed the pack length, at which point a new pack is started.
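A minimal sketch of this greedy packing loop (function name and signature are illustrative, not the actual open-instruct API):

```python
def pack_sequences(sequences, pack_length):
    """Greedily pack token lists: keep appending sequences to the current
    pack until the next sequence would overflow pack_length, then start
    a new pack. Each input sequence must fit in a single pack."""
    packs = []
    current = []
    for seq in sequences:
        if len(seq) > pack_length:
            raise ValueError("sequence longer than pack_length")
        if len(current) + len(seq) > pack_length:
            packs.append(current)   # current pack is full; start a new one
            current = []
        current.extend(seq)
    if current:
        packs.append(current)       # flush the last partially filled pack
    return packs
```

Note that this scheme only tries the current pack, so packing efficiency depends on the order in which sequences arrive; it trades some density for a simple single-pass loop.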
Usage
Sequence packing is applied after generation and reward computation, before the training step. It transforms the list of variable-length query-response pairs into a smaller list of fixed-size packed sequences that are ready for the training forward/backward pass.
Theoretical Basis
Packing Efficiency
The packing ratio measures how efficiently sequences are packed:
packing_ratio = num_packed_sequences / num_original_sequences
If packing_ratio = 0.5, each pack contains ~2 sequences on average
=> 2x reduction in batch dimension
=> ~2x less padding waste
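As a worked example (the counts here are hypothetical, chosen to match the 0.5 ratio above):

```python
num_original_sequences = 64   # variable-length rollout completions
num_packed_sequences = 32     # packs produced by the greedy packer

packing_ratio = num_packed_sequences / num_original_sequences   # 0.5
avg_seqs_per_pack = 1 / packing_ratio                           # ~2 sequences per pack
```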
3D Attention Masks
Standard causal attention masks are 2D (lower-triangular). Packed sequences require a 3D mask that creates block-diagonal attention patterns:
Pack = [seq_A (len 3) | seq_B (len 4) | padding (len 1)]
Standard 2D causal mask (WRONG - allows cross-attention between sequences):
1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
0 0 0 0 0 0 0 0
3D intra-document mask (CORRECT - isolation between sequences):
1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 1 1 0 0 0
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 0
0 0 0 0 0 0 0 0
Each sequence attends only to itself, preserving the causal structure while allowing multiple sequences to share the same batch position.
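The block-diagonal mask above can be derived from the per-sequence lengths alone. A pure-Python sketch (the function name is hypothetical; real implementations typically build this as a tensor or pass document IDs to a fused attention kernel):

```python
def make_intra_document_mask(seq_lens, pack_length):
    """Build a block-diagonal causal mask for one pack: position i may
    attend to position j only if j <= i and both positions belong to the
    same sub-sequence. Padding positions attend to nothing."""
    # Label each position with the index of its sub-sequence; -1 marks padding.
    doc_ids = [-1] * pack_length
    pos = 0
    for doc, n in enumerate(seq_lens):
        doc_ids[pos:pos + n] = [doc] * n
        pos += n
    return [
        [1 if doc_ids[i] >= 0 and doc_ids[i] == doc_ids[j] and j <= i else 0
         for j in range(pack_length)]
        for i in range(pack_length)
    ]
```

Calling `make_intra_document_mask([3, 4], 8)` reproduces the 8x8 matrix shown above: a 3x3 causal block for seq_A, a 4x4 causal block for seq_B, and an all-zero row for the padding position.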
Position IDs
With packed sequences, position IDs must be reset at each sequence boundary:
Pack = [seq_A (len 3) | seq_B (len 4) | padding]
Position IDs = [0, 1, 2, 0, 1, 2, 3, 0]
This ensures that rotary position embeddings (RoPE) correctly encode the position of each token within its own sequence.
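A small sketch of the position-ID reset, assuming padding positions default to 0 as in the example above (the helper name is illustrative):

```python
def make_position_ids(seq_lens, pack_length):
    """Restart position IDs at 0 at each sequence boundary so RoPE encodes
    within-sequence positions; padding positions are filled with 0."""
    pos_ids = []
    for n in seq_lens:
        pos_ids.extend(range(n))            # 0..n-1 for each sub-sequence
    pos_ids.extend([0] * (pack_length - len(pos_ids)))  # pad with zeros
    return pos_ids
```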
Response Masks
The response mask tracks which tokens correspond to model-generated responses (as opposed to prompt tokens). This is critical because:
- Loss is only computed on response tokens.
- Advantages are only applied to response tokens.
- Log-probabilities are only compared for response tokens.
The response mask also supports tool use masking: when a tool is called, the tool output tokens are injected into the response but should not receive gradient signal (the model did not generate them). These tokens are masked out via the tool mask mechanism.
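A minimal sketch of this masking logic, assuming each token carries a role label (the `token_roles` representation is hypothetical; real implementations track these spans during generation):

```python
def build_response_mask(token_roles):
    """Return 1 for model-generated response tokens (loss, advantages, and
    log-prob comparisons apply) and 0 for prompt tokens and tool-output
    tokens injected into the response (no gradient signal)."""
    return [1 if role == "response" else 0 for role in token_roles]
```

For example, a rollout of prompt tokens, a generated prefix, an injected tool result, and a generated continuation would be masked as `[0, 0, 1, 0, 1]` per the rules above.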