

Principle:Allenai Open instruct Sequence Packing

From Leeroopedia


Knowledge Sources
Domains Training Efficiency Reinforcement Learning
Last Updated 2026-02-07 00:00 GMT

Overview

Sequence packing is the technique of concatenating multiple variable-length sequences into fixed-size buffers, paired with 3D attention masks, to eliminate padding waste during RL training.

Description

In GRPO training, each rollout produces completions of highly variable length. Some responses may be a few tokens (immediate answers), while others may be thousands of tokens (detailed reasoning chains). Without packing, each completion occupies a full pack_length buffer, with the unused positions filled by padding tokens that waste GPU compute and memory during the forward pass.

Sequence packing solves this by concatenating multiple query-response pairs into a single sequence, up to the pack_length limit. This approach:

  • Reduces wasted computation by eliminating padding.
  • Enables higher effective batch sizes within the same GPU memory budget.
  • Requires 3D attention masks (also called intra-document masks) to ensure that tokens in one packed sequence cannot attend to tokens from a different packed sequence.

The packing algorithm is a greedy first-fit approach: sequences are added to the current pack until the next sequence would exceed the pack length, at which point a new pack is started.
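The greedy packing loop described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not open-instruct's actual implementation; the function name is hypothetical, sequences are represented as lists of token ids, and it assumes no single sequence exceeds pack_length.

```python
def pack_sequences(seqs, pack_length):
    """Greedily pack variable-length sequences into buffers of at most
    pack_length tokens: append to the current pack until the next sequence
    would overflow, then start a new pack. Returns the packs together with
    the per-pack sequence lengths (needed later for masks and position ids).
    Assumes no single sequence is longer than pack_length."""
    packs, pack_lens = [], []
    cur, cur_lens = [], []
    for seq in seqs:
        # Flush the current pack if this sequence would not fit.
        if cur and len(cur) + len(seq) > pack_length:
            packs.append(cur)
            pack_lens.append(cur_lens)
            cur, cur_lens = [], []
        cur += seq
        cur_lens.append(len(seq))
    if cur:  # flush the final, possibly partial, pack
        packs.append(cur)
        pack_lens.append(cur_lens)
    return packs, pack_lens
```

For example, with pack_length = 8, the sequences [1,2,3], [4,5,6,7], [8,9] yield two packs: the first two sequences share a pack (3 + 4 = 7 tokens), and the third starts a new one.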

Usage

Sequence packing is applied after generation and reward computation, before the training step. It transforms the list of variable-length query-response pairs into a smaller list of fixed-size packed sequences that are ready for the training forward/backward pass.

Theoretical Basis

Packing Efficiency

The packing ratio measures how efficiently sequences are packed:

packing_ratio = num_packed_sequences / num_original_sequences

If packing_ratio = 0.5, each pack contains ~2 sequences on average
=> 2x reduction in batch dimension
=> ~2x less padding waste
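The arithmetic above, with hypothetical counts plugged in:

```python
# Hypothetical counts: 64 rollouts packed into 32 fixed-size buffers.
num_original_sequences = 64
num_packed_sequences = 32

packing_ratio = num_packed_sequences / num_original_sequences  # 0.5
avg_seqs_per_pack = 1 / packing_ratio  # ~2 sequences per pack on average
```

A packing_ratio of 0.5 halves the batch dimension, and with it roughly halves the padding waste.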

3D Attention Masks

Standard causal attention masks are 2D (lower-triangular). Packed sequences require a 3D mask that creates block-diagonal attention patterns:

Pack = [seq_A (len 3) | seq_B (len 4) | padding (len 1)]

Standard 2D causal mask (WRONG - allows cross-attention between sequences):
1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
0 0 0 0 0 0 0 0

3D intra-document mask (CORRECT - isolation between sequences):
1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0
1 1 1 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 1 1 0 0 0
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 0
0 0 0 0 0 0 0 0

Each sequence attends only to itself, preserving the causal structure while allowing multiple sequences to share the same batch position.
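The block-diagonal mask above is fully determined by the per-sequence lengths within a pack. A minimal sketch in plain Python (hypothetical function name) that reproduces the 8-token example:

```python
def intra_document_causal_mask(seq_lens, pack_length):
    """Block-diagonal causal mask for one packed sequence.
    mask[i][j] == 1 iff token j is visible to token i: both tokens belong
    to the same sub-sequence and j <= i. Padding attends to nothing."""
    # Assign a document id to each position; -1 marks padding.
    doc = [-1] * pack_length
    pos = 0
    for d, n in enumerate(seq_lens):
        for _ in range(n):
            doc[pos] = d
            pos += 1
    return [
        [1 if doc[i] != -1 and doc[i] == doc[j] and j <= i else 0
         for j in range(pack_length)]
        for i in range(pack_length)
    ]
```

Calling intra_document_causal_mask([3, 4], 8) produces exactly the 3D intra-document mask shown above: a lower-triangular block for seq_A, a separate one for seq_B, and an all-zero row for the padding position.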

Position IDs

With packed sequences, position IDs must be reset at each sequence boundary:

Pack = [seq_A (len 3) | seq_B (len 4) | padding]
Position IDs = [0, 1, 2, 0, 1, 2, 3, 0]

This ensures that rotary position embeddings (RoPE) correctly encode the position of each token within its own sequence.
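The position-ID reset can be derived from the same per-sequence lengths. A sketch (hypothetical helper; padding positions are assigned position 0, matching the example above):

```python
def packed_position_ids(seq_lens, pack_length):
    """Position ids that restart at 0 at each sequence boundary,
    so RoPE encodes each token's offset within its own sequence.
    Padding positions are filled with 0."""
    pos_ids = []
    for n in seq_lens:
        pos_ids.extend(range(n))          # 0..n-1 for each sequence
    pos_ids.extend([0] * (pack_length - len(pos_ids)))  # pad with 0
    return pos_ids
```

For the pack above, packed_position_ids([3, 4], 8) returns [0, 1, 2, 0, 1, 2, 3, 0].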

Response Masks

The response mask tracks which tokens correspond to model-generated responses (as opposed to prompt tokens). This is critical because:

  • Loss is only computed on response tokens.
  • Advantages are only applied to response tokens.
  • Log-probabilities are only compared for response tokens.

The response mask also supports tool use masking: when a tool is called, the tool output tokens are injected into the response but should not receive gradient signal (the model did not generate them). These tokens are masked out via the tool mask mechanism.
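The combined effect of prompt, padding, and tool-output masking can be sketched as follows. This is an illustrative simplification with a hypothetical signature, not open-instruct's actual masking code; tool spans are given as (start, end) token offsets within the response.

```python
def build_response_mask(prompt_len, response_len, tool_spans, total_len):
    """1 for model-generated response tokens; 0 for prompt tokens,
    padding, and tool-output tokens injected into the response
    (the model did not generate those, so they get no gradient)."""
    mask = [0] * total_len
    # Mark the response region.
    for i in range(prompt_len, prompt_len + response_len):
        mask[i] = 1
    # Zero out tool-output spans within the response.
    for start, end in tool_spans:
        for i in range(prompt_len + start, prompt_len + end):
            mask[i] = 0
    return mask
```

For a 2-token prompt, 5-token response with a tool output at response offsets [1, 3), and total length 8, the mask is [0, 0, 1, 0, 0, 1, 1, 0]: loss, advantages, and log-probability comparisons touch only the 1 positions.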
