Principle:Huggingface Transformers Sequence Packing

Knowledge Sources	Efficient Training Techniques Transformers Docs
Domains	Data_Processing, Training
Last Updated	2026-02-13 00:00 GMT

Overview

Sequence packing concatenates variable-length tokenized sequences into fixed-length blocks to eliminate padding waste and maximize GPU utilization during training.

Description

In language model training, input sequences from a dataset rarely have uniform length. The naive approach -- padding every sequence to the maximum length -- wastes significant computation on padding tokens that carry no training signal. Sequence packing solves this by concatenating all tokenized sequences into a single long stream of tokens and then slicing this stream into fixed-length blocks, each of which yields one training example of seq_len tokens.

The packing process works as follows:

All tokenized sequences from the dataset are flattened into a single contiguous list of tokens.
This list is divided into blocks of seq_len + 1 tokens, where the extra token allows construction of input-label pairs: the input is the first seq_len tokens and the label is the last seq_len tokens (shifted by one position).
Each packed block is an independent training example with no padding tokens.
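The steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name pack_sequences and the dictionary keys input_ids and labels are assumptions chosen for the example, and the incomplete tail of the stream is simply dropped.

```python
from itertools import chain


def pack_sequences(tokenized, seq_len):
    """Pack variable-length token lists into input/label pairs of length seq_len."""
    # Step 1: flatten every sequence into one contiguous token stream.
    stream = list(chain.from_iterable(tokenized))

    # Step 2: cut the stream into non-overlapping blocks of seq_len + 1 tokens.
    block = seq_len + 1
    num_blocks = len(stream) // block  # the incomplete tail is discarded

    examples = []
    for b in range(num_blocks):
        chunk = stream[b * block:(b + 1) * block]
        # Step 3: inputs are the first seq_len tokens, labels the last seq_len.
        examples.append({"input_ids": chunk[:-1], "labels": chunk[1:]})
    return examples


# Four toy "documents" with 11 tokens in total.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]]
packed = pack_sequences(docs, seq_len=4)
for example in packed:
    print(example)
```

With seq_len = 4 the 11-token stream yields two blocks of 5 tokens, and the final token is discarded as the tail. Note that the first block mixes tokens from the first two documents, which is the cross-document effect discussed in the trade-off.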

This approach ensures that every token in every training batch contributes to the loss, maximizing the effective throughput. It is especially important in distributed training where each GPU processes a local mini-batch and any padding waste is multiplied across all replicas.

Trade-off: Packing can cause cross-document attention, where attention spans across the boundary of two original documents concatenated within the same packed sequence. For many training scenarios (especially pretraining), this is acceptable. For fine-tuning tasks where document boundaries are semantically important, attention masking may be needed at document boundaries.
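One way to picture boundary masking is a causal mask that also requires both positions to belong to the same original document. The sketch below is illustrative only: doc_ids is a hypothetical per-token document index that would have to be recorded during packing, and how such a mask is passed to a model depends on the attention implementation in use.

```python
def document_causal_mask(doc_ids):
    """Return mask[i][j] == True when position i may attend to position j."""
    n = len(doc_ids)
    return [
        # Causal (j <= i) and restricted to tokens of the same document.
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical packed block: three tokens from document 0, two from document 1.
doc_ids = [0, 0, 0, 1, 1]
mask = document_causal_mask(doc_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is block-diagonal and lower-triangular: tokens of the second document cannot see any token of the first.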

Usage

Use sequence packing when:

Pretraining or continued pretraining a causal language model where cross-document attention is acceptable.
The dataset contains variable-length sequences and you want to avoid padding waste.
Training with fixed-length context windows (as required by context parallelism).
Maximizing GPU utilization and training throughput is a priority.

Sequence packing is typically applied as a dataset preprocessing step before creating the DataLoader, and the packed dataset is then shuffled to randomize the order of packed blocks.

Theoretical Basis

The efficiency gain from packing can be quantified by the packing efficiency:

packing_efficiency = total_real_tokens / (num_sequences * seq_len)

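Here num_sequences counts the rows actually fed to the model. With padding, each original sequence becomes one row of seq_len positions, so the efficiency is avg_len / seq_len, where avg_len is the average sequence length. With packing, every row is full, so the efficiency approaches 1.0; the only loss is a small tail of discarded tokens at the end of the dataset.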

For example, if the average sequence length is 256 tokens and every sequence is padded to seq_len = 1024, the padding efficiency is 256 / 1024 = 0.25, so 75% of the computed token positions are padding. Packing recovers nearly all of this waste, for up to roughly a 4x improvement in effective throughput.
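The arithmetic can be checked with a short script. The dataset here is hypothetical (1000 sequences of exactly 256 tokens), and the packed rows use blocks of seq_len + 1 tokens as described in the packing process.

```python
# Hypothetical dataset: 1000 sequences of 256 tokens each.
lengths = [256] * 1000
seq_len = 1024

total_tokens = sum(lengths)

# Padding: one row of seq_len positions per original sequence.
padded_rows = len(lengths)
padded_eff = total_tokens / (padded_rows * seq_len)

# Packing: rows are cut from the flattened stream, so every row is full.
block = seq_len + 1  # one extra token for the shifted labels
packed_rows = total_tokens // block
tail = total_tokens - packed_rows * block  # tokens discarded at the end

print("padding efficiency:", padded_eff)
print("packed rows:", packed_rows, "discarded tail tokens:", tail)
print("row reduction factor:", padded_rows / packed_rows)
```

The same real tokens fit in 249 packed rows instead of 1000 padded rows, a reduction of just over 4x, with 775 tokens dropped in the tail.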

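In distributed training, the relative gain is the same on every rank, but the absolute savings scale with the number of data-parallel ranks. Continuing the example above, with dp_size = 4 and a per-rank batch of batch_size sequences, padding wastes 4 * 0.75 * batch_size * seq_len token computations per step, which packing avoids.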

For causal language models, packing preserves the autoregressive property within each block because the label for position i is the token at position i+1, regardless of which original document it came from.

Related Pages

Implemented By

Implementation:Huggingface_Transformers_Create_Packed_Sequences
