Principle:Hiyouga LLaMA Factory Sequence Packing Theory
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Training Efficiency |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Sequence packing is a training efficiency technique that concatenates multiple shorter sequences into a single batch entry up to the model's maximum context length, eliminating wasted computation on padding tokens.
Description
Standard batched training of language models requires all sequences in a batch to have the same length. Short sequences are padded to the length of the longest sequence in the batch, and these padding tokens consume computation and memory without contributing to learning. For datasets with highly variable sequence lengths, the proportion of wasted computation can be substantial.
Sequence packing solves this by concatenating multiple sequences end-to-end within a single batch entry. A key challenge is ensuring that the model's self-attention mechanism does not allow tokens from different packed sequences to attend to each other. LLaMA-Factory addresses this through block-diagonal attention masking, which constructs an attention mask that creates isolated attention blocks for each packed sequence.
The implementation uses two complementary components:
- Packing logic in the data processor, which uses a greedy knapsack algorithm to optimally group sequences so their combined lengths approach but do not exceed the cutoff length.
- Block-diagonal attention in the model layer, which monkey-patches FlashAttention's
_get_unpad_datafunction to compute per-sequence cumulative sequence lengths (cu_seqlens) and indices, enabling FlashAttention's variable-length (varlen) kernels to handle packed sequences correctly.
The attention mask encodes packing information using integer labels rather than binary values: each position carries an integer indicating which original sequence it belongs to. For example, an attention mask of [1, 1, 2, 2, 2, 0] indicates two packed sequences (lengths 2 and 3) followed by one padding position.
Usage
Sequence packing should be used when:
- Training on datasets with highly variable sequence lengths (e.g., instruction-tuning datasets where responses range from a few tokens to thousands).
- GPU utilization is low due to excessive padding in standard batching.
- Using FlashAttention-2, which natively supports variable-length sequences via its
flash_attn_varleninterface. - The
neat_packingoption can be enabled to use int32 attention masks for datasets that require precise sequence boundary tracking.
Theoretical Basis
The core efficiency gain comes from the elimination of padding overhead. Given a batch of sequences with lengths and a maximum context length , standard padding wastes:
With packing, the utilization rate approaches:
where is the number of packed batch entries, typically much smaller than .
The greedy knapsack algorithm used for grouping sequences operates as follows:
def greedy_knapsack(numbers: list[int], capacity: int) -> list[list[int]]:
numbers.sort() # ascending order for binary search
knapsacks = []
while numbers:
current_knapsack = []
remaining_capacity = capacity
while True:
index = search_for_fit(numbers, remaining_capacity)
if index == -1:
break
remaining_capacity -= numbers[index]
current_knapsack.append(numbers.pop(index))
knapsacks.append(current_knapsack)
return knapsacks
This algorithm uses binary search (bisect) to efficiently find the largest sequence that fits within the remaining capacity of each knapsack, producing a near-optimal packing with complexity.
For block-diagonal attention, the packed attention mask is transformed into FlashAttention-compatible data structures:
# Extract per-sequence lengths from integer-labeled attention mask
seqlens_in_batch = get_seqlens_in_batch(attention_mask) # e.g., [2, 3, 1, 2, 3]
# Compute cumulative sequence lengths for varlen attention
cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
# e.g., [0, 2, 5, 6, 8, 11]
The cu_seqlens tensor defines the boundaries of each sequence within the packed representation, enabling FlashAttention's variable-length kernel to apply causal attention independently within each segment while preventing cross-sequence attention.
The sequence length inference function balances source and target truncation to preserve as much information as possible:
Failed to parse (syntax error): {\displaystyle \text{new\_target\_len} = \min\left(\text{max\_target\_len}, \text{target\_len}\right) }
where Failed to parse (syntax error): {\displaystyle \text{max\_target\_len}} is determined by the relative proportion of source and target in the available cutoff length.