Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Hiyouga LLaMA Factory Sequence Packing Theory

From Leeroopedia


Knowledge Sources
Domains Deep Learning, Training Efficiency
Last Updated 2026-02-06 19:00 GMT

Overview

Sequence packing is a training efficiency technique that concatenates multiple shorter sequences into a single batch entry up to the model's maximum context length, eliminating wasted computation on padding tokens.

Description

Standard batched training of language models requires all sequences in a batch to have the same length. Short sequences are padded to the length of the longest sequence in the batch, and these padding tokens consume computation and memory without contributing to learning. For datasets with highly variable sequence lengths, the proportion of wasted computation can be substantial.

Sequence packing solves this by concatenating multiple sequences end-to-end within a single batch entry. A key challenge is ensuring that the model's self-attention mechanism does not allow tokens from different packed sequences to attend to each other. LLaMA-Factory addresses this through block-diagonal attention masking, which constructs an attention mask that creates isolated attention blocks for each packed sequence.

The implementation uses two complementary components:

  • Packing logic in the data processor, which uses a greedy knapsack algorithm to optimally group sequences so their combined lengths approach but do not exceed the cutoff length.
  • Block-diagonal attention in the model layer, which monkey-patches FlashAttention's _get_unpad_data function to compute per-sequence cumulative sequence lengths (cu_seqlens) and indices, enabling FlashAttention's variable-length (varlen) kernels to handle packed sequences correctly.

The attention mask encodes packing information using integer labels rather than binary values: each position carries an integer indicating which original sequence it belongs to. For example, an attention mask of [1, 1, 2, 2, 2, 0] indicates two packed sequences (lengths 2 and 3) followed by one padding position.

Usage

Sequence packing should be used when:

  • Training on datasets with highly variable sequence lengths (e.g., instruction-tuning datasets where responses range from a few tokens to thousands).
  • GPU utilization is low due to excessive padding in standard batching.
  • Using FlashAttention-2, which natively supports variable-length sequences via its flash_attn_varlen interface.
  • The neat_packing option can be enabled to use int32 attention masks for datasets that require precise sequence boundary tracking.

Theoretical Basis

The core efficiency gain comes from the elimination of padding overhead. Given a batch of B sequences with lengths l1,l2,,lB and a maximum context length L, standard padding wastes:

Wasted tokens=BLi=1Bli

With packing, the utilization rate approaches:

Utilization=i=1BliBL

where BB is the number of packed batch entries, typically much smaller than B.

The greedy knapsack algorithm used for grouping sequences operates as follows:

def greedy_knapsack(numbers: list[int], capacity: int) -> list[list[int]]:
    numbers.sort()  # ascending order for binary search
    knapsacks = []
    while numbers:
        current_knapsack = []
        remaining_capacity = capacity
        while True:
            index = search_for_fit(numbers, remaining_capacity)
            if index == -1:
                break
            remaining_capacity -= numbers[index]
            current_knapsack.append(numbers.pop(index))
        knapsacks.append(current_knapsack)
    return knapsacks

This algorithm uses binary search (bisect) to efficiently find the largest sequence that fits within the remaining capacity of each knapsack, producing a near-optimal packing with O(nlogn) complexity.

For block-diagonal attention, the packed attention mask is transformed into FlashAttention-compatible data structures:

# Extract per-sequence lengths from integer-labeled attention mask
seqlens_in_batch = get_seqlens_in_batch(attention_mask)  # e.g., [2, 3, 1, 2, 3]

# Compute cumulative sequence lengths for varlen attention
cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
# e.g., [0, 2, 5, 6, 8, 11]

The cu_seqlens tensor defines the boundaries of each sequence within the packed representation, enabling FlashAttention's variable-length kernel to apply causal attention independently within each segment while preventing cross-sequence attention.

The sequence length inference function balances source and target truncation to preserve as much information as possible:

Failed to parse (syntax error): {\displaystyle \text{new\_target\_len} = \min\left(\text{max\_target\_len}, \text{target\_len}\right) }

where Failed to parse (syntax error): {\displaystyle \text{max\_target\_len}} is determined by the relative proportion of source and target in the available cutoff length.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment