Principle:Axolotl ai cloud Axolotl Sequence Packing

Knowledge Sources	Packing: Towards 2x NLP BERT Speed, Multipack Sampler, Axolotl
Domains	Training_Efficiency, Data_Loading, Optimization
Last Updated	2026-02-06 23:00 GMT

Overview

A batch construction technique that packs multiple variable-length sequences into fixed-capacity bins to maximize GPU utilization and minimize padding waste during training.

Description

Sequence Packing (also called Sample Packing) addresses the fundamental inefficiency of training with variable-length sequences. Without packing, batches are padded to the length of the longest sequence, wasting computation on padding tokens. With packing, multiple shorter sequences are concatenated into single training examples that fill the maximum sequence length, dramatically improving GPU utilization.
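To make the padding cost concrete, here is a minimal sketch that measures the share of token slots spent on padding when each batch is padded to its longest sequence. The function name `padding_fraction` and the sample lengths are illustrative assumptions, not part of Axolotl.

```python
def padding_fraction(lengths, batch_size):
    """Fraction of token slots that hold padding when every batch
    is padded to the length of its longest sequence."""
    real_tokens = sum(lengths)
    total_slots = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every sequence in the batch occupies max(batch) slots
        total_slots += max(batch) * len(batch)
    return 1 - real_tokens / total_slots


# Hypothetical token counts for eight training examples
lengths = [512, 64, 128, 32, 480, 16, 256, 48]
waste = padding_fraction(lengths, batch_size=4)
print(f"padding fraction: {waste:.1%}")
```

With these made-up lengths, roughly 61% of the slots are padding, which is the kind of waste packing is meant to recover.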

The key challenge is the bin packing problem: given sequences of varying lengths, how should they be assigned to fixed-capacity bins (batches) so that wasted space is minimized? Axolotl uses a First Fit Decreasing (FFD) algorithm variant that sorts sequences by length in descending order and greedily assigns each one to the first bin with sufficient remaining capacity.

Sequence packing can provide a 1.5-3x training speedup, depending on the length distribution of the training data, with minimal impact on training quality when proper attention masking is used.

Usage

Use sequence packing when:

Training data has highly variable sequence lengths
GPU utilization is low due to excessive padding
Training throughput needs to be maximized
The dataset is large enough that packing efficiency matters

Theoretical Basis

Bin Packing Problem:

Given sequences of lengths $l_{1}, l_{2}, \ldots, l_{n}$ and bin capacity $C$ (max sequence length), find an assignment that minimizes the number of bins used.

First Fit Decreasing (FFD) Algorithm:

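# FFD bin packing as runnable Python (illustrative sketch, not Axolotl's implementation)
# Assumes every length is <= max_seq_len
def ffd_pack(lengths, max_seq_len):
    # Visit sequence indices from longest to shortest
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins = []       # each bin is a list of sequence indices
    remaining = []  # remaining[b] is the free capacity of bins[b]
    for i in order:
        for b in range(len(bins)):
            if remaining[b] >= lengths[i]:
                # The first bin with enough room takes the sequence
                bins[b].append(i)
                remaining[b] -= lengths[i]
                break
        else:
            # No existing bin fits, so open a new one
            bins.append([i])
            remaining.append(max_seq_len - lengths[i])
    return bins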

Packing Efficiency: $\text{efficiency} = \frac{\sum_{i=1}^{n} l_{i}}{\text{num\_bins} \times C}$

Typical packing efficiency: 85-98% (vs 30-60% without packing for variable-length data).
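As a worked instance of the efficiency ratio (total sequence length divided by the number of bins times the capacity), the sketch below evaluates it for a hand-built packing. The helper `packing_efficiency` and the bin contents are hypothetical and chosen only for illustration.

```python
def packing_efficiency(bins, capacity):
    """Ratio of real tokens to total bin capacity.

    bins: list of bins, each a list of sequence lengths.
    capacity: bin capacity C (max sequence length).
    """
    if not bins:
        return 0.0
    used = sum(sum(b) for b in bins)
    return used / (len(bins) * capacity)


# Three hypothetical bins of capacity 512, each holding one or more sequences
bins = [[500], [480, 30], [256, 128, 64, 48]]
eff = packing_efficiency(bins, capacity=512)
print(f"packing efficiency: {eff:.1%}")
```

Here 1506 of 1536 slots hold real tokens, so the efficiency is about 98%.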

Attention Masking: Packed sequences require block-diagonal attention masks to prevent cross-contamination between sequences within the same bin.
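A block-diagonal mask can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not Axolotl's implementation: the function name `packed_attention_mask` is hypothetical, trailing padding positions attend to nothing, and the optional `causal` flag (an addition for decoder-style training) further restricts each position to earlier positions within its own sequence.

```python
def packed_attention_mask(lengths, capacity, causal=False):
    """Return a capacity x capacity boolean matrix where mask[q][k] is True
    if query position q may attend to key position k."""
    if sum(lengths) > capacity:
        raise ValueError("sequences do not fit in the bin")
    # Label each position with the 1-based index of its sequence; 0 marks padding
    seq_ids = []
    for idx, n in enumerate(lengths, start=1):
        seq_ids.extend([idx] * n)
    seq_ids.extend([0] * (capacity - len(seq_ids)))

    mask = []
    for q in range(capacity):
        row = []
        for k in range(capacity):
            # Attention is allowed only inside the same non-padding sequence
            same_seq = seq_ids[q] != 0 and seq_ids[q] == seq_ids[k]
            row.append(same_seq and (k <= q or not causal))
        mask.append(row)
    return mask


# Two sequences of lengths 2 and 3 packed into a bin of capacity 6
mask = packed_attention_mask([2, 3], capacity=6)
causal_mask = packed_attention_mask([2, 3], capacity=6, causal=True)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

The printed grid shows one 2x2 block and one 3x3 block on the diagonal, with the final padding row and column empty, so no token sees a token from a different sequence.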

Related Pages

Implemented By

Implementation:Axolotl_ai_cloud_Axolotl_MultipackBatchSampler

Uses Heuristic

Heuristic:Axolotl_ai_cloud_Axolotl_Sample_Packing_Best_Practices
