Principle:Axolotl ai cloud Axolotl Sequence Packing

From Leeroopedia


Knowledge Sources
Domains Training_Efficiency, Data_Loading, Optimization
Last Updated 2026-02-06 23:00 GMT

Overview

A batch construction technique that packs multiple variable-length sequences into fixed-capacity bins to maximize GPU utilization and minimize padding waste during training.

Description

Sequence Packing (also called Sample Packing) addresses a fundamental inefficiency of training with variable-length sequences. Without packing, every sequence in a batch is padded to the length of the longest one, so computation is wasted on padding tokens. With packing, multiple shorter sequences are concatenated into single training examples that fill as much of the maximum sequence length as possible, so far fewer of the processed tokens are padding and GPU utilization improves.
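To make the padding cost concrete, the following minimal sketch counts how many tokens a padded batch processes and what fraction of them is padding. The function names and the sequence lengths are made up for illustration and are not taken from Axolotl.

```python
def padded_tokens(lengths):
    """Tokens processed when every sequence is padded to the longest one."""
    return max(lengths) * len(lengths)


def padding_fraction(lengths):
    """Fraction of the processed tokens that are padding."""
    total = padded_tokens(lengths)
    return (total - sum(lengths)) / total


# Hypothetical batch of four sequences (lengths in tokens)
lengths = [512, 128, 64, 320]
print(padded_tokens(lengths))     # 2048
print(sum(lengths))               # 1024
print(padding_fraction(lengths))  # 0.5
```

In this made-up batch half of the processed tokens are padding, which is the waste that packing aims to recover.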

The key challenge is the bin packing problem: given sequences of varying lengths, how to assign them to fixed-capacity bins (packed examples of at most the maximum sequence length) so that as little space as possible is wasted. Axolotl uses a variant of the First Fit Decreasing (FFD) algorithm, which sorts sequences by length in descending order and greedily assigns each one to the first bin with sufficient remaining capacity.

Sequence packing typically yields a 1.5-3x training speedup, depending on the length distribution of the training data, with minimal impact on training quality provided that attention is masked so packed sequences cannot attend to one another.

Usage

Use sequence packing when:

  • Training data has highly variable sequence lengths
  • GPU utilization is low due to excessive padding
  • Training throughput needs to be maximized
  • The dataset is large enough that packing efficiency matters

Theoretical Basis

Bin Packing Problem:

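Given sequences of lengths $l_1, l_2, \ldots, l_n$ and bin capacity $C$ (the maximum sequence length), find an assignment of sequences to bins that minimizes the number of bins used, subject to the total length in each bin not exceeding $C$.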
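A useful sanity check for any packing is the trivial lower bound on the bin count: the total number of tokens divided by the capacity, rounded up. The sketch below computes it; the function name and the lengths are hypothetical, and the bound is not always achievable by an actual packing.

```python
def min_bins_lower_bound(lengths, capacity):
    """No packing can use fewer bins than ceil(sum(lengths) / capacity)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if any(n > capacity for n in lengths):
        raise ValueError("a sequence exceeds the bin capacity")
    return -(-sum(lengths) // capacity)  # ceiling division on integers


# Hypothetical sequence lengths, 3500 tokens in total
lengths = [700, 300, 600, 400, 900, 100, 500]
print(min_bins_lower_bound(lengths, 1024))  # 4
```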

First Fit Decreasing (FFD) Algorithm:

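# First Fit Decreasing (FFD) bin packing over sequence lengths
def pack_ffd(lengths, max_seq_len):
    bins = []       # each bin is a list of sequence lengths
    remaining = []  # remaining[i] is the free capacity of bins[i]
    for length in sorted(lengths, reverse=True):
        if length > max_seq_len:
            raise ValueError("sequence is longer than the bin capacity")
        for i, free in enumerate(remaining):
            if free >= length:
                # The first bin with enough room takes the sequence
                bins[i].append(length)
                remaining[i] -= length
                break
        else:
            # No existing bin fits, so open a new one
            bins.append([length])
            remaining.append(max_seq_len - length)
    return bins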

Packing Efficiency:

$$\text{efficiency} = \frac{\sum_{i=1}^{n} l_i}{N_{\text{bins}} \cdot C}$$

where $N_{\text{bins}}$ is the number of bins used.

Typical packing efficiency: 85-98% (vs 30-60% without packing for variable-length data).
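As a worked example of the efficiency ratio, the sketch below compares a hypothetical packing of seven sequences into four bins against the baseline of one sequence per row, each padded to the full capacity. The bin contents and the function name are illustrative assumptions, not output from Axolotl.

```python
def packing_efficiency(bins, capacity):
    """Fraction of the total bin capacity filled with real (non-padding) tokens."""
    if not bins:
        raise ValueError("need at least one bin")
    used = sum(sum(b) for b in bins)
    return used / (len(bins) * capacity)


capacity = 1024
# Hypothetical packing: each inner list holds the lengths placed in one bin
packed = [[900, 100], [700, 300], [600, 400], [500]]
# Baseline: every sequence gets its own row, padded to the full capacity
unpacked = [[n] for b in packed for n in b]

print(f"{packing_efficiency(packed, capacity):.3f}")    # 0.854
print(f"{packing_efficiency(unpacked, capacity):.3f}")  # 0.488
```

The baseline here pads to the maximum sequence length; padding only to the longest sequence in each batch would give a different, usually higher, unpacked figure.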

Attention Masking: Packed sequences require block-diagonal attention masks so that each token attends only to tokens from its own sequence, preventing cross-contamination between sequences within the same bin.
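The block-diagonal structure can be illustrated with a small pure-Python sketch. It assumes a decoder-only model, so causal masking is applied inside each block as well; that causal part is an assumption of this example, and the function name is hypothetical. It is not Axolotl's implementation, which may represent sequence boundaries differently.

```python
def block_diagonal_causal_mask(seq_lengths):
    """Return mask[q][k] == True when query token q may attend to key token k."""
    # Label every token position in the packed example with its sequence index
    seq_ids = [sid for sid, n in enumerate(seq_lengths) for _ in range(n)]
    total = len(seq_ids)
    return [
        [seq_ids[q] == seq_ids[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


# Two sequences of lengths 2 and 3 packed into one example of 5 tokens
mask = block_diagonal_causal_mask([2, 3])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# 10000
# 11000
# 00100
# 00110
# 00111
```

The zeros in the lower-left block are what keep the second sequence from attending to the first.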
