
Principle:Allenai Open instruct Padding Free Training

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Deep Learning, Systems Optimization
Last Updated 2026-02-07 00:00 GMT

Overview

Padding-free training is an optimization technique that concatenates multiple training examples into a single sequence without padding tokens, using cumulative sequence lengths to demarcate example boundaries, thereby increasing GPU utilization and training throughput.

Description

In standard batched training, sequences of different lengths within a batch are padded to the length of the longest sequence. This wastes computation and memory on padding tokens that do not contribute to the loss. For LLM training, where sequences can vary dramatically in length, the wasted computation can be substantial.

Padding-free training (also called "packing") eliminates this waste by concatenating all examples in a batch into a single long sequence. To prevent cross-contamination between examples during attention computation, the collator provides:

Cumulative sequence lengths (cu_seq_lens): An array indicating where each original example begins and ends within the concatenated sequence. Flash Attention 2 natively supports this format, computing attention independently within each subsequence without cross-attention between examples.

Position IDs: Each example gets its own set of position IDs starting from 0, so the model's rotary positional embeddings (RoPE) treat each example as independent.

Sequence indices (seq_idx): An integer tensor mapping each token to its originating example index, used by some attention implementations.

Separator tokens: A separator ID (default -100, which is also the loss ignore index) is inserted at the boundary between consecutive examples in the labels tensor to prevent the model from being trained to predict the first token of the next example given the last token of the previous one.
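The four collator outputs above can be sketched in plain Python. This is a minimal illustration operating on token-id lists; the field names follow the description above, but `padding_free_collate` is a hypothetical helper, not the actual open-instruct implementation:

```python
def padding_free_collate(examples, ignore_index=-100):
    """Concatenate token-id lists into one packed sequence.

    Returns (as plain lists) the per-token fields described above:
    input_ids, position_ids, cu_seq_lens, seq_idx, labels.
    """
    input_ids, position_ids, seq_idx, labels = [], [], [], []
    cu_seq_lens = [0]
    for i, toks in enumerate(examples):
        input_ids.extend(toks)
        position_ids.extend(range(len(toks)))  # positions restart per example
        seq_idx.extend([i] * len(toks))        # map each token to its example
        # separator at the boundary: the model is never trained to predict
        # the first token of an example from the previous example
        labels.extend([ignore_index] + toks[1:])
        cu_seq_lens.append(cu_seq_lens[-1] + len(toks))
    return {
        "input_ids": input_ids,
        "position_ids": position_ids,
        "cu_seq_lens": cu_seq_lens,
        "seq_idx": seq_idx,
        "labels": labels,
    }
```

For a batch of two examples `[5, 6, 7]` and `[8, 9]`, this yields a single 5-token sequence with `cu_seq_lens = [0, 3, 5]` and labels `[-100, 6, 7, -100, 9]`.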

Usage

Use padding-free training when:

  • Training batch sizes are greater than 1 per GPU
  • Sequence lengths vary significantly within the dataset
  • Flash Attention 2 is available (required for correct boundary handling)
  • The model architecture supports padding-free inputs (e.g., LlamaForCausalLM, BambaForCausalLM)

Theoretical Basis

Padding waste: Given a batch of B sequences with lengths l_1, l_2, ..., l_B:

Standard (padded):
  Total tokens processed = B * max(l_1, ..., l_B)
  Useful tokens = sum(l_1, ..., l_B)
  Waste ratio = 1 - sum(l_i) / (B * max(l_i))

Padding-free (packed):
  Total tokens processed = sum(l_1, ..., l_B)
  Useful tokens = sum(l_1, ..., l_B)
  Waste ratio = 0

For a typical instruction-tuning dataset with high length variance, the waste ratio can exceed 50%.
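The waste ratio above is easy to compute directly. The lengths below are hypothetical but representative of the length variance seen in instruction-tuning data:

```python
def waste_ratio(lengths):
    # fraction of processed tokens that are padding in a standard padded batch:
    # 1 - sum(l_i) / (B * max(l_i))
    return 1 - sum(lengths) / (len(lengths) * max(lengths))

# one long example dominates three short ones
lengths = [512, 128, 64, 32]
print(waste_ratio(lengths))  # 0.640625 -> ~64% of computation wasted on padding
```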

Cumulative sequence lengths: Flash Attention uses cu_seq_lens to determine attention boundaries:

cu_seq_lens = [0, l_1, l_1 + l_2, ..., l_1 + l_2 + ... + l_B]

For attention at position p within example i:
  start = cu_seq_lens[i]
  end   = cu_seq_lens[i+1]
  attention is computed only over positions [start, end)
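The boundary lookup above can be sketched as follows (hypothetical helper names; Flash Attention performs the equivalent indexing inside its varlen kernels):

```python
from itertools import accumulate

def make_cu_seq_lens(lengths):
    # prefix sums with a leading 0: [0, l_1, l_1+l_2, ..., sum(l_i)]
    return [0] + list(accumulate(lengths))

def attention_span(cu_seq_lens, i):
    # half-open [start, end) range that example i attends over
    return cu_seq_lens[i], cu_seq_lens[i + 1]

cu = make_cu_seq_lens([3, 5, 2])  # -> [0, 3, 8, 10]
start, end = attention_span(cu, 1)  # example 1 attends over [3, 8)
```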

Position encoding reset: For each example i, positions are reset:

position_ids = concat([
    arange(0, l_1),
    arange(0, l_2),
    ...,
    arange(0, l_B)
])

This ensures RoPE embeddings correctly encode within-example positions.
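The concatenation of per-example `arange` calls reduces to a short helper (a sketch; real collators build this as a tensor rather than a list):

```python
def packed_position_ids(lengths):
    # positions restart at 0 for each example, so RoPE encodes
    # within-example positions only
    return [p for length in lengths for p in range(length)]

packed_position_ids([3, 2])  # [0, 1, 2, 0, 1]
```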

Label boundary handling: A separator token (-100) is placed at the start of each example's labels to prevent cross-example token prediction:

labels = [-100, tok_1[1:], -100, tok_2[1:], ..., -100, tok_B[1:]]

Positions labeled -100 are ignored by PyTorch's CrossEntropyLoss, whose default ignore_index is -100, so boundary positions contribute nothing to the loss or gradients.
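The masking semantics can be illustrated without PyTorch: only non-separator positions contribute to the loss (a plain-Python sketch of the `ignore_index` behavior, not the actual CrossEntropyLoss code):

```python
def loss_positions(labels, ignore_index=-100):
    # indices that would contribute to CrossEntropyLoss with this ignore_index;
    # separator positions at example boundaries are excluded
    return [i for i, label in enumerate(labels) if label != ignore_index]

# packed labels for two examples [5, 6, 7] and [8, 9]
labels = [-100, 6, 7, -100, 9]
loss_positions(labels)  # [1, 2, 4] -- boundary positions 0 and 3 are masked out
```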

Related Pages

Implemented By
