Heuristic: Hugging Face Alignment Handbook Sequence Packing Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Enable sequence packing with the First Fit Decreasing (FFD) strategy for maximum GPU utilization when training on variable-length sequences.
Description
Sequence packing concatenates multiple short sequences into a single training example, eliminating padding waste and maximizing GPU utilization. The alignment-handbook supports two modes: standard packing (simple concatenation into fixed-size chunks) and FFD (First Fit Decreasing), a bin-packing heuristic that groups sequences by length to minimize wasted tokens. The SmolLM3 SFT recipe uses FFD packing with `max_length: 65536`, combined with `assistant_only_loss` to compute loss only on assistant responses.
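To make the FFD idea concrete, here is a minimal self-contained sketch of the bin-packing heuristic (not the library's actual implementation; the function name and data shapes are illustrative): sort sequences longest-first, then place each into the first bin with enough remaining capacity.

```python
def ffd_pack(seq_lengths, max_length):
    """Group sequence lengths into bins of capacity max_length using
    First Fit Decreasing: sort descending, place each length into the
    first bin that still has room, else open a new bin."""
    bins = []  # each bin: [remaining_capacity, [lengths...]]
    for length in sorted(seq_lengths, reverse=True):
        if length > max_length:
            raise ValueError(f"sequence of length {length} exceeds max_length")
        for b in bins:
            if b[0] >= length:  # first bin that fits
                b[0] -= length
                b[1].append(length)
                break
        else:
            bins.append([max_length - length, [length]])
    return [b[1] for b in bins]

packed = ffd_pack([500, 1200, 300, 900, 100], max_length=1500)
# Each packed group's total stays within max_length.
```

Sorting longest-first means the hardest-to-place sequences are handled while bins are still empty, which is why FFD leaves less leftover space than naive first-fit on shuffled data.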
Usage
Apply this when training on datasets with highly variable sequence lengths (e.g., multi-task datasets mixing short conversations and long reasoning traces). FFD packing is most beneficial for large-scale SFT with many diverse splits.
The Insight (Rule of Thumb)
- Action: Set `packing: true` and `packing_strategy: ffd` in the training config. Combine with `per_device_train_batch_size: 1` and increased `gradient_accumulation_steps`.
- Value: FFD packing can provide 2-4x training throughput improvement over padded batches on variable-length datasets.
- Trade-off: Packing requires compatible loss masking (e.g., assistant_only_loss) to avoid cross-contamination between packed sequences.
Reasoning
Without packing, every sequence in a batch is padded to the longest sequence (up to max_length), wasting significant compute on padding tokens. Standard packing concatenates sequences into fixed-size chunks but may split a conversation across chunks. FFD groups sequences by length to minimize leftover space in each packed example.
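A quick back-of-the-envelope comparison shows where the throughput gain comes from (the sequence lengths and max_length here are hypothetical, chosen only to illustrate the arithmetic):

```python
import math

# Hypothetical batch of variable-length sequences (in tokens).
lengths = [128, 4096, 256, 2048, 512]
max_length = 4096

# Padded batching: every sequence occupies a full max_length slot.
real_tokens = sum(lengths)                       # 7040 useful tokens
padded_tokens = len(lengths) * max_length        # 20480 slots consumed
utilization_padded = real_tokens / padded_tokens  # ~34% useful compute

# Ideal packing: sequences share slots; only bin leftovers are wasted.
bins_needed = math.ceil(real_tokens / max_length)
utilization_packed = real_tokens / (bins_needed * max_length)  # ~86%
```

On this toy batch, packing lifts utilization from roughly 34% to 86%, consistent with the 2-4x throughput range quoted above; the exact gain depends on how skewed the length distribution is.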
SmolLM3 SFT config from `recipes/smollm3/sft/sft.yaml:211-212`:
```yaml
packing: true
packing_strategy: ffd
```
Combined with assistant-only loss from `recipes/smollm3/sft/sft.yaml:193`:
```yaml
assistant_only_loss: true
```
Batch size settings optimized for packing from `recipes/smollm3/sft/sft.yaml:198,218`:
```yaml
gradient_accumulation_steps: 2
per_device_train_batch_size: 1
```
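With packing, each "example" is already a full max_length window, so the effective token budget per optimizer step follows directly from these settings. A small sanity-check computation (the device count is an assumption for illustration; the recipe does not fix it here):

```python
# Settings from the SFT recipe above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
max_length = 65536

# Hypothetical GPU count, not specified by the recipe snippet.
num_devices = 8

tokens_per_step = (per_device_train_batch_size
                   * gradient_accumulation_steps
                   * num_devices
                   * max_length)
# With FFD packing, nearly all of these tokens are real data, not padding.
```

This is why `per_device_train_batch_size: 1` suffices: one packed example already carries tens of thousands of tokens, and gradient accumulation scales the step size without inflating memory.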
Standard packing (without FFD) is also used in the mid-training recipe, `recipes/smollm3/sft/mid.yaml:47`:
```yaml
packing: true
```
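For contrast with FFD, standard packing can be sketched as a simple "wrap" of the token stream: concatenate everything and split into fixed-size chunks. This is a hypothetical illustration of the general technique, not the library's code; note how a sequence can be cut across chunk boundaries, which is the mid-conversation truncation mentioned above.

```python
def wrapped_pack(sequences, max_length):
    """Standard packing sketch: flatten all token sequences into one
    stream, then slice it into fixed-size chunks of max_length tokens.
    Sequences may be split across chunk boundaries."""
    stream = [tok for seq in sequences for tok in seq]
    return [stream[i:i + max_length] for i in range(0, len(stream), max_length)]

chunks = wrapped_pack([[1, 2, 3], [4, 5], [6]], max_length=4)
# The second sequence [4, 5] is split across the two chunks.
```

FFD avoids this splitting by keeping each sequence whole inside a bin, at the cost of some leftover space per bin.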