Heuristic: Hugging Face Alignment Handbook Sequence Packing Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Enable sequence packing with the First Fit Decreasing (FFD) strategy for maximum GPU utilization when training on variable-length sequences.
Description
Sequence packing concatenates multiple short sequences into a single training example, eliminating padding waste and maximizing GPU utilization. The alignment-handbook supports two modes: standard packing (simple concatenation into fixed-size chunks) and FFD (First Fit Decreasing), a bin-packing heuristic that groups sequences by length to minimize wasted tokens. The SmolLM3 SFT recipe uses FFD packing with `max_length: 65536`, combined with `assistant_only_loss` to compute loss only on assistant responses.
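To make the FFD idea concrete, here is a minimal self-contained sketch of the bin-packing heuristic (not the library's actual implementation; the function name and data shapes are illustrative): sort sequences longest-first, then place each into the first bin with enough remaining capacity.

```python
def ffd_pack(seq_lengths, max_length):
    """Group sequence lengths into bins of capacity max_length using
    First Fit Decreasing: sort descending, place each length into the
    first bin that still has room, else open a new bin."""
    bins = []  # each bin: [remaining_capacity, [lengths...]]
    for length in sorted(seq_lengths, reverse=True):
        if length > max_length:
            raise ValueError(f"sequence of length {length} exceeds max_length")
        for b in bins:
            if b[0] >= length:  # first bin that fits
                b[0] -= length
                b[1].append(length)
                break
        else:
            bins.append([max_length - length, [length]])
    return [b[1] for b in bins]

packed = ffd_pack([500, 1200, 300, 900, 100], max_length=1500)
# Each packed group's total stays within max_length.
```

Sorting longest-first means the hardest-to-place sequences are handled while bins are still empty, which is why FFD leaves less leftover space than naive first-fit on shuffled data.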
Usage
Apply this when training on datasets with highly variable sequence lengths (e.g., multi-task datasets mixing short conversations and long reasoning traces). FFD packing is most beneficial for large-scale SFT with many diverse splits.
The Insight (Rule of Thumb)
- Action: Set `packing: true` and `packing_strategy: ffd` in the training config. Combine with `per_device_train_batch_size: 1` and increased `gradient_accumulation_steps`.
- Value: FFD packing can provide 2-4x training throughput improvement over padded batches on variable-length datasets.
- Trade-off: Packing requires compatible loss masking (e.g., assistant_only_loss) to avoid cross-contamination between packed sequences.
Reasoning
Without packing, every sequence in a batch is padded to the longest sequence (up to max_length), wasting significant compute on padding tokens. Standard packing concatenates sequences into fixed-size chunks but may split a conversation across chunks. FFD groups sequences by length to minimize leftover space in each packed example.
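A quick back-of-the-envelope comparison shows where the throughput gain comes from (the sequence lengths and max_length here are hypothetical, chosen only to illustrate the arithmetic):

```python
import math

# Hypothetical batch of variable-length sequences (in tokens).
lengths = [128, 4096, 256, 2048, 512]
max_length = 4096

# Padded batching: every sequence occupies a full max_length slot.
real_tokens = sum(lengths)                       # 7040 useful tokens
padded_tokens = len(lengths) * max_length        # 20480 slots consumed
utilization_padded = real_tokens / padded_tokens  # ~34% useful compute

# Ideal packing: sequences share slots; only bin leftovers are wasted.
bins_needed = math.ceil(real_tokens / max_length)
utilization_packed = real_tokens / (bins_needed * max_length)  # ~86%
```

On this toy batch, packing lifts utilization from roughly 34% to 86%, consistent with the 2-4x throughput range quoted above; the exact gain depends on how skewed the length distribution is.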
SmolLM3 SFT config from `recipes/smollm3/sft/sft.yaml:211-212`:
```yaml
packing: true
packing_strategy: ffd
```
Combined with assistant-only loss from `recipes/smollm3/sft/sft.yaml:193`:
```yaml
assistant_only_loss: true
```
Batch size settings optimized for packing from `recipes/smollm3/sft/sft.yaml:198,218`:
```yaml
gradient_accumulation_steps: 2
per_device_train_batch_size: 1
```
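With packing, each "example" is already a full max_length window, so the effective token budget per optimizer step follows directly from these settings. A small sanity-check computation (the device count is an assumption for illustration; the recipe does not fix it here):

```python
# Settings from the SFT recipe above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
max_length = 65536

# Hypothetical GPU count, not specified by the recipe snippet.
num_devices = 8

tokens_per_step = (per_device_train_batch_size
                   * gradient_accumulation_steps
                   * num_devices
                   * max_length)
# With FFD packing, nearly all of these tokens are real data, not padding.
```

This is why `per_device_train_batch_size: 1` suffices: one packed example already carries tens of thousands of tokens, and gradient accumulation scales the step size without inflating memory.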
Standard packing (without FFD) is also used in the mid-training recipe, `recipes/smollm3/sft/mid.yaml:47`:
```yaml
packing: true
```
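For contrast with FFD, standard packing can be sketched as a simple "wrap" of the token stream: concatenate everything and split into fixed-size chunks. This is a hypothetical illustration of the general technique, not the library's code; note how a sequence can be cut across chunk boundaries, which is the mid-conversation truncation mentioned above.

```python
def wrapped_pack(sequences, max_length):
    """Standard packing sketch: flatten all token sequences into one
    stream, then slice it into fixed-size chunks of max_length tokens.
    Sequences may be split across chunk boundaries."""
    stream = [tok for seq in sequences for tok in seq]
    return [stream[i:i + max_length] for i in range(0, len(stream), max_length)]

chunks = wrapped_pack([[1, 2, 3], [4, 5], [6]], max_length=4)
# The second sequence [4, 5] is split across the two chunks.
```

FFD avoids this splitting by keeping each sequence whole inside a bin, at the cost of some leftover space per bin.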