Heuristic: NVIDIA NeMo Aligner DPO Sequence Packing Tips
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management, DPO |
| Last Updated | 2026-02-07 22:00 GMT |
Overview
Sequence packing technique for DPO training that concatenates multiple training examples into single sequences, eliminating padding waste and improving effective GPU throughput by up to 2-3x.
Description
In standard DPO training, each micro-batch contains one chosen and one rejected response, padded to the maximum sequence length. This wastes GPU compute on padding tokens. Sequence packing concatenates multiple chosen-rejected pairs into a single long sequence (up to `encoder_seq_length`), using attention masks to prevent cross-contamination between packed examples. This eliminates padding overhead and allows processing more examples per step. However, it comes with strict constraints: micro-batch size must be 1 and Transformer Engine must be enabled.
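The packing step itself can be illustrated with a minimal sketch. This is not NeMo Aligner's implementation; it is a greedy first-fit bin-packing of chosen-rejected pair lengths into slots of capacity `encoder_seq_length`, showing why per-example boundaries (and hence per-sample attention masks) are needed.

```python
# Illustrative sketch (NOT the NeMo Aligner implementation): greedy
# first-fit packing of chosen+rejected pair lengths into bins bounded
# by the maximum sequence length.
def pack_pairs(pair_lengths, max_seq_len):
    """pair_lengths: total token count of each chosen+rejected pair."""
    bins = []  # each bin is a list of pair lengths sharing one sequence slot
    for length in pair_lengths:
        if length > max_seq_len:
            raise ValueError(f"pair of length {length} exceeds {max_seq_len}")
        for b in bins:  # first-fit: reuse the first bin with room
            if sum(b) + length <= max_seq_len:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

packs = pack_pairs([512, 900, 300, 2048, 700], max_seq_len=4096)
# The per-bin offsets act as example boundaries; the attention mask is
# built from them so packed examples cannot attend to each other.
```

The boundary offsets recovered from each bin are what make cross-contamination prevention possible: attention is masked to stay within each packed example.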
Usage
Use this heuristic when training DPO models whose sequences are short relative to the maximum sequence length. The speedup scales roughly with the ratio of maximum sequence length to average sequence length; if most sequences are already near the maximum length, packing provides minimal benefit. Prepare packed datasets using the `prepare_packed_dpo_dataset.py` script.
The Insight (Rule of Thumb)
- Action: Enable sequence packing for DPO by running `prepare_packed_dpo_dataset.py` and setting `model.data.data_impl=packed_jsonl` in the DPO config.
- Value: Scale global batch size down: `new_GBS = unpacked_GBS / avg_num_sequences_per_pack`.
- Trade-off: Faster training per step but requires pre-processing dataset, MBS=1 only, and Transformer Engine.
- Constraints:
- Micro-batch size must be 1
- Transformer Engine must be enabled
- Global batch size should be adjusted proportionally
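The batch-size adjustment above can be sketched as a small helper. The function name is mine, not NeMo Aligner's; `avg_num_sequences_per_pack` would come from the packing preprocessing statistics.

```python
# Sketch of the rule of thumb: new_GBS = unpacked_GBS / avg_num_sequences_per_pack.
# Hypothetical helper name; avg_num_sequences_per_pack is reported by the
# dataset preprocessing step.
def scaled_global_batch_size(unpacked_gbs, avg_num_sequences_per_pack):
    # Round to keep the global batch size an integer; in practice it must
    # also stay divisible by the data-parallel size (not checked here).
    return max(1, round(unpacked_gbs / avg_num_sequences_per_pack))

scaled_global_batch_size(128, 4.0)  # -> 32
```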
Reasoning
Packing eliminates wasted compute on padding tokens. For example, if your max sequence length is 4096 but average sequences are 512 tokens, standard batching wastes ~87.5% of compute on padding. Packing fits ~8 examples into each 4096-token slot, achieving near-perfect GPU utilization. The MBS=1 constraint exists because the packing implementation uses per-sample attention masks that are incompatible with batched processing. Transformer Engine is required because it supports the custom attention mask format needed for packed sequences.
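The numbers in the reasoning above follow from simple arithmetic, shown here for max length 4096 and average length 512:

```python
# Worked numbers behind the reasoning above.
max_len, avg_len = 4096, 512

# Fraction of compute spent on padding under standard batching:
padding_waste = 1 - avg_len / max_len   # 1 - 512/4096 = 0.875 (87.5%)

# Average examples that fit into one packed 4096-token slot:
examples_per_pack = max_len // avg_len  # 4096 // 512 = 8
```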
Code Evidence
MBS=1 and Transformer Engine assertions from `nemo_aligner/models/nlp/gpt/megatron_gpt_dpo_model.py:435-440`:

```python
assert (
    batch["input_ids"].shape[0] == 1
), f"Packed sequence is only supported with micro batch size 1,"
assert (
    self.transformer_engine
), "Transformer Engine should be enabled when using sequence packing."
```
Documentation constraints from `docs/user-guide/dpo.rst:153-156`:
1. Sequence packing can only be run with a micro batch size of 1.
2. Sequence packing is supported via Transformer Engine, so be sure to enable Transformer Engine.
3. Sequence packing increases the number of examples processed per global batch. Scale your global batch size accordingly by setting the new global batch size to approximately `unpacked_global_batch_size / avg_num_sequences_per_pack`.