Heuristic: NVIDIA NeMo Aligner DPO Sequence Packing Tips
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management, DPO |
| Last Updated | 2026-02-07 22:00 GMT |
Overview
Sequence packing technique for DPO training that concatenates multiple training examples into single sequences, eliminating padding waste and improving effective GPU throughput by up to 2-3x.
Description
In standard DPO training, each micro-batch contains one chosen and one rejected response, padded to the maximum sequence length. This wastes GPU compute on padding tokens. Sequence packing concatenates multiple chosen-rejected pairs into a single long sequence (up to `encoder_seq_length`), using attention masks to prevent cross-contamination between packed examples. This eliminates padding overhead and allows processing more examples per step. However, it comes with strict constraints: micro-batch size must be 1 and Transformer Engine must be enabled.
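The packing step itself can be illustrated with a minimal sketch. This is not NeMo Aligner's implementation; it is a greedy first-fit bin-packing of chosen-rejected pair lengths into slots of capacity `encoder_seq_length`, showing why per-example boundaries (and hence per-sample attention masks) are needed.

```python
# Illustrative sketch (NOT the NeMo Aligner implementation): greedy
# first-fit packing of chosen+rejected pair lengths into bins bounded
# by the maximum sequence length.
def pack_pairs(pair_lengths, max_seq_len):
    """pair_lengths: total token count of each chosen+rejected pair."""
    bins = []  # each bin is a list of pair lengths sharing one sequence slot
    for length in pair_lengths:
        if length > max_seq_len:
            raise ValueError(f"pair of length {length} exceeds {max_seq_len}")
        for b in bins:  # first-fit: reuse the first bin with room
            if sum(b) + length <= max_seq_len:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

packs = pack_pairs([512, 900, 300, 2048, 700], max_seq_len=4096)
# The per-bin offsets act as example boundaries; the attention mask is
# built from them so packed examples cannot attend to each other.
```

The boundary offsets recovered from each bin are what make cross-contamination prevention possible: attention is masked to stay within each packed example.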
Usage
Use this heuristic when training DPO models whose sequences are short relative to the maximum sequence length. The speedup scales roughly with the ratio of maximum sequence length to average sequence length; if most sequences are already near the maximum length, packing provides minimal benefit. Prepare packed datasets using the `prepare_packed_dpo_dataset.py` script.
The Insight (Rule of Thumb)
- Action: Enable sequence packing for DPO by running `prepare_packed_dpo_dataset.py` and setting `model.data.data_impl=packed_jsonl` in the DPO config.
- Value: Scale global batch size down: `new_GBS = unpacked_GBS / avg_num_sequences_per_pack`.
- Trade-off: Faster training per step but requires pre-processing dataset, MBS=1 only, and Transformer Engine.
- Constraints:
- Micro-batch size must be 1
- Transformer Engine must be enabled
- Global batch size should be adjusted proportionally
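The batch-size adjustment above can be sketched as a small helper. The function name is mine, not NeMo Aligner's; `avg_num_sequences_per_pack` would come from the packing preprocessing statistics.

```python
# Sketch of the rule of thumb: new_GBS = unpacked_GBS / avg_num_sequences_per_pack.
# Hypothetical helper name; avg_num_sequences_per_pack is reported by the
# dataset preprocessing step.
def scaled_global_batch_size(unpacked_gbs, avg_num_sequences_per_pack):
    # Round to keep the global batch size an integer; in practice it must
    # also stay divisible by the data-parallel size (not checked here).
    return max(1, round(unpacked_gbs / avg_num_sequences_per_pack))

scaled_global_batch_size(128, 4.0)  # -> 32
```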
Reasoning
Packing eliminates wasted compute on padding tokens. For example, if your max sequence length is 4096 but average sequences are 512 tokens, standard batching wastes ~87.5% of compute on padding. Packing fits ~8 examples into each 4096-token slot, achieving near-perfect GPU utilization. The MBS=1 constraint exists because the packing implementation uses per-sample attention masks that are incompatible with batched processing. Transformer Engine is required because it supports the custom attention mask format needed for packed sequences.
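The numbers in the reasoning above follow from simple arithmetic, shown here for max length 4096 and average length 512:

```python
# Worked numbers behind the reasoning above.
max_len, avg_len = 4096, 512

# Fraction of compute spent on padding under standard batching:
padding_waste = 1 - avg_len / max_len   # 1 - 512/4096 = 0.875 (87.5%)

# Average examples that fit into one packed 4096-token slot:
examples_per_pack = max_len // avg_len  # 4096 // 512 = 8
```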
Code Evidence
MBS=1 and Transformer Engine assertions from `nemo_aligner/models/nlp/gpt/megatron_gpt_dpo_model.py:435-440`:

```python
assert (
    batch["input_ids"].shape[0] == 1
), f"Packed sequence is only supported with micro batch size 1,"
assert (
    self.transformer_engine
), "Transformer Engine should be enabled when using sequence packing."
```
Documentation constraints from `docs/user-guide/dpo.rst:153-156`:
1. Sequence packing can only be run with a micro batch size of 1.
2. Sequence packing is supported via Transformer Engine, so be sure to enable Transformer Engine.
3. Sequence packing increases the number of examples processed per global batch. Scale your global batch size accordingly by setting the new global batch size to approximately `unpacked_global_batch_size / avg_num_sequences_per_pack`.