Principle:NVIDIA NeMo Aligner DPO Sequence Packing
| Principle: DPO Sequence Packing | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | Data_Engineering, Performance_Optimization |
| Related | Implementation:NVIDIA_NeMo_Aligner_Prepare_Packed_DPO_Dataset |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Preprocessing technique that concatenates multiple short preference pairs into fixed-length packed sequences for improved GPU utilization.
Description
When training DPO on datasets with many short examples, naive padding wastes significant GPU memory and compute on padding tokens. Sequence packing solves this by concatenating multiple chosen-rejected pairs into single sequences of a target length, using bin-packing algorithms to minimize wasted space.
The packed format includes sequence boundaries (`cu_seqlens`) so the model can apply attention masks that prevent cross-example attention. This ensures that tokens from one preference pair do not attend to tokens from a different pair packed into the same sequence.
The packing process involves:
- Histogram construction of sequence lengths across the dataset
- Bin-packing algorithm to assign sequences to fixed-capacity bins
- Concatenation of assigned sequences with boundary markers
- Output as NPY files consumed by `DPOPackedDataset` during training
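The concatenation-with-boundary-markers step can be sketched as follows. This is an illustrative helper, not NeMo Aligner's actual code; it assumes tokenized examples are plain lists of token IDs:

```python
import numpy as np

def pack_bin(sequences):
    """Concatenate the sequences assigned to one bin and record boundaries.

    Returns the packed token array plus cu_seqlens: cumulative sequence
    lengths, so example k occupies positions
    [cu_seqlens[k], cu_seqlens[k + 1]) of the packed sequence.
    """
    packed_tokens = np.concatenate([np.asarray(s) for s in sequences])
    lengths = [len(s) for s in sequences]
    cu_seqlens = np.concatenate([[0], np.cumsum(lengths)])
    return packed_tokens, cu_seqlens

# Three short examples packed into one sequence of length 9
tokens, cu = pack_bin([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
# cu is [0, 3, 5, 9]
```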
Usage
Use as an optional preprocessing step before DPO training when examples are significantly shorter than the maximum sequence length.
- Run `prepare_packed_dpo_dataset.py` before training.
- The training config must set `data_impl` to `"packed_jsonl"` to use packed data.
- Most beneficial when the ratio of average sequence length to max sequence length is low (e.g., many sequences under 512 tokens with a 2048 max).
- Not recommended when sequences already approach the maximum length, as packing overhead outweighs padding savings.
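In the training config, the switch looks roughly like the YAML fragment below. Only `data_impl: packed_jsonl` comes from this page; the surrounding keys and the path key are illustrative, so check the actual DPO config schema:

```yaml
model:
  data:
    data_impl: packed_jsonl        # select the packed dataset loader
    # illustrative key: path to the NPY output of prepare_packed_dpo_dataset.py
    packed_data_path: /data/dpo_packed
```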
Theoretical Basis
The core problem is a variant of the bin-packing problem: given N sequences of varying lengths, pack them into bins of fixed capacity to minimize the number of bins (and thus wasted padding).
Given:
sequences S_1, S_2, ..., S_N with lengths l_1, l_2, ..., l_N
bin capacity C (target packed sequence length)
Goal:
Assign sequences to bins such that:
- sum of lengths in each bin <= C
- number of bins is minimized
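A minimal first-fit-decreasing sketch of this objective (a common heuristic for bin packing, shown here for illustration rather than as NeMo Aligner's implementation):

```python
def first_fit_decreasing(lengths, capacity):
    """Greedy bin packing: place each length (largest first) into the
    first bin with enough remaining room, opening a new bin if none fits."""
    bins = []        # each bin holds the lengths assigned to it
    remaining = []   # free capacity per bin
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError(f"length {length} exceeds bin capacity {capacity}")
        for i, free in enumerate(remaining):
            if length <= free:
                bins[i].append(length)
                remaining[i] -= length
                break
        else:
            bins.append([length])
            remaining.append(capacity - length)
    return bins

# Ten short pairs packed into bins of capacity C = 2048
bins = first_fit_decreasing(
    [512, 300, 1024, 700, 128, 900, 400, 256, 600, 100], capacity=2048
)
# Packs all ten sequences into 3 bins, each summing to at most 2048
```

First-fit decreasing is a standard approximation: sorting largest-first keeps the small sequences available to fill the gaps the large ones leave behind.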
The algorithm creates histograms of sequence lengths, applies a packing strategy (e.g., first-fit decreasing), then fills packed sequences. Custom attention masks (via `seq_boundaries` / `cu_seqlens`) ensure mathematical equivalence to unpacked training:
Attention mask for packed sequence:
For token i belonging to example k:
attend to token j IF AND ONLY IF j also belongs to example k
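This rule corresponds to a block-diagonal attention mask derived from `cu_seqlens`. The NumPy sketch below materializes the mask for clarity; in practice varlen attention kernels consume `cu_seqlens` directly without building the full matrix:

```python
import numpy as np

def block_diagonal_mask(cu_seqlens):
    """mask[i, j] is True iff tokens i and j belong to the same packed example."""
    total = int(cu_seqlens[-1])
    # Map each token position to the index of the example it belongs to
    example_id = np.zeros(total, dtype=np.int64)
    for k in range(len(cu_seqlens) - 1):
        example_id[cu_seqlens[k]:cu_seqlens[k + 1]] = k
    return example_id[:, None] == example_id[None, :]

# Boundaries from three examples of lengths 3, 2, and 4
mask = block_diagonal_mask(np.array([0, 3, 5, 9]))
# Token 0 (example 0) may attend to tokens 0-2 but not to tokens 3-8
```

A causal mask for training would additionally require j <= i, but the cross-example blocking shown here is what makes packed training equivalent to unpacked training.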
Pseudo-code
FUNCTION prepare_packed_dpo_dataset(input_path, output_path, target_length):
    samples = load_dpo_samples(input_path)
    # Compute lengths for all chosen and rejected sequences
    lengths = compute_pair_lengths(samples)
    # Build histogram and apply bin-packing
    bins = bin_pack_first_fit_decreasing(lengths, capacity=target_length)
    # Create packed sequences
    FOR each bin IN bins:
        packed_tokens = concatenate(sequences in bin)
        cu_seqlens = compute_cumulative_sequence_lengths(sequences in bin)
        store_packed(packed_tokens, cu_seqlens)
    save_as_npy(output_path)

FUNCTION train_with_packed_data(config):
    IF config.data_impl == "packed_jsonl":
        dataset = DPOPackedDataset(config.packed_data_path)
    ELSE:
        dataset = DPOModelDataset(config.data_path)
    RETURN dataset
Related Pages
- Implementation:NVIDIA_NeMo_Aligner_Prepare_Packed_DPO_Dataset
- Heuristic:NVIDIA_NeMo_Aligner_DPO_Sequence_Packing_Tips