Principle:NVIDIA NeMo Aligner DPO Sequence Packing
| Principle: DPO Sequence Packing | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | Data_Engineering, Performance_Optimization |
| Related | Implementation:NVIDIA_NeMo_Aligner_Prepare_Packed_DPO_Dataset |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Preprocessing technique that concatenates multiple short preference pairs into fixed-length packed sequences for improved GPU utilization.
Description
When training DPO on datasets with many short examples, naive padding wastes significant GPU memory and compute on padding tokens. Sequence packing solves this by concatenating multiple chosen-rejected pairs into single sequences of a target length, using bin-packing algorithms to minimize wasted space.
The packed format includes sequence boundaries (`cu_seqlens`) so the model can apply attention masks that prevent cross-example attention. This ensures that tokens from one preference pair do not attend to tokens from a different pair packed into the same sequence.
The packing process involves:
- Histogram construction of sequence lengths across the dataset
- Bin-packing algorithm to assign sequences to fixed-capacity bins
- Concatenation of assigned sequences with boundary markers
- Output as NPY files consumed by `DPOPackedDataset` during training
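The concatenation-with-boundary-markers step can be sketched as follows. This is an illustrative helper, not NeMo Aligner's actual code; it assumes tokenized examples are plain lists of token IDs:

```python
import numpy as np

def pack_bin(sequences):
    """Concatenate the sequences assigned to one bin and record boundaries.

    Returns the packed token array plus cu_seqlens: cumulative sequence
    lengths, so example k occupies positions
    [cu_seqlens[k], cu_seqlens[k + 1]) of the packed sequence.
    """
    packed_tokens = np.concatenate([np.asarray(s) for s in sequences])
    lengths = [len(s) for s in sequences]
    cu_seqlens = np.concatenate([[0], np.cumsum(lengths)])
    return packed_tokens, cu_seqlens

# Three short examples packed into one sequence of length 9
tokens, cu = pack_bin([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
# cu is [0, 3, 5, 9]
```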
Usage
Use as an optional preprocessing step before DPO training when examples are significantly shorter than the maximum sequence length.
- Run `prepare_packed_dpo_dataset.py` before training.
- The training config must set `data_impl` to `"packed_jsonl"` to use packed data.
- Most beneficial when the ratio of average sequence length to max sequence length is low (e.g., many sequences under 512 tokens with a 2048 max).
- Not recommended when sequences already approach the maximum length, as packing overhead outweighs padding savings.
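In the training config, the switch looks roughly like the YAML fragment below. Only `data_impl: packed_jsonl` comes from this page; the surrounding keys and the path key are illustrative, so check the actual DPO config schema:

```yaml
model:
  data:
    data_impl: packed_jsonl        # select the packed dataset loader
    # illustrative key: path to the NPY output of prepare_packed_dpo_dataset.py
    packed_data_path: /data/dpo_packed
```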
Theoretical Basis
The core problem is a variant of the bin-packing problem: given N sequences of varying lengths, pack them into bins of fixed capacity to minimize the number of bins (and thus wasted padding).
Given:
sequences S_1, S_2, ..., S_N with lengths l_1, l_2, ..., l_N
bin capacity C (target packed sequence length)
Goal:
Assign sequences to bins such that:
- sum of lengths in each bin <= C
- number of bins is minimized
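A minimal first-fit-decreasing sketch of this objective (a common heuristic for bin packing, shown here for illustration rather than as NeMo Aligner's implementation):

```python
def first_fit_decreasing(lengths, capacity):
    """Greedy bin packing: place each length (largest first) into the
    first bin with enough remaining room, opening a new bin if none fits."""
    bins = []        # each bin holds the lengths assigned to it
    remaining = []   # free capacity per bin
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError(f"length {length} exceeds bin capacity {capacity}")
        for i, free in enumerate(remaining):
            if length <= free:
                bins[i].append(length)
                remaining[i] -= length
                break
        else:
            bins.append([length])
            remaining.append(capacity - length)
    return bins

# Ten short pairs packed into bins of capacity C = 2048
bins = first_fit_decreasing(
    [512, 300, 1024, 700, 128, 900, 400, 256, 600, 100], capacity=2048
)
# Packs all ten sequences into 3 bins, each summing to at most 2048
```

First-fit decreasing is a standard approximation: sorting largest-first keeps the small sequences available to fill the gaps the large ones leave behind.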
The algorithm creates histograms of sequence lengths, applies a packing strategy (e.g., first-fit decreasing), then fills packed sequences. Custom attention masks (via `seq_boundaries` / `cu_seqlens`) ensure mathematical equivalence to unpacked training:
Attention mask for packed sequence:
For token i belonging to example k:
attend to token j IF AND ONLY IF j also belongs to example k
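This rule corresponds to a block-diagonal attention mask derived from `cu_seqlens`. The NumPy sketch below materializes the mask for clarity; in practice varlen attention kernels consume `cu_seqlens` directly without building the full matrix:

```python
import numpy as np

def block_diagonal_mask(cu_seqlens):
    """mask[i, j] is True iff tokens i and j belong to the same packed example."""
    total = int(cu_seqlens[-1])
    # Map each token position to the index of the example it belongs to
    example_id = np.zeros(total, dtype=np.int64)
    for k in range(len(cu_seqlens) - 1):
        example_id[cu_seqlens[k]:cu_seqlens[k + 1]] = k
    return example_id[:, None] == example_id[None, :]

# Boundaries from three examples of lengths 3, 2, and 4
mask = block_diagonal_mask(np.array([0, 3, 5, 9]))
# Token 0 (example 0) may attend to tokens 0-2 but not to tokens 3-8
```

A causal mask for training would additionally require j <= i, but the cross-example blocking shown here is what makes packed training equivalent to unpacked training.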
Pseudo-code
FUNCTION prepare_packed_dpo_dataset(input_path, output_path, target_length):
    samples = load_dpo_samples(input_path)
    # Compute lengths for all chosen and rejected sequences
    lengths = compute_pair_lengths(samples)
    # Build histogram and apply bin-packing
    bins = bin_pack_first_fit_decreasing(lengths, capacity=target_length)
    # Create packed sequences
    FOR each bin IN bins:
        packed_tokens = concatenate(sequences in bin)
        cu_seqlens = compute_cumulative_sequence_lengths(sequences in bin)
        store_packed(packed_tokens, cu_seqlens)
    save_as_npy(output_path)

FUNCTION train_with_packed_data(config):
    IF config.data_impl == "packed_jsonl":
        dataset = DPOPackedDataset(config.packed_data_path)
    ELSE:
        dataset = DPOModelDataset(config.data_path)
    RETURN dataset
Related Pages
- Implementation:NVIDIA_NeMo_Aligner_Prepare_Packed_DPO_Dataset
- Heuristic:NVIDIA_NeMo_Aligner_DPO_Sequence_Packing_Tips