Implementation:NVIDIA NeMo Aligner Prepare Packed DPO Dataset
| Implementation Details | |
|---|---|
| Name | Prepare_Packed_DPO_Dataset |
| Type | External Tool Doc |
| Implements Principle | DPO_Sequence_Packing |
| Module | examples.nlp.data.dpo |
| Repository | NeMo Aligner |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete preprocessing tool for packing DPO preference pairs into fixed-length sequences using bin-packing algorithms provided by NeMo Aligner's data utilities.
Description
The prepare_packed_dpo_dataset.py script preprocesses DPO datasets in four steps: (1) tokenizes all preference pairs via tokenize_dataset; (2) builds a histogram of sequence lengths; (3) runs a bin-packing algorithm to assign sequences to fixed-size bins; (4) fills the packed bins via fill_packing_strategy, concatenating input_ids, labels, rewards, lengths, and sequence boundaries. The resulting NPY files are consumed by DPOPackedDataset during training.
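The histogram-plus-bin-packing flow (steps 2 and 3) can be sketched with a simple first-fit-decreasing packer. This is an illustrative standalone sketch, not the NeMo Aligner implementation; the real logic lives in nemo.utils.sequence_packing_utils, and first_fit_pack is a hypothetical helper name.

```python
def first_fit_pack(seq_lengths, pack_size):
    """Assign sequence indices to bins so each bin's total length <= pack_size."""
    bins = []       # each bin is a list of sequence indices
    remaining = []  # free space left in each bin
    # Packing longest-first (first-fit decreasing) wastes less space per bin.
    for idx in sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i]):
        length = seq_lengths[idx]
        for b, free in enumerate(remaining):
            if length <= free:
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:
            # No existing bin fits this sequence; open a new one.
            bins.append([idx])
            remaining.append(pack_size - length)
    return bins

lengths = [3000, 1500, 1000, 500, 4000]
assignments = first_fit_pack(lengths, pack_size=4096)
# Every bin respects the cap:
assert all(sum(lengths[i] for i in b) <= 4096 for b in assignments)
```

With a 4096-token cap, the five sequences above pack into three bins instead of five rows, which is the source of the throughput gain from packing.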
Usage
Run as a preprocessing step before DPO training when using packed sequences. Execute from command line with Hydra config. Then set data_impl="packed_jsonl" in the DPO training config.
Code Reference
Source Location
- Repository: NeMo Aligner
- File: examples/nlp/data/dpo/prepare_packed_dpo_dataset.py
- Lines: L82-266
Signature
def tokenize_dataset(cfg: DictConfig, tokenizer_type: str) -> np.ndarray:
    """Tokenize DPO dataset and return array of combined chosen/rejected dicts."""

def fill_packing_strategy(
    assignments: List[List[int]],
    sequences: Dict[int, List[Dict]],
    pack_size: int,
) -> List[Dict]:
    """Fill bin-packing assignments with actual sequence data.

    Returns list of packed records with input_ids, labels, reward,
    lengths, and seq_boundaries.
    """
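A minimal sketch of what the fill step produces, assuming each packed record concatenates its bin's sequences and stores cumulative offsets in seq_boundaries (field names taken from the output contract below; fill_bins and the toy data are illustrative, not the library's API):

```python
def fill_bins(assignments, sequences, pack_size, pad_id=0):
    """Concatenate each bin's sequences into one packed record.

    assignments: list of bins, each a list of keys into `sequences`.
    sequences:   dict mapping key -> {"input_ids": [...], "reward": float}.
    """
    packed = []
    for bin_keys in assignments:
        input_ids, rewards, lengths, boundaries = [], [], [], [0]
        for key in bin_keys:
            seq = sequences[key]
            input_ids.extend(seq["input_ids"])
            rewards.append(seq["reward"])
            lengths.append(len(seq["input_ids"]))
            # Boundary after each sequence lets the dataset split them back apart.
            boundaries.append(len(input_ids))
        # Pad the concatenated ids up to the fixed pack_size.
        input_ids += [pad_id] * (pack_size - len(input_ids))
        packed.append({
            "input_ids": input_ids,
            "reward": rewards,
            "lengths": lengths,
            "seq_boundaries": boundaries,
        })
    return packed

seqs = {0: {"input_ids": [5, 6, 7], "reward": 1.0},
        1: {"input_ids": [8, 9], "reward": 0.0}}
records = fill_bins([[0, 1]], seqs, pack_size=8)
# records[0]["seq_boundaries"] == [0, 3, 5]; input_ids padded to length 8
```

The seq_boundaries offsets are what allow DPOPackedDataset to recover per-sequence spans (for loss masking and reward pairing) from a single packed row.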
Import
# Run as standalone script:
# python examples/nlp/data/dpo/prepare_packed_dpo_dataset.py <hydra-overrides>
# Internal imports (for reference):
from nemo.utils.sequence_packing_utils import create_hist, create_packing_strategy
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictConfig | Yes | Same config as gpt_dpo.yaml |
| tokenizer_type | str | Yes | Tokenizer identifier (huggingface or sentencepiece) |
Outputs
| Name | Type | Description |
|---|---|---|
| packed_data | NPY files | Array of dicts with input_ids, labels, reward, lengths, seq_boundaries |
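Because the output stores a NumPy object array of Python dicts, loading it for inspection requires allow_pickle=True. The round-trip below is illustrative (the record contents and path are made up); in practice the file is written by the preprocessing script.

```python
import os
import tempfile

import numpy as np

# Illustrative record mimicking the packed output schema above.
record = {"input_ids": [5, 6, 7, 0], "labels": [-100, 6, 7, -100],
          "reward": [1.0], "lengths": [3], "seq_boundaries": [0, 3]}
path = os.path.join(tempfile.mkdtemp(), "dpo_train_packed.npy")
np.save(path, np.array([record], dtype=object))

# Object arrays are pickled, so allow_pickle=True is required on load.
data = np.load(path, allow_pickle=True)
first = data[0]
```

A quick sanity check after preprocessing is to confirm that each record's seq_boundaries is monotonically increasing and that its last entry matches the sum of lengths.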
Usage Examples
# Step 1: Run packing preprocessing
python examples/nlp/data/dpo/prepare_packed_dpo_dataset.py \
model.data.data_prefix=/data/dpo_train.jsonl \
model.data.seq_length=4096 \
tokenizer_type=huggingface
# Step 2: Train with packed data
python examples/nlp/gpt/train_gpt_dpo.py \
model.data.data_impl=packed_jsonl \
model.data.data_prefix=/data/dpo_train_packed.npy
Related Pages
- Principle:NVIDIA_NeMo_Aligner_DPO_Sequence_Packing
- Environment:NVIDIA_NeMo_Aligner_NeMo_Framework_GPU_Environment
- Heuristic:NVIDIA_NeMo_Aligner_DPO_Sequence_Packing_Tips