Implementation:NVIDIA NeMo Aligner Prepare Packed DPO Dataset
| Implementation Details | |
|---|---|
| Name | Prepare_Packed_DPO_Dataset |
| Type | External Tool Doc |
| Implements Principle | DPO_Sequence_Packing |
| Module | examples.nlp.data.dpo |
| Repository | NeMo Aligner |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete preprocessing tool for packing DPO preference pairs into fixed-length sequences using bin-packing algorithms provided by NeMo Aligner's data utilities.
Description
The prepare_packed_dpo_dataset.py script preprocesses DPO datasets in four steps: (1) tokenizes all preference pairs via tokenize_dataset; (2) builds a histogram of sequence lengths; (3) runs a bin-packing algorithm to assign sequences to fixed-size bins; (4) fills the packed bins via fill_packing_strategy, concatenating input_ids, labels, rewards, lengths, and sequence boundaries. The resulting NPY files are consumed by DPOPackedDataset during training.
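The histogram-plus-bin-packing flow (steps 2 and 3) can be sketched with a simple first-fit-decreasing packer. This is an illustrative standalone sketch, not the NeMo Aligner implementation; the real logic lives in nemo.utils.sequence_packing_utils, and first_fit_pack is a hypothetical helper name.

```python
def first_fit_pack(seq_lengths, pack_size):
    """Assign sequence indices to bins so each bin's total length <= pack_size."""
    bins = []       # each bin is a list of sequence indices
    remaining = []  # free space left in each bin
    # Packing longest-first (first-fit decreasing) wastes less space per bin.
    for idx in sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i]):
        length = seq_lengths[idx]
        for b, free in enumerate(remaining):
            if length <= free:
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:
            # No existing bin fits this sequence; open a new one.
            bins.append([idx])
            remaining.append(pack_size - length)
    return bins

lengths = [3000, 1500, 1000, 500, 4000]
assignments = first_fit_pack(lengths, pack_size=4096)
# Every bin respects the cap:
assert all(sum(lengths[i] for i in b) <= 4096 for b in assignments)
```

With a 4096-token cap, the five sequences above pack into three bins instead of five rows, which is the source of the throughput gain from packing.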
Usage
Run as a preprocessing step before DPO training when using packed sequences. Execute from command line with Hydra config. Then set data_impl="packed_jsonl" in the DPO training config.
Code Reference
Source Location
- Repository: NeMo Aligner
- File: examples/nlp/data/dpo/prepare_packed_dpo_dataset.py
- Lines: L82-266
Signature
def tokenize_dataset(cfg: DictConfig, tokenizer_type: str) -> np.ndarray:
    """Tokenize DPO dataset and return array of combined chosen/rejected dicts."""

def fill_packing_strategy(
    assignments: List[List[int]],
    sequences: Dict[int, List[Dict]],
    pack_size: int,
) -> List[Dict]:
    """Fill bin-packing assignments with actual sequence data.

    Returns list of packed records with input_ids, labels, reward,
    lengths, and seq_boundaries.
    """
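A minimal sketch of what the fill step produces, assuming each packed record concatenates its bin's sequences and stores cumulative offsets in seq_boundaries (field names taken from the output contract below; fill_bins and the toy data are illustrative, not the library's API):

```python
def fill_bins(assignments, sequences, pack_size, pad_id=0):
    """Concatenate each bin's sequences into one packed record.

    assignments: list of bins, each a list of keys into `sequences`.
    sequences:   dict mapping key -> {"input_ids": [...], "reward": float}.
    """
    packed = []
    for bin_keys in assignments:
        input_ids, rewards, lengths, boundaries = [], [], [], [0]
        for key in bin_keys:
            seq = sequences[key]
            input_ids.extend(seq["input_ids"])
            rewards.append(seq["reward"])
            lengths.append(len(seq["input_ids"]))
            # Boundary after each sequence lets the dataset split them back apart.
            boundaries.append(len(input_ids))
        # Pad the concatenated ids up to the fixed pack_size.
        input_ids += [pad_id] * (pack_size - len(input_ids))
        packed.append({
            "input_ids": input_ids,
            "reward": rewards,
            "lengths": lengths,
            "seq_boundaries": boundaries,
        })
    return packed

seqs = {0: {"input_ids": [5, 6, 7], "reward": 1.0},
        1: {"input_ids": [8, 9], "reward": 0.0}}
records = fill_bins([[0, 1]], seqs, pack_size=8)
# records[0]["seq_boundaries"] == [0, 3, 5]; input_ids padded to length 8
```

The seq_boundaries offsets are what allow DPOPackedDataset to recover per-sequence spans (for loss masking and reward pairing) from a single packed row.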
Import
# Run as standalone script:
# python examples/nlp/data/dpo/prepare_packed_dpo_dataset.py <hydra-overrides>
# Internal imports (for reference):
from nemo.utils.sequence_packing_utils import create_hist, create_packing_strategy
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictConfig | Yes | Same config as gpt_dpo.yaml |
| tokenizer_type | str | Yes | Tokenizer identifier (huggingface or sentencepiece) |
Outputs
| Name | Type | Description |
|---|---|---|
| packed_data | NPY files | Array of dicts with input_ids, labels, reward, lengths, seq_boundaries |
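Because the output stores a NumPy object array of Python dicts, loading it for inspection requires allow_pickle=True. The round-trip below is illustrative (the record contents and path are made up); in practice the file is written by the preprocessing script.

```python
import os
import tempfile

import numpy as np

# Illustrative record mimicking the packed output schema above.
record = {"input_ids": [5, 6, 7, 0], "labels": [-100, 6, 7, -100],
          "reward": [1.0], "lengths": [3], "seq_boundaries": [0, 3]}
path = os.path.join(tempfile.mkdtemp(), "dpo_train_packed.npy")
np.save(path, np.array([record], dtype=object))

# Object arrays are pickled, so allow_pickle=True is required on load.
data = np.load(path, allow_pickle=True)
first = data[0]
```

A quick sanity check after preprocessing is to confirm that each record's seq_boundaries is monotonically increasing and that its last entry matches the sum of lengths.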
Usage Examples
# Step 1: Run packing preprocessing
python examples/nlp/data/dpo/prepare_packed_dpo_dataset.py \
model.data.data_prefix=/data/dpo_train.jsonl \
model.data.seq_length=4096 \
tokenizer_type=huggingface
# Step 2: Train with packed data
python examples/nlp/gpt/train_gpt_dpo.py \
model.data.data_impl=packed_jsonl \
model.data.data_prefix=/data/dpo_train_packed.npy
Related Pages
- Principle:NVIDIA_NeMo_Aligner_DPO_Sequence_Packing
- Environment:NVIDIA_NeMo_Aligner_NeMo_Framework_GPU_Environment
- Heuristic:NVIDIA_NeMo_Aligner_DPO_Sequence_Packing_Tips