
Implementation:NVIDIA NeMo Aligner Build DPO Datasets

From Leeroopedia


Implementation Details
Name Build_DPO_Datasets
Type API Doc
Implements Principle DPO_Preference_Data_Preparation
Module nemo_aligner.data.nlp
Repository NeMo Aligner
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete tool from the NeMo Aligner data builders module for constructing DPO preference-pair datasets from JSONL files.

Description

The build_train_valid_test_dpo_datasets function is a partial application of build_train_valid_test_datasets specialized to DPOModelDataset. The DPOModelDataset class (datasets.py:L301-523) tokenizes prompt/chosen/rejected triples, creates label tensors with prompt positions masked to -100, and handles both plain text and OpenAI conversation formats. The companion dpo_custom_collate function pads variable-length samples with distributed synchronization for consistent tensor shapes.
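The prompt-masking step can be illustrated with a minimal sketch. This is a hypothetical analogue, not the NeMo Aligner implementation: positions covered by the prompt are set to -100 (the PyTorch cross-entropy ignore index) so that loss is computed only over response tokens.

```python
# Hypothetical sketch of prompt-masked label construction.
# Assumption: prompt_ids and response_ids are already-tokenized ID lists.
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt + response IDs; mask prompt positions in labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels
```

DPOModelDataset applies this masking twice per sample, once for the chosen response and once for the rejected response.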

Usage

Import when setting up DPO, IPO, or RPO training data. For packed sequences, use build_train_valid_test_dpo_packed_datasets instead.

Code Reference

Source Location

  • Repository: NeMo Aligner
  • File: nemo_aligner/data/nlp/builders.py (L394 partial), nemo_aligner/data/nlp/datasets.py (L301-523 DPOModelDataset), nemo_aligner/algorithms/dpo.py (L42-116 dpo_custom_collate)

Signature

build_train_valid_test_dpo_datasets = partial(build_train_valid_test_datasets, DPOModelDataset)

class DPOModelDataset(Dataset):
    def __init__(self, cfg, tokenizer, name, data_prefix, documents, data,
                 seq_length, seed, drop_last=True, pad_chosen_rejected_to_max=True):
        ...
    def __getitem__(self, idx) -> dict:
        """Returns: chosen, rejected, chosen_length, rejected_length,
                   chosen_labels, rejected_labels, chosen_reward, rejected_reward, ignore_example"""

def dpo_custom_collate(
    batch: list[dict],
    eos_id: int,
    reset_position_ids: bool = False,
    reset_attention_mask: bool = False,
    eod_mask_loss: bool = False,
    pad_length_to_multiple_of: int | None = None,
) -> dict[str, torch.Tensor]:
    """Pads and collates DPO batch with distributed max-length sync."""

Import

from nemo_aligner.data.nlp.builders import build_train_valid_test_dpo_datasets
from nemo_aligner.algorithms.dpo import dpo_custom_collate

I/O Contract

Inputs (build_train_valid_test_dpo_datasets)

Name Type Required Description
cfg DictConfig Yes Data config
data_prefix str Yes Path to JSONL with prompt/chosen_response/rejected_response
seq_length int Yes Maximum sequence length
tokenizer TokenizerSpec Yes Model tokenizer
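The expected JSONL schema has one preference pair per line. The record below is an illustrative example (the key names prompt/chosen_response/rejected_response match the input description above; the values are made up):

```python
import json

# Example JSONL record for a DPO preference pair (illustrative values).
record = {
    "prompt": "Explain DPO in one sentence.",
    "chosen_response": "DPO optimizes a policy directly from preference pairs.",
    "rejected_response": "DPO is a kind of reward model.",
}
line = json.dumps(record)      # one line of the JSONL file
parsed = json.loads(line)      # round-trips back to the same keys
```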

Outputs (build_train_valid_test_dpo_datasets)

Name Type Description
train_ds DPOModelDataset Training preference dataset
val_ds DPOModelDataset Validation dataset
test_ds DPOModelDataset Test dataset

Inputs (dpo_custom_collate)

Name Type Required Description
batch list[dict] Yes List of dataset items
eos_id int Yes EOS token ID for padding

Outputs (dpo_custom_collate)

Name Type Description
output dict chosen, rejected, chosen_labels, rejected_labels, attention_mask, position_ids, chosen_rewards, rejected_rewards (all Tensor)
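The padding behavior can be sketched in a simplified, single-process form. This is not dpo_custom_collate itself: the real function operates on tensors and additionally synchronizes the maximum length across distributed ranks so every rank produces identically shaped batches.

```python
# Simplified single-process analogue of the padding step (assumption:
# each batch element is a plain list of token IDs, not a tensor).
def pad_collate(batch, eos_id, pad_length_to_multiple_of=None):
    """Pad every sequence to the batch max length using eos_id."""
    max_len = max(len(seq) for seq in batch)
    if pad_length_to_multiple_of:
        # round max_len up to the nearest multiple
        max_len = -(-max_len // pad_length_to_multiple_of) * pad_length_to_multiple_of
    return [seq + [eos_id] * (max_len - len(seq)) for seq in batch]
```

Rounding the padded length to a multiple (e.g. of 8 or 16) is a common choice for tensor-core efficiency on GPUs.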

Usage Examples

from nemo_aligner.data.nlp.builders import build_train_valid_test_dpo_datasets
from nemo_aligner.algorithms.dpo import dpo_custom_collate
from functools import partial

train_ds, val_ds, test_ds = build_train_valid_test_dpo_datasets(
    cfg=cfg.model.data,
    data_prefix=cfg.model.data.data_prefix,
    data_impl="jsonl",
    splits_string="950,25,25",
    train_valid_test_num_samples=[20000, 500, 500],
    seq_length=cfg.model.data.seq_length,
    seed=cfg.model.seed,
    tokenizer=model.tokenizer,
)

collate_fn = partial(dpo_custom_collate, eos_id=model.tokenizer.eos_id)

Related Pages

Knowledge Sources

NLP, Data_Engineering
