Principle: NVIDIA NeMo Aligner DPO Preference Data Preparation
| Principle: DPO Preference Data Preparation | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | NLP, Data_Engineering |
| Related | Implementation:NVIDIA_NeMo_Aligner_Build_DPO_Datasets |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Process of constructing preference pair datasets with prompt-level labels for Direct Preference Optimization training.
Description
DPO training requires datasets with prompt, chosen response, and rejected response triples. Unlike reward model training data, DPO data also preserves prompt boundaries (via label masking) since DPO computes log probabilities over response tokens only.
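For concreteness, a single preference record in the JSONL input format might look like the following (field values are invented for illustration; field names follow the Usage section below):

```python
import json

# One preference sample: a shared prompt plus a chosen and a rejected response.
record = {
    "prompt": "Explain what DPO training is in one sentence.",
    "chosen_response": "DPO fine-tunes a model directly on preference pairs "
                       "without training a separate reward model.",
    "rejected_response": "DPO is a database optimizer.",
}

line = json.dumps(record)   # one JSONL line per sample
parsed = json.loads(line)
```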
The data preparation step performs the following operations:
- Tokenizes both chosen and rejected responses with their shared prompt.
- Creates label tensors with prompt tokens masked to -100 (ignored in loss computation).
- Optionally includes ground-truth reward values for monitoring training progress.
- Handles variable-length padding with distributed synchronization to ensure consistent tensor shapes across data-parallel ranks.
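The masking step above can be sketched as follows. This is a minimal illustration, not the NeMo Aligner implementation: toy_tokenize stands in for a real tokenizer, and IGNORE_INDEX matches PyTorch's default cross-entropy ignore index.

```python
IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy loss

def toy_tokenize(text):
    # Stand-in for a real tokenizer: map each word to a fake token ID.
    return [hash(w) % 1000 for w in text.split()]

def build_masked_labels(prompt, response):
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)
    input_ids = prompt_ids + response_ids
    # Prompt positions are masked out so only response tokens enter the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

inp, lab = build_masked_labels("What is DPO?", "A preference optimization method.")
```

The same function is applied twice per sample, once with the chosen response and once with the rejected one, so both sequences share the same masked prompt prefix.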
The custom collation function performs an all-reduce across distributed ranks to find the global maximum sequence length, ensuring that all ranks produce identically shaped tensors for efficient data-parallel training.
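The collation logic can be illustrated without a real process group by simulating the MAX all-reduce over per-rank sequence lengths (a pure-Python sketch; the actual implementation uses torch.distributed collectives on tensors):

```python
def simulated_all_reduce_max(local_values):
    # Stand-in for dist.all_reduce(t, op=dist.ReduceOp.MAX): after the
    # collective, every rank holds the global maximum.
    return max(local_values)

def collate(rank_batches, pad_id=0):
    # Each rank first finds its local maximum sequence length...
    local_maxes = [max(len(seq) for seq in batch) for batch in rank_batches]
    # ...then all ranks agree on the global maximum, so every rank
    # produces identically shaped (padded) batches.
    global_max = simulated_all_reduce_max(local_maxes)
    return [
        [seq + [pad_id] * (global_max - len(seq)) for seq in batch]
        for batch in rank_batches
    ]

ranks = [[[1, 2, 3]], [[4, 5, 6, 7, 8]]]  # two data-parallel "ranks"
padded = collate(ranks)
```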
Usage
Use when preparing data for DPO, IPO, or RPO training.
- Input format is JSONL with prompt/chosen_response/rejected_response fields, or OpenAI conversation format.
- Output is a DPOModelDataset returning dict batches with both chosen and rejected variants and their labels.
- Each sample produces two tokenized sequences (chosen and rejected) that share the same prompt prefix.
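The shape of one such dict batch might look like the following. The key names and token values here are illustrative assumptions; the actual keys used by DPOModelDataset may differ.

```python
# Illustrative single-sample batch dict; values are fabricated token IDs.
batch = {
    "chosen": [101, 7, 8, 9],          # prompt token (101) + chosen response
    "chosen_labels": [-100, 7, 8, 9],  # prompt masked to -100
    "rejected": [101, 4, 5],           # same prompt token + rejected response
    "rejected_labels": [-100, 4, 5],
}

# The shared prompt length is recoverable from the masked positions.
shared_prompt_len = sum(1 for t in batch["chosen_labels"] if t == -100)
```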
Theoretical Basis
DPO loss requires computing log pi_theta(y|x) for both chosen and rejected responses. Label masking ensures only response tokens contribute to log-probability computation:
log pi_theta(y|x) = sum over t in response_tokens of log P(y_t | x, y_{<t})
Labels:
prompt tokens -> masked to -100 (excluded from loss)
response tokens -> actual token IDs (included in loss)
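Given the label convention above, the masked log-probability sum can be computed per sequence by skipping masked positions. This is an illustrative sketch using plain floats rather than model logits:

```python
import math

IGNORE_INDEX = -100

def masked_logprob_sum(token_logprobs, labels):
    # Only positions whose label is not IGNORE_INDEX (i.e. response tokens)
    # contribute to log pi_theta(y|x).
    return sum(
        lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX
    )

# Two prompt tokens (masked) followed by two response tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.5)]
labels = [IGNORE_INDEX, IGNORE_INDEX, 42, 43]
total = masked_logprob_sum(logprobs, labels)  # log(0.5) + log(0.5) = log(0.25)
```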
The collation pads chosen and rejected to identical lengths within each batch for efficient parallel computation, using distributed all-reduce to find the global maximum sequence length:
global_max_len = all_reduce(local_max_len, op=MAX)
pad all sequences to global_max_len
Pseudo-code
FUNCTION build_dpo_dataset(data_path, tokenizer, max_seq_length):
    samples = load_jsonl(data_path)
    FOR each sample in samples:
        prompt_tokens   = tokenize(sample.prompt)
        chosen_tokens   = tokenize(sample.chosen_response)
        rejected_tokens = tokenize(sample.rejected_response)
        chosen_input    = prompt_tokens + chosen_tokens
        rejected_input  = prompt_tokens + rejected_tokens
        chosen_labels   = [-100] * len(prompt_tokens) + chosen_tokens
        rejected_labels = [-100] * len(prompt_tokens) + rejected_tokens
        truncate each input and its labels to max_seq_length
        store(chosen_input, chosen_labels, rejected_input, rejected_labels)
    RETURN DPOModelDataset(all_samples)
FUNCTION collate_fn(batch, distributed_ranks):
    local_max  = max(sequence lengths in batch)
    global_max = all_reduce(local_max, op=MAX)
    pad all sequences in batch to global_max
    RETURN padded_batch