
Principle:NVIDIA NeMo Aligner DPO Preference Data Preparation

From Leeroopedia


Principle: DPO Preference Data Preparation
Type Principle
Project NVIDIA NeMo Aligner
Domains NLP, Data_Engineering
Related Implementation:NVIDIA_NeMo_Aligner_Build_DPO_Datasets
Last Updated 2026-02-07 00:00 GMT

Overview

The process of constructing preference-pair datasets with prompt-level label masks for Direct Preference Optimization (DPO) training.

Description

DPO training requires datasets of (prompt, chosen response, rejected response) triples. Unlike reward-model training data, DPO data must also preserve prompt boundaries (via label masking), because DPO computes log probabilities over response tokens only.

The data preparation step performs the following operations:

  • Tokenizes both chosen and rejected responses with their shared prompt.
  • Creates label tensors with prompt tokens masked to -100 (ignored in loss computation).
  • Optionally includes ground-truth reward values for monitoring training progress.
  • Handles variable-length padding with distributed synchronization to ensure consistent tensor shapes across data-parallel ranks.
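The tokenization and masking steps above can be sketched as follows. This is a minimal illustration, not NeMo Aligner's actual dataset code; the `tokenize` helper is a hypothetical stand-in for the model's real tokenizer:

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def tokenize(text):
    # Stand-in word-level tokenizer for illustration only;
    # the real pipeline uses the model's own tokenizer.
    return [hash(w) % 1000 for w in text.split()]


def build_pair(prompt, chosen, rejected):
    prompt_ids = tokenize(prompt)
    chosen_ids = tokenize(chosen)
    rejected_ids = tokenize(rejected)
    # Inputs: the shared prompt prefix followed by each response.
    chosen_input = prompt_ids + chosen_ids
    rejected_input = prompt_ids + rejected_ids
    # Labels: prompt positions masked to -100 so only response
    # tokens contribute to the log-probability computation.
    chosen_labels = [IGNORE_INDEX] * len(prompt_ids) + chosen_ids
    rejected_labels = [IGNORE_INDEX] * len(prompt_ids) + rejected_ids
    return chosen_input, chosen_labels, rejected_input, rejected_labels
```

Note that both sequences carry the same masked prompt prefix; they differ only in their response suffixes.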

The custom collation function performs an all-reduce across distributed ranks to find the global maximum sequence length, ensuring that all ranks produce identically shaped tensors for efficient data-parallel training.
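A collation function of this shape can be sketched in PyTorch as below. This is a simplified sketch, assuming one `input_ids`/`labels` pair per sample; NeMo Aligner's actual collate additionally handles the chosen and rejected variants and attention masks:

```python
import torch
import torch.distributed as dist


def collate_fn(batch, pad_token_id=0):
    """Pad a batch to the global max length across data-parallel ranks.

    `batch` is a list of dicts holding 1-D `input_ids` and `labels`
    tensors of varying lengths.
    """
    local_max = torch.tensor(max(len(s["input_ids"]) for s in batch))
    if dist.is_available() and dist.is_initialized():
        # Agree on one length so every rank emits identically shaped tensors.
        dist.all_reduce(local_max, op=dist.ReduceOp.MAX)
    global_max = int(local_max)

    def pad(t, value):
        # Right-pad a 1-D tensor up to the agreed global length.
        return torch.nn.functional.pad(t, (0, global_max - len(t)), value=value)

    return {
        "input_ids": torch.stack(
            [pad(s["input_ids"], pad_token_id) for s in batch]
        ),
        # Padding positions get -100 so they are also excluded from the loss.
        "labels": torch.stack([pad(s["labels"], -100) for s in batch]),
    }
```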

Usage

Use when preparing data for DPO, IPO, or RPO training.

  • Input format is JSONL with prompt/chosen_response/rejected_response fields, or OpenAI conversation format.
  • Output is a DPOModelDataset returning dict batches with both chosen and rejected variants and their labels.
  • Each sample produces two tokenized sequences (chosen and rejected) that share the same prompt prefix.
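A minimal JSONL record in the triple format might look like the following; the field names come from the input format above, while the example strings themselves are invented for illustration:

```python
import json

# One JSONL line in the prompt/chosen_response/rejected_response format.
line = json.dumps({
    "prompt": "Explain DPO in one sentence.",
    "chosen_response": "DPO fine-tunes a policy directly on preference pairs.",
    "rejected_response": "DPO is a kind of database.",
})

sample = json.loads(line)
# Every record must carry all three fields of the preference triple.
assert {"prompt", "chosen_response", "rejected_response"} <= sample.keys()
```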

Theoretical Basis

DPO loss requires computing log pi_theta(y|x) for both chosen and rejected responses. Label masking ensures only response tokens contribute to log-probability computation:

log pi_theta(y|x) = sum over t in response_tokens of log P(y_t | x, y_{<t})

Labels:
  prompt tokens  -> masked to -100 (excluded from loss)
  response tokens -> actual token IDs (included in loss)
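The -100 convention matches PyTorch's cross-entropy `ignore_index` default, which is why masked prompt positions drop out of the loss. A small sketch with random logits (shapes and values are illustrative only):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(4, vocab_size)        # 4 token positions
labels = torch.tensor([-100, -100, 3, 7])  # prompt masked, response kept

# Positions labeled -100 are skipped entirely: the loss averages
# over the two response tokens only.
masked_loss = F.cross_entropy(logits, labels, ignore_index=-100)
response_only = F.cross_entropy(logits[2:], labels[2:])
assert torch.allclose(masked_loss, response_only)
```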

The collation pads chosen and rejected to identical lengths within each batch for efficient parallel computation, using distributed all-reduce to find the global maximum sequence length:

global_max_len = all_reduce(local_max_len, op=MAX)
pad all sequences to global_max_len

Pseudo-code

FUNCTION build_dpo_dataset(data_path, tokenizer, max_seq_length):
    samples = load_jsonl(data_path)

    FOR each sample in samples:
        prompt_tokens = tokenize(sample.prompt)
        chosen_tokens = tokenize(sample.chosen_response)
        rejected_tokens = tokenize(sample.rejected_response)

        chosen_input = prompt_tokens + chosen_tokens
        rejected_input = prompt_tokens + rejected_tokens
        chosen_labels = [-100] * len(prompt_tokens) + chosen_tokens
        rejected_labels = [-100] * len(prompt_tokens) + rejected_tokens

        truncate all four sequences to max_seq_length
        store(chosen_input, chosen_labels, rejected_input, rejected_labels)

    RETURN DPOModelDataset(all_samples)

FUNCTION collate_fn(batch, distributed_ranks):
    local_max = max(sequence lengths in batch)
    global_max = all_reduce(local_max, op=MAX)
    pad all sequences to global_max
    RETURN padded_batch
