Principle: NVIDIA NeMo Aligner DPO Preference Data Preparation
| Principle: DPO Preference Data Preparation | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | NLP, Data_Engineering |
| Related | Implementation:NVIDIA_NeMo_Aligner_Build_DPO_Datasets |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Process of constructing preference pair datasets with prompt-level labels for Direct Preference Optimization training.
Description
DPO training requires datasets with prompt, chosen response, and rejected response triples. Unlike reward model training data, DPO data also preserves prompt boundaries (via label masking) since DPO computes log probabilities over response tokens only.
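For concreteness, a single preference record in the JSONL input format might look like the following (field values are invented for illustration; field names follow the Usage section below):

```python
import json

# One preference sample: a shared prompt plus a chosen and a rejected response.
record = {
    "prompt": "Explain what DPO training is in one sentence.",
    "chosen_response": "DPO fine-tunes a model directly on preference pairs "
                       "without training a separate reward model.",
    "rejected_response": "DPO is a database optimizer.",
}

line = json.dumps(record)   # one JSONL line per sample
parsed = json.loads(line)
```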
The data preparation step performs the following operations:
- Tokenizes both chosen and rejected responses with their shared prompt.
- Creates label tensors with prompt tokens masked to -100 (ignored in loss computation).
- Optionally includes ground-truth reward values for monitoring training progress.
- Handles variable-length padding with distributed synchronization to ensure consistent tensor shapes across data-parallel ranks.
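The masking step above can be sketched as follows. This is a minimal illustration, not the NeMo Aligner implementation: toy_tokenize stands in for a real tokenizer, and IGNORE_INDEX matches PyTorch's default cross-entropy ignore index.

```python
IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy loss

def toy_tokenize(text):
    # Stand-in for a real tokenizer: map each word to a fake token ID.
    return [hash(w) % 1000 for w in text.split()]

def build_masked_labels(prompt, response):
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)
    input_ids = prompt_ids + response_ids
    # Prompt positions are masked out so only response tokens enter the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

inp, lab = build_masked_labels("What is DPO?", "A preference optimization method.")
```

The same function is applied twice per sample, once with the chosen response and once with the rejected one, so both sequences share the same masked prompt prefix.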
The custom collation function performs an all-reduce across distributed ranks to find the global maximum sequence length, ensuring that all ranks produce identically shaped tensors for efficient data-parallel training.
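The collation logic can be illustrated without a real process group by simulating the MAX all-reduce over per-rank sequence lengths (a pure-Python sketch; the actual implementation uses torch.distributed collectives on tensors):

```python
def simulated_all_reduce_max(local_values):
    # Stand-in for dist.all_reduce(t, op=dist.ReduceOp.MAX): after the
    # collective, every rank holds the global maximum.
    return max(local_values)

def collate(rank_batches, pad_id=0):
    # Each rank first finds its local maximum sequence length...
    local_maxes = [max(len(seq) for seq in batch) for batch in rank_batches]
    # ...then all ranks agree on the global maximum, so every rank
    # produces identically shaped (padded) batches.
    global_max = simulated_all_reduce_max(local_maxes)
    return [
        [seq + [pad_id] * (global_max - len(seq)) for seq in batch]
        for batch in rank_batches
    ]

ranks = [[[1, 2, 3]], [[4, 5, 6, 7, 8]]]  # two data-parallel "ranks"
padded = collate(ranks)
```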
Usage
Use when preparing data for DPO, IPO, or RPO training.
- Input format is JSONL with prompt/chosen_response/rejected_response fields, or OpenAI conversation format.
- Output is a DPOModelDataset returning dict batches with both chosen and rejected variants and their labels.
- Each sample produces two tokenized sequences (chosen and rejected) that share the same prompt prefix.
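The shape of one such dict batch might look like the following. The key names and token values here are illustrative assumptions; the actual keys used by DPOModelDataset may differ.

```python
# Illustrative single-sample batch dict; values are fabricated token IDs.
batch = {
    "chosen": [101, 7, 8, 9],          # prompt token (101) + chosen response
    "chosen_labels": [-100, 7, 8, 9],  # prompt masked to -100
    "rejected": [101, 4, 5],           # same prompt token + rejected response
    "rejected_labels": [-100, 4, 5],
}

# The shared prompt length is recoverable from the masked positions.
shared_prompt_len = sum(1 for t in batch["chosen_labels"] if t == -100)
```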
Theoretical Basis
DPO loss requires computing log pi_theta(y|x) for both chosen and rejected responses. Label masking ensures only response tokens contribute to log-probability computation:
log pi_theta(y|x) = sum over t in response_tokens of log P(y_t | x, y_{<t})
Labels:
prompt tokens -> masked to -100 (excluded from loss)
response tokens -> actual token IDs (included in loss)
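Given the label convention above, the masked log-probability sum can be computed per sequence by skipping masked positions. This is an illustrative sketch using plain floats rather than model logits:

```python
import math

IGNORE_INDEX = -100

def masked_logprob_sum(token_logprobs, labels):
    # Only positions whose label is not IGNORE_INDEX (i.e. response tokens)
    # contribute to log pi_theta(y|x).
    return sum(
        lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX
    )

# Two prompt tokens (masked) followed by two response tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.5)]
labels = [IGNORE_INDEX, IGNORE_INDEX, 42, 43]
total = masked_logprob_sum(logprobs, labels)  # log(0.5) + log(0.5) = log(0.25)
```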
The collation pads chosen and rejected to identical lengths within each batch for efficient parallel computation, using distributed all-reduce to find the global maximum sequence length:
global_max_len = all_reduce(local_max_len, op=MAX)
pad all sequences to global_max_len
Pseudo-code
FUNCTION build_dpo_dataset(data_path, tokenizer, max_seq_length):
    samples = load_jsonl(data_path)
    FOR each sample in samples:
        prompt_tokens   = tokenize(sample.prompt)
        chosen_tokens   = tokenize(sample.chosen_response)
        rejected_tokens = tokenize(sample.rejected_response)
        chosen_input    = prompt_tokens + chosen_tokens
        rejected_input  = prompt_tokens + rejected_tokens
        chosen_labels   = [-100] * len(prompt_tokens) + chosen_tokens
        rejected_labels = [-100] * len(prompt_tokens) + rejected_tokens
        truncate each input and its labels to max_seq_length
        store(chosen_input, chosen_labels, rejected_input, rejected_labels)
    RETURN DPOModelDataset(all_samples)
FUNCTION collate_fn(batch, distributed_ranks):
    local_max  = max(sequence lengths in batch)
    global_max = all_reduce(local_max, op=MAX)
    pad all sequences in batch to global_max
    RETURN padded_batch