| Implementation Details | |
| --- | --- |
| Name | Build_DPO_Datasets |
| Type | API Doc |
| Implements Principle | DPO_Preference_Data_Preparation |
| Module | nemo_aligner.data.nlp |
| Repository | NeMo Aligner |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool, provided by the NeMo Aligner data builders module, for constructing DPO preference-pair datasets from JSONL files.
Description
The build_train_valid_test_dpo_datasets function is a partial application of build_train_valid_test_datasets specialized to DPOModelDataset. The DPOModelDataset class (datasets.py:L301-523) tokenizes prompt/chosen/rejected triples, creates label tensors with prompt positions masked to -100, and handles both plain text and OpenAI conversation formats. The companion dpo_custom_collate function pads variable-length samples with distributed synchronization for consistent tensor shapes.
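The prompt-masking step described above can be sketched as follows. This is a minimal, illustrative sketch (the helper name is hypothetical, not the NeMo Aligner API): every prompt position receives the label -100, the conventional ignore index for cross-entropy loss, so only response tokens contribute to the loss. The real dataset returns torch tensors rather than Python lists.

```python
IGNORE_INDEX = -100  # loss-mask value; positions with this label are skipped by the loss

def build_labels(prompt_ids: list[int], response_ids: list[int]) -> list[int]:
    """Hypothetical sketch: mask every prompt position so loss is computed
    only on response tokens, mirroring what DPOModelDataset does for both
    the chosen and rejected sequences."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)

labels = build_labels([11, 12, 13], [21, 22])
# -> [-100, -100, -100, 21, 22]
```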
Usage
Import when setting up DPO, IPO, or RPO training data. For packed sequences, use build_train_valid_test_dpo_packed_datasets instead.
Code Reference
Source Location
- Repository: NeMo Aligner
- File:
nemo_aligner/data/nlp/builders.py (L394 partial), nemo_aligner/data/nlp/datasets.py (L301-523 DPOModelDataset), nemo_aligner/algorithms/dpo.py (L42-116 dpo_custom_collate)
Signature
build_train_valid_test_dpo_datasets = partial(build_train_valid_test_datasets, DPOModelDataset)
class DPOModelDataset(Dataset):
def __init__(self, cfg, tokenizer, name, data_prefix, documents, data,
seq_length, seed, drop_last=True, pad_chosen_rejected_to_max=True):
...
def __getitem__(self, idx) -> dict:
"""Returns: chosen, rejected, chosen_length, rejected_length,
chosen_labels, rejected_labels, chosen_reward, rejected_reward, ignore_example"""
def dpo_custom_collate(
batch: list[dict],
eos_id: int,
reset_position_ids: bool = False,
reset_attention_mask: bool = False,
eod_mask_loss: bool = False,
pad_length_to_multiple_of: int | None = None,
) -> dict[str, torch.Tensor]:
"""Pads and collates DPO batch with distributed max-length sync."""
Import
from nemo_aligner.data.nlp.builders import build_train_valid_test_dpo_datasets
from nemo_aligner.algorithms.dpo import dpo_custom_collate
I/O Contract
Inputs (build_train_valid_test_dpo_datasets)
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| cfg | DictConfig | Yes | Data config |
| data_prefix | str | Yes | Path to JSONL with prompt/chosen_response/rejected_response |
| seq_length | int | Yes | Maximum sequence length |
| tokenizer | TokenizerSpec | Yes | Model tokenizer |
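A minimal input record matching the data_prefix contract could look like the following. The field names come from the description above; the field values are invented purely for illustration.

```python
import json

# Field names from the data contract; the values are illustrative only.
record = {
    "prompt": "What is the capital of France?",
    "chosen_response": "The capital of France is Paris.",
    "rejected_response": "France is in Europe.",
}
line = json.dumps(record)  # the JSONL file holds one such object per line
```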
Outputs (build_train_valid_test_dpo_datasets)
| Name | Type | Description |
| --- | --- | --- |
| train_ds | DPOModelDataset | Training preference dataset |
| val_ds | DPOModelDataset | Validation dataset |
| test_ds | DPOModelDataset | Test dataset |
Inputs (dpo_custom_collate)
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| batch | list[dict] | Yes | List of dataset items |
| eos_id | int | Yes | EOS token ID for padding |
Outputs (dpo_custom_collate)
| Name | Type | Description |
| --- | --- | --- |
| output | dict | chosen, rejected, chosen_labels, rejected_labels, attention_mask, position_ids, chosen_rewards, rejected_rewards (all Tensor) |
Usage Examples
from functools import partial

from nemo_aligner.data.nlp.builders import build_train_valid_test_dpo_datasets
from nemo_aligner.algorithms.dpo import dpo_custom_collate

# `cfg` is the experiment config and `model` the model being aligned.
train_ds, val_ds, test_ds = build_train_valid_test_dpo_datasets(
    cfg=cfg.model.data,
    data_prefix=cfg.model.data.data_prefix,
    data_impl="jsonl",
    splits_string="950,25,25",
    train_valid_test_num_samples=[20000, 500, 500],
    seq_length=cfg.model.data.seq_length,
    seed=cfg.model.seed,
    tokenizer=model.tokenizer,
)

# Bind eos_id so the collate function can be handed to a DataLoader as-is.
collate_fn = partial(dpo_custom_collate, eos_id=model.tokenizer.eos_id)
Related Pages
Knowledge Sources
NLP, Data_Engineering