Principle: Eric Mitchell Direct Preference Optimization Batch Data Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Pipeline, Preprocessing, NLP |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A data iteration pattern that loads, tokenizes, and batches preference datasets for either supervised fine-tuning (SFT) or direct preference optimization (DPO) training, with configurable epoch and example limits.
Description
The batch data pipeline is the central data feeding mechanism that bridges raw preference datasets and the training loop. It orchestrates:
- Dataset loading: Fetching one or more named datasets via the get_dataset dispatcher
- Flattening: Converting the nested prompt-responses structure into flat (prompt, responses, pairs, sft_target) tuples
- Mode branching: In SFT mode, using only the sft_target as both chosen and rejected. In DPO mode, iterating over all preference pairs for each prompt.
- Tokenization: Applying tokenize_batch_element to each example with proper truncation
- Batching and collation: Grouping examples into fixed-size batches and padding to uniform lengths
- Epoch/example control: Stopping after a specified number of epochs or examples, whichever comes first
- Reproducible shuffling: Using seeded randomness for data ordering
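The steps above can be condensed into a minimal sketch. The in-memory `RAW` dictionary and the `iterate_examples` name are illustrative assumptions, not the reference implementation; only the nested prompt → {responses, pairs, sft_target} shape, the mode branching, the seeded shuffling, and the epoch/example stopping rules come from the description above.

```python
import random

# Hypothetical in-memory stand-in for what the dataset loader returns:
# {prompt: {"responses": [...], "pairs": [(winner, loser), ...], "sft_target": str}}
RAW = {
    "Q1: ": {"responses": ["a", "b"], "pairs": [(0, 1)], "sft_target": "a"},
    "Q2: ": {"responses": ["c", "d", "e"], "pairs": [(0, 2), (1, 2)], "sft_target": "c"},
}

def flatten(raw):
    """Convert the nested prompt -> data mapping into flat tuples."""
    return [
        (prompt, d["responses"], d["pairs"], d["sft_target"])
        for prompt, d in raw.items()
    ]

def iterate_examples(raw, sft_mode, n_epochs=1, n_examples=None, seed=0):
    """Yield (prompt, chosen, rejected) with seeded shuffling and limits."""
    flat = flatten(raw)
    rng = random.Random(seed)  # seeded for reproducible ordering
    emitted = 0
    for _ in range(n_epochs):
        rng.shuffle(flat)
        for prompt, responses, pairs, sft_target in flat:
            if sft_mode:
                # SFT: the preferred target serves as both chosen and rejected
                candidates = [(sft_target, sft_target)]
            else:
                # DPO: every preference pair becomes one example
                candidates = [(responses[w], responses[l]) for w, l in pairs]
            for chosen, rejected in candidates:
                if n_examples is not None and emitted >= n_examples:
                    return  # example limit reached before epoch limit
                yield prompt, chosen, rejected
                emitted += 1
```

Stopping "whichever comes first" falls out naturally: the epoch loop bounds full passes, while the `n_examples` check can cut a pass short mid-epoch.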
Usage
Use this principle when you need to feed tokenized preference data to a training loop. The pipeline is used for both training and evaluation data, with different configurations for split, batch size, and iteration limits.
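As a sketch of how the two call sites might differ, here are hypothetical training and evaluation configurations. All parameter names and values are assumptions for illustration, not the actual API:

```python
# Hypothetical configurations for the two call sites of the pipeline.
train_config = dict(
    names=["hh"],       # dataset name(s) passed to the loader (assumed)
    split="train",
    batch_size=64,
    shuffle=True,       # seeded shuffling for reproducible training order
    n_epochs=1,         # stop after one full pass...
    n_examples=None,    # ...or after a fixed example count, if set
    seed=0,
)
eval_config = dict(
    names=["hh"],
    split="test",
    batch_size=32,
    shuffle=False,      # deterministic evaluation order
    n_epochs=None,
    n_examples=256,     # evaluate on a fixed-size subset
    seed=0,
)
```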
Theoretical Basis
Preference-based training requires structured data where each example contains a prompt paired with preferred and dispreferred responses. The pipeline converts this into tensor batches suitable for gradient computation.
The two modes serve different training objectives:
- SFT mode: Uses maximum likelihood on the preferred response only (sft_target), treating it as standard language model training
- DPO mode: Uses all preference pairs, providing both chosen and rejected responses for contrastive learning
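One consequence of the mode split is that the same dataset yields different example counts: in SFT mode each prompt contributes exactly one example, while in DPO mode it contributes one example per preference pair. A small illustration (the helper name is hypothetical):

```python
def examples_per_prompt(n_pairs, sft_mode):
    """How many training examples one prompt contributes in each mode."""
    return 1 if sft_mode else n_pairs

# A prompt where response 0 is preferred over responses 1 and 2:
pairs = [(0, 1), (0, 2)]
assert examples_per_prompt(len(pairs), sft_mode=True) == 1   # sft_target only
assert examples_per_prompt(len(pairs), sft_mode=False) == 2  # one per pair
```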
Pseudo-code:
```
# Abstract batch pipeline (NOT actual implementation)
for dataset_name in dataset_names:
    raw_data = load_dataset(dataset_name)
    for prompt, data in raw_data.items():
        if sft_mode:
            yield tokenize(prompt, data.sft_target, data.sft_target)
        else:
            for winner_idx, loser_idx in data.pairs:
                yield tokenize(prompt, data.responses[winner_idx], data.responses[loser_idx])
```
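The pseudo-code above ends at tokenized examples; the remaining step is batching and collation. A minimal sketch of that step, assuming each tokenized example is a dict of integer-ID lists (the field name `chosen_ids` and the pad ID of 0 are illustrative, not the reference layout):

```python
def collate(examples, pad_token_id=0):
    """Right-pad each field of a list of {key: list[int]} dicts
    to the longest sequence in the batch for that key."""
    batch = {}
    for k in examples[0].keys():
        max_len = max(len(ex[k]) for ex in examples)
        batch[k] = [ex[k] + [pad_token_id] * (max_len - len(ex[k])) for ex in examples]
    return batch

def batched(examples, batch_size):
    """Group flat examples into fixed-size, padded batches."""
    for i in range(0, len(examples), batch_size):
        yield collate(examples[i : i + batch_size])

exs = [{"chosen_ids": [5, 6, 7]}, {"chosen_ids": [8]}]
batch = next(batched(exs, batch_size=2))
# batch["chosen_ids"] == [[5, 6, 7], [8, 0, 0]]
```

Padding to the per-batch maximum (rather than a global maximum) keeps batches as small as their longest member allows.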