Principle:Eric Mitchell Direct Preference Optimization Batch Data Pipeline

From Leeroopedia


Knowledge Sources
Domains Data_Pipeline, Preprocessing, NLP
Last Updated 2026-02-08 02:00 GMT

Overview

A data iteration pattern that loads, tokenizes, and batches preference datasets for either SFT or DPO training with configurable epoch and example limits.

Description

The batch data pipeline is the central data feeding mechanism that bridges raw preference datasets and the training loop. It orchestrates:

  • Dataset loading: Fetching one or more named datasets via the get_dataset dispatcher
  • Flattening: Converting the nested prompt-responses structure into flat (prompt, responses, pairs, sft_target) tuples
  • Mode branching: In SFT mode, using only the sft_target as both chosen and rejected. In DPO mode, iterating over all preference pairs for each prompt.
  • Tokenization: Applying tokenize_batch_element to each example with proper truncation
  • Batching and collation: Grouping examples into fixed-size batches and padding to uniform lengths
  • Epoch/example control: Stopping after a specified number of epochs or examples, whichever comes first
  • Reproducible shuffling: Using seeded randomness for data ordering
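
The batching and collation step above can be sketched as a simple pad-to-longest collate function. This is a minimal illustration, not the actual implementation; the name collate_batch, the field names, and the pad_token_id argument are assumptions.

```python
def collate_batch(examples, pad_token_id=0):
    """Pad each field of a list of tokenized examples (dicts mapping a
    field name to a list of token ids) to the longest sequence in the
    batch. A minimal sketch of pad-to-uniform-length collation."""
    batch = {}
    for key in examples[0]:
        seqs = [ex[key] for ex in examples]
        max_len = max(len(s) for s in seqs)
        # Right-pad every sequence so all rows share the same length
        batch[key] = [s + [pad_token_id] * (max_len - len(s)) for s in seqs]
    return batch
```

In a real pipeline the padded lists would typically be converted to framework tensors before being handed to the training loop.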

Usage

Use this principle when you need to feed tokenized preference data to a training loop. The pipeline is used for both training and evaluation data, with different configurations for split, batch size, and iteration limits.
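
The split between training and evaluation use might look like the following pair of configurations. All parameter names here (split, batch_size, n_epochs, n_examples, shuffle, seed) are illustrative assumptions, not taken from the source.

```python
# Hypothetical configurations for the same pipeline in two roles.
train_iterator_config = dict(
    split="train",
    batch_size=64,
    n_epochs=1,        # stop after one full pass over the data...
    n_examples=None,   # ...rather than after a fixed example count
    shuffle=True,
    seed=0,
)
eval_iterator_config = dict(
    split="test",
    batch_size=32,
    n_epochs=None,
    n_examples=256,    # cap evaluation at a fixed number of examples
    shuffle=False,
    seed=0,
)
```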

Theoretical Basis

Preference-based training requires structured data where each example contains a prompt paired with preferred and dispreferred responses. The pipeline converts this into tensor batches suitable for gradient computation.

The two modes serve different training objectives:

  • SFT mode: Uses maximum likelihood on the preferred response only (sft_target), treating it as standard language model training
  • DPO mode: Uses all preference pairs, providing both chosen and rejected responses for contrastive learning

Pseudo-code:

# Abstract batch pipeline (NOT actual implementation)
def flatten_examples(dataset_names, sft_mode):
    for dataset_name in dataset_names:
        raw_data = load_dataset(dataset_name)  # maps prompt -> preference data
        for prompt, data in raw_data.items():
            if sft_mode:
                # SFT: the preferred target stands in for both chosen and rejected
                yield tokenize(prompt, data.sft_target, data.sft_target)
            else:
                # DPO: each (winner, loser) index pair becomes one example
                for winner_idx, loser_idx in data.pairs:
                    yield tokenize(prompt, data.responses[winner_idx],
                                   data.responses[loser_idx])
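
The epoch/example limits and seeded shuffling from the Description can be layered on top of the flattened example stream. The sketch below is an assumption about how that control loop could look; the function name and signature are not from the source.

```python
import random

def iterate_batches(flat_examples, batch_size, n_epochs=None,
                    n_examples=None, seed=0):
    """Yield fixed-size batches, reshuffling before each epoch with
    seeded randomness, and stop after n_epochs or n_examples,
    whichever comes first. A minimal sketch."""
    rng = random.Random(seed)  # reproducible: ordering depends only on the seed
    epoch = emitted = 0
    while n_epochs is None or epoch < n_epochs:
        order = list(flat_examples)
        rng.shuffle(order)
        for i in range(0, len(order), batch_size):
            if n_examples is not None and emitted >= n_examples:
                return  # example limit reached first
            batch = order[i:i + batch_size]
            emitted += len(batch)
            yield batch
        epoch += 1
```

Because the generator owns its own seeded random.Random instance, two iterators built with the same seed traverse the data in the same order.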

Related Pages

Implemented By
