Workflow:Eric mitchell Direct preference optimization Custom Dataset Integration

Knowledge Sources	Direct Preference Optimization DPO Paper
Domains	LLMs, Data_Engineering, Preference_Learning
Last Updated	2026-02-08 01:00 GMT

Overview

Process for adding a custom preference dataset to the DPO training pipeline, from data loading through canonical format conversion to training integration.

Description

This workflow describes how to integrate a new preference dataset into the DPO codebase. The repository uses a canonical data format where each prompt maps to a dictionary containing a list of responses, preference pairs (indices indicating which response is preferred), and an SFT target response. Three reference implementations are provided (Anthropic HH-RLHF, Stanford Human Preferences, and StackExchange). Adding a new dataset involves implementing a loader function that converts the raw data into this canonical format and registering it in the dataset dispatcher.

Usage

Execute this workflow when you have a custom preference dataset (with human-labeled or model-generated preferences) that is not one of the three built-in datasets (hh, shp, se). This is needed whenever you want to train DPO on domain-specific data, proprietary preference annotations, or a new publicly available preference dataset.

Execution Steps

Step 1: Understand_Canonical_Format

Study the canonical data format expected by the training pipeline. Each dataset loader must return a dictionary mapping prompt strings to inner dictionaries with three required keys. Understanding this contract is essential before implementing a new loader.

Required data structure per prompt:

responses - A list of all response strings associated with this prompt
pairs - A list of tuples (preferred_index, dispreferred_index), where each index is a position in the responses list and the first response is preferred over the second
sft_target - A single response string to use during SFT training (typically the highest-quality response)

Key considerations:

Prompts should follow the format: "\n\nHuman: {question}\n\nAssistant:"
Response strings should be space-prefixed (e.g., " This is the response")
Multiple preference pairs per prompt are supported
The sft_target may or may not be one of the responses in the responses list
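
As a concrete illustration of this contract, the snippet below builds one canonical entry by hand. The question and responses are invented for the example; only the three keys, the prompt template, and the space prefix come from the description above.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical example content; only the structure reflects the canonical format.
prompt = "\n\nHuman: What is the capital of France?\n\nAssistant:"

CanonicalEntry = Dict[str, Union[List[str], List[Tuple[int, int]], str]]

data: Dict[str, CanonicalEntry] = {
    prompt: {
        # Every response string starts with a single space.
        "responses": [
            " The capital of France is Paris.",
            " I think it is Lyon.",
            " France is a country in Europe.",
        ],
        # (preferred_index, dispreferred_index) into the responses list.
        # Several pairs per prompt are allowed.
        "pairs": [(0, 1), (0, 2)],
        # Used for SFT; here it happens to be one of the responses.
        "sft_target": " The capital of France is Paris.",
    }
}
```

Here response 0 is preferred over both response 1 and response 2, and it doubles as the SFT target.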

Step 2: Implement_Loader_Function

Create a new loader function (e.g., get_xyz) in the preference_datasets module following the pattern of the existing reference implementations (get_hh, get_shp, get_se). The function takes a split name, a silent flag, and an optional cache directory, and returns the canonical data dictionary.
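
A minimal sketch of such a loader is shown below. It assumes the raw data arrives as rows with question, chosen, and rejected fields; those field names and the _load_raw_xyz helper are hypothetical stand-ins for whatever the real source provides (for example, a Hugging Face datasets call), so the block stays runnable without network access.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def _load_raw_xyz(split: str, cache_dir: Optional[str] = None) -> List[dict]:
    """Hypothetical stand-in for the real data source; returns in-memory rows."""
    rows = {
        "train": [
            {"question": "How do I reverse a list in Python?",
             "chosen": "Use my_list[::-1].", "rejected": "You cannot."},
            {"question": "How do I reverse a list in Python?",
             "chosen": "Call my_list.reverse().", "rejected": "Delete it."},
        ],
        "test": [
            {"question": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
        ],
    }
    return rows[split]


def get_xyz(split: str, silent: bool = False,
            cache_dir: Optional[str] = None) -> Dict[str, dict]:
    """Convert raw xyz rows into the canonical prompt -> entry dictionary."""
    if not silent:
        print(f"Loading xyz dataset ({split} split)...")

    data = defaultdict(lambda: {"responses": [], "pairs": [], "sft_target": ""})
    for row in _load_raw_xyz(split, cache_dir):
        prompt = f"\n\nHuman: {row['question']}\n\nAssistant:"
        chosen = " " + row["chosen"]      # responses are space-prefixed
        rejected = " " + row["rejected"]

        entry = data[prompt]
        n = len(entry["responses"])
        entry["responses"].extend([chosen, rejected])
        entry["pairs"].append((n, n + 1))  # chosen is preferred over rejected
        entry["sft_target"] = chosen       # assumption: last chosen response wins
    return dict(data)
```

Rows that share a question are merged under one prompt, so a prompt can accumulate several preference pairs. Choosing the most recent chosen response as sft_target is one simple policy among several; a real loader may prefer a score-based choice.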

Key considerations:

Use the Hugging Face datasets library for loading when possible, since it caches downloaded data
Handle train/test split logic within the loader
Apply any necessary data cleaning (e.g., HTML stripping, score filtering)
For datasets with numeric scores, derive preference pairs from score comparisons (see get_se for the all-pairs approach and get_shp for the ratio-filtered approach)
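
The score-comparison idea can be sketched as follows. This is an illustrative helper, not the code in get_se or get_shp: the function name pairs_from_scores, the min_ratio parameter, and the tie-skipping rule are assumptions made for the example, and the ratio filter assumes positive scores.

```python
from itertools import combinations
from typing import List, Optional, Tuple


def pairs_from_scores(scores: List[float],
                      min_ratio: Optional[float] = None) -> List[Tuple[int, int]]:
    """Derive (preferred_index, dispreferred_index) pairs from per-response scores.

    With min_ratio=None every pair of responses with different scores is kept
    (all-pairs). With min_ratio set, a pair is kept only when the higher score
    is at least min_ratio times the lower one (assumes positive scores).
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] == scores[j]:
            continue  # a tie carries no preference signal
        hi, lo = (i, j) if scores[i] > scores[j] else (j, i)
        if min_ratio is not None and scores[hi] < min_ratio * scores[lo]:
            continue  # score gap too small to trust
        pairs.append((hi, lo))
    return pairs


all_pairs = pairs_from_scores([10, 3, 3, 1])
filtered = pairs_from_scores([10, 3, 3, 1], min_ratio=5)
```

With these scores, all_pairs is [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)] and filtered is [(0, 3)], since only the 10-versus-1 comparison clears a ratio of 5.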

Step 3: Register_Dataset

Add the new dataset to the get_dataset dispatcher function so it can be referenced by name in training commands. This involves adding an elif branch that calls the new loader function when the dataset name matches.

Key considerations:

The name used in get_dataset must match the CLI argument (e.g., datasets=[xyz])
The assertion after loading verifies that the inner dictionaries of the returned data have exactly the keys responses, pairs, and sft_target
The dataset name determines the truncation mode in get_batch_iterator (keep_end for hh, keep_start for others)
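
The registration step might look like the simplified sketch below. It shows only the new xyz branch with a stub loader; the repository's dispatcher also carries the hh, shp, and se branches. The truncation_mode_for helper is hypothetical and simply restates the rule above.

```python
from typing import Dict, Optional


def get_xyz(split: str, silent: bool = False,
            cache_dir: Optional[str] = None) -> Dict[str, dict]:
    """Stub loader (hypothetical) returning a single canonical entry."""
    return {
        "\n\nHuman: Hi\n\nAssistant:": {
            "responses": [" Hello!", " Go away."],
            "pairs": [(0, 1)],
            "sft_target": " Hello!",
        }
    }


def get_dataset(name: str, split: str, silent: bool = False,
                cache_dir: Optional[str] = None) -> Dict[str, dict]:
    """Dispatch on the dataset name used in the training command."""
    if name == "xyz":  # must match the name passed as datasets=[xyz]
        data = get_xyz(split, silent=silent, cache_dir=cache_dir)
    else:
        raise ValueError(f"Unknown dataset '{name}'")

    # Sanity check on the canonical keys (checked here on the first entry).
    first = next(iter(data.values()))
    assert set(first.keys()) == {"responses", "pairs", "sft_target"}, \
        f"Unexpected keys in dataset: {sorted(first.keys())}"
    return data


def truncation_mode_for(name: str) -> str:
    """Hypothetical helper: keep_end for hh, keep_start for everything else."""
    return "keep_end" if name == "hh" else "keep_start"
```

Under this rule a dataset registered as xyz is truncated with keep_start, which keeps the beginning of an over-long prompt; if the end of the prompt matters more for the new data, the rule would need adjusting.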

Step 4: Validate_And_Train

Run SFT training with the new dataset to verify the data pipeline works end-to-end. Check that the dataset loads correctly, tokenization produces valid sequences, and the training loop runs without errors. Then proceed to DPO training using the SFT checkpoint.

Key considerations:

Start with a small model (e.g., gpt2-large) for rapid validation before scaling up
Verify that the data is properly formatted by checking the first few batches
The new dataset can be combined with existing datasets (e.g., datasets=[hh,xyz])
Confirm that SFT loss decreases and eval metrics are reasonable before proceeding to DPO
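
Before launching a training run, a quick structural check of the loader output can catch formatting mistakes early. The helper below is a sketch, not part of the repository; validate_canonical and its checks are assumptions based on the format rules described in Step 1.

```python
from typing import Dict, List, Optional

REQUIRED_KEYS = {"responses", "pairs", "sft_target"}


def validate_canonical(data: Dict[str, dict],
                       max_prompts: Optional[int] = None) -> List[str]:
    """Return a list of human-readable problems found in the first max_prompts entries."""
    problems: List[str] = []
    for k, (prompt, entry) in enumerate(data.items()):
        if max_prompts is not None and k >= max_prompts:
            break
        if set(entry.keys()) != REQUIRED_KEYS:
            problems.append(f"prompt {k}: unexpected keys {sorted(entry.keys())}")
            continue  # remaining checks need all three keys
        if not prompt.endswith("\n\nAssistant:"):
            problems.append(f"prompt {k}: does not end with '\\n\\nAssistant:'")
        n = len(entry["responses"])
        for r in entry["responses"]:
            if not r.startswith(" "):
                problems.append(f"prompt {k}: response is not space-prefixed: {r[:30]!r}")
        for a, b in entry["pairs"]:
            if not (0 <= a < n and 0 <= b < n) or a == b:
                problems.append(f"prompt {k}: invalid pair ({a}, {b}) for {n} responses")
        if not isinstance(entry["sft_target"], str) or not entry["sft_target"]:
            problems.append(f"prompt {k}: sft_target must be a non-empty string")
    return problems
```

An empty list means the inspected entries satisfy the structural rules; it says nothing about tokenization or training behavior, which still need the SFT run described above.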
