Workflow:Eric mitchell Direct preference optimization Custom Dataset Integration
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Data_Engineering, Preference_Learning |
| Last Updated | 2026-02-08 01:00 GMT |
Overview
Process for adding a custom preference dataset to the DPO training pipeline, from data loading through canonical format conversion to training integration.
Description
This workflow describes how to integrate a new preference dataset into the DPO codebase. The repository uses a canonical data format where each prompt maps to a dictionary containing a list of responses, preference pairs (indices indicating which response is preferred), and an SFT target response. Three reference implementations are provided (Anthropic HH-RLHF, Stanford Human Preferences, and StackExchange). Adding a new dataset involves implementing a loader function that converts the raw data into this canonical format and registering it in the dataset dispatcher.
Usage
Execute this workflow when you have a custom preference dataset (with human-labeled or model-generated preferences) that is not one of the three built-in datasets (hh, shp, se). This is needed whenever you want to train DPO on domain-specific data, proprietary preference annotations, or a new publicly available preference dataset.
Execution Steps
Step 1: Understand_Canonical_Format
Study the canonical data format expected by the training pipeline. Each dataset loader must return a dictionary mapping prompt strings to inner dictionaries with three required keys. Understanding this contract is essential before implementing a new loader.
Required data structure per prompt:
- responses - A list of all response strings associated with this prompt
- pairs - A list of tuples (preferred_index, dispreferred_index) indicating preference ordering
- sft_target - A single response string to use during SFT training (typically the highest-quality response)
Key considerations:
- Prompts should follow the format: "\n\nHuman: {question}\n\nAssistant:"
- Response strings should begin with a single space (e.g., " This is the response") so that they concatenate cleanly onto a prompt ending in "Assistant:"
- Multiple preference pairs per prompt are supported
- The sft_target may or may not be one of the responses in the responses list
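The structure and formatting conventions above can be illustrated with a minimal sketch. The helper names format_prompt and format_response and the example question are hypothetical and are not part of the repository; only the three keys and the prompt template come from the description above.

```python
def format_prompt(question: str) -> str:
    """Wrap a raw question in the Human/Assistant prompt template (hypothetical helper)."""
    return f"\n\nHuman: {question}\n\nAssistant:"


def format_response(text: str) -> str:
    """Return the response with exactly one leading space (hypothetical helper)."""
    return " " + text.strip()


prompt = format_prompt("What is the boiling point of water at sea level?")

# Invented example data: one prompt with three responses and two preference pairs.
data = {
    prompt: {
        "responses": [
            format_response("100 degrees Celsius."),
            format_response("It depends."),
            format_response("About 50 degrees Celsius."),
        ],
        # Each tuple is (preferred_index, dispreferred_index) into `responses`.
        "pairs": [(0, 1), (0, 2)],
        # Here the SFT target happens to be one of the listed responses.
        "sft_target": format_response("100 degrees Celsius."),
    }
}
```

Every index in pairs refers to a position in that prompt's own responses list, so indices restart from zero for each prompt.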
Step 2: Implement_Loader_Function
Create a new loader function (e.g., get_xyz) in the preference_datasets module following the pattern of the existing reference implementations (get_hh, get_shp, get_se). The function takes a split name, a silent flag, and an optional cache directory, and returns the canonical data dictionary.
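As a rough sketch of what such a loader might look like, the example below converts chosen/rejected records into the canonical format. The raw field names (question, chosen, rejected) and the in-memory _RAW_XYZ table are assumptions made so the example runs offline; a real loader would more likely read its rows with the HuggingFace datasets library and pass cache_dir through to it.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple, Union

CanonicalData = Dict[str, Dict[str, Union[List[Tuple[int, int]], List[str], str]]]

# Hypothetical stand-in for the raw dataset; the field names are assumptions.
_RAW_XYZ = {
    "train": [
        {"question": "Name a prime number.", "chosen": "7 is prime.", "rejected": "8 is prime."},
        {"question": "Name a prime number.", "chosen": "11 is prime.", "rejected": "9 is prime."},
    ],
    "test": [
        {"question": "What is 2 + 2?", "chosen": "4.", "rejected": "5."},
    ],
}


def get_xyz(split: str, silent: bool = False, cache_dir: Optional[str] = None) -> CanonicalData:
    """Load the hypothetical 'xyz' dataset and convert it to the canonical format."""
    if not silent:
        print(f"Loading xyz dataset ({split} split)...")
    # cache_dir is unused in this offline sketch; a real loader would forward it
    # to whatever library downloads the data.
    rows = _RAW_XYZ[split]

    data = defaultdict(lambda: {"responses": [], "pairs": [], "sft_target": ""})
    for row in rows:
        prompt = f"\n\nHuman: {row['question']}\n\nAssistant:"
        chosen = " " + row["chosen"].strip()
        rejected = " " + row["rejected"].strip()
        entry = data[prompt]
        n = len(entry["responses"])
        entry["responses"].extend([chosen, rejected])
        entry["pairs"].append((n, n + 1))  # (preferred, dispreferred)
        entry["sft_target"] = chosen  # the last chosen response seen for this prompt
    return dict(data)


train = get_xyz("train", silent=True)
```

Rows that share a prompt are merged into one entry, which is how a single prompt ends up with several preference pairs.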
Key considerations:
- Use HuggingFace datasets library for loading when possible (enables caching)
- Handle train/test split logic within the loader
- Apply any necessary data cleaning (e.g., HTML stripping, score filtering)
- For datasets with numeric scores, derive preference pairs from score comparisons (see get_se for the all-pairs approach and get_shp for the approach that filters pairs by score ratio)
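Deriving pairs from numeric scores can be sketched as follows. This is an illustrative helper, not the code of get_se or get_shp: the name pairs_from_scores, the choice to skip ties, the assumption that scores are positive, and the ratio threshold of 2.0 are all assumptions of this example.

```python
from itertools import combinations
from typing import List, Optional, Tuple


def pairs_from_scores(scores: List[float], min_ratio: Optional[float] = None) -> List[Tuple[int, int]]:
    """Build (preferred_index, dispreferred_index) pairs from per-response scores.

    With min_ratio=None every non-tied pair is kept (all-pairs). Otherwise a pair
    is kept only if the higher score is at least min_ratio times the lower one.
    Assumes positive scores.
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] == scores[j]:
            continue  # assumption: a tie carries no preference signal
        win, lose = (i, j) if scores[i] > scores[j] else (j, i)
        if min_ratio is not None and scores[win] < min_ratio * scores[lose]:
            continue  # score gap too small to trust the preference
        pairs.append((win, lose))
    return pairs


scores = [10.0, 4.0, 6.0]
all_pairs = pairs_from_scores(scores)                # [(0, 1), (0, 2), (2, 1)]
filtered = pairs_from_scores(scores, min_ratio=2.0)  # [(0, 1)]

# One simple choice of SFT target: the index of the highest-scoring response.
best_index = max(range(len(scores)), key=scores.__getitem__)  # 0
```

Filtering by ratio trades dataset size for label confidence, since pairs with nearly equal scores are the ones most likely to be noisy.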
Step 3: Register_Dataset
Add the new dataset to the get_dataset dispatcher function so it can be referenced by name in training commands. This involves adding an elif branch that calls the new loader function when the dataset name matches.
Key considerations:
- The name used in get_dataset must match the CLI argument (e.g., datasets=[xyz])
- The assertion after loading checks that the first entry of the returned dict has exactly the keys: responses, pairs, sft_target
- The dataset name determines the truncation mode in get_batch_iterator (keep_end for hh, keep_start for all others), so a newly added dataset is truncated with keep_start unless that check is also changed
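The registration step might look roughly like the sketch below. The two loaders are stubs that return a single canned entry so the example runs on its own, and truncation_mode_for is a hypothetical helper that only restates the name-based rule described above; the actual dispatcher in the repository may differ in its details.

```python
from typing import Optional


def _stub_entry(question: str) -> dict:
    """Build a one-prompt canonical dict (stand-in for a real loader's output)."""
    prompt = f"\n\nHuman: {question}\n\nAssistant:"
    return {prompt: {"responses": [" good", " bad"], "pairs": [(0, 1)], "sft_target": " good"}}


def get_hh(split: str, silent: bool = False, cache_dir: Optional[str] = None) -> dict:
    return _stub_entry("hh example")  # stub, not the real loader


def get_xyz(split: str, silent: bool = False, cache_dir: Optional[str] = None) -> dict:
    return _stub_entry("xyz example")  # stub for the new dataset


def get_dataset(name: str, split: str, silent: bool = False, cache_dir: Optional[str] = None) -> dict:
    """Dispatch on the dataset name and sanity-check the result."""
    if name == "hh":
        data = get_hh(split, silent=silent, cache_dir=cache_dir)
    elif name == "xyz":  # the new branch; the name must match datasets=[xyz]
        data = get_xyz(split, silent=silent, cache_dir=cache_dir)
    else:
        raise ValueError(f"Unknown dataset '{name}'")

    # Only the first entry is inspected here, so malformed later entries pass.
    first_keys = set(list(data.values())[0].keys())
    assert first_keys == {"responses", "pairs", "sft_target"}, f"Unexpected keys: {first_keys}"
    return data


def truncation_mode_for(name: str) -> str:
    """Hypothetical helper restating the name-based truncation rule."""
    return "keep_end" if name == "hh" else "keep_start"
```

If the new dataset has long multi-turn prompts where the most recent turns matter most, the name-based truncation rule is worth revisiting, since keep_start would discard the end of an over-length prompt.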
Step 4: Validate_And_Train
Run SFT training with the new dataset to verify the data pipeline works end-to-end. Check that the dataset loads correctly, tokenization produces valid sequences, and the training loop runs without errors. Then proceed to DPO training using the SFT checkpoint.
Key considerations:
- Start with a small model (e.g., gpt2-large) for rapid validation before scaling up
- Verify that the data is properly formatted by inspecting the first few batches (prompt template, leading space on responses, and where long sequences are truncated)
- The new dataset can be combined with existing datasets (e.g., datasets=[hh,xyz])
- Monitor that SFT loss decreases and eval metrics are reasonable before proceeding to DPO
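Before launching a training run, it can help to check a loader's output directly. The sketch below is one possible pre-flight check; the function name find_format_problems and the specific rules it enforces are assumptions based on the format described in Step 1, not something the repository provides.

```python
from typing import Dict, List

REQUIRED_KEYS = {"responses", "pairs", "sft_target"}


def find_format_problems(data: Dict[str, dict]) -> List[str]:
    """Return human-readable descriptions of format violations (empty if none)."""
    problems = []
    for prompt, entry in data.items():
        label = repr(prompt[:40])
        if set(entry.keys()) != REQUIRED_KEYS:
            problems.append(f"{label}: keys are {sorted(entry.keys())}")
            continue  # the remaining checks need all three keys
        if not prompt.endswith("\n\nAssistant:"):
            problems.append(f"{label}: prompt does not end with the Assistant marker")
        n = len(entry["responses"])
        for win, lose in entry["pairs"]:
            if win == lose or not (0 <= win < n and 0 <= lose < n):
                problems.append(f"{label}: invalid pair ({win}, {lose}) for {n} responses")
        for text in list(entry["responses"]) + [entry["sft_target"]]:
            if not text.startswith(" "):
                problems.append(f"{label}: response {text[:20]!r} lacks a leading space")
    return problems


good = {
    "\n\nHuman: Hi\n\nAssistant:": {
        "responses": [" Hello!", " Go away."],
        "pairs": [(0, 1)],
        "sft_target": " Hello!",
    }
}
bad = {
    "Hi": {
        "responses": ["Hello!", " Go away."],
        "pairs": [(0, 2)],
        "sft_target": " Hello!",
    }
}

print(find_format_problems(good))  # []
for problem in find_format_problems(bad):
    print(problem)
```

Unlike the dispatcher's assertion, this walks every entry, so it catches problems that only appear deep in the dataset.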