Principle: Axolotl Preference Dataset Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Preparation, Alignment, Reinforcement_Learning |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
A data pipeline pattern that loads and formats preference data, typically chosen/rejected response pairs (or binary accept/reject labels for KTO), for alignment training methods such as DPO, IPO, and KTO.
Description
Preference Dataset Preparation transforms raw preference data into the format required by alignment training methods. Unlike SFT data, which has single instruction-response pairs, preference data contains paired responses: a chosen (preferred) response and a rejected (dispreferred) response for each prompt. This paired structure enables the model to learn which outputs are more desirable.
The pipeline handles multiple preference formats: DPO (chosen/rejected pairs), KTO (binary thumbs up/down), and ORPO (odds ratio preference). Each format has a dedicated prompt strategy that structures the data appropriately for its respective trainer.
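The per-format record shapes can be sketched as below. Field names here follow common HuggingFace community conventions and are assumptions for illustration; the exact keys Axolotl expects depend on the configured prompt strategy.

```python
# Illustrative record shapes per preference format (field names are
# assumptions following common conventions, not a guaranteed schema).

# DPO / IPO / ORPO / SimPO: one prompt with a chosen and a rejected response
dpo_record = {
    "prompt": "Explain quantum computing",
    "chosen": "Quantum computing uses qubits...",
    "rejected": "Quantum computing is magic...",
}

# KTO: one prompt, one response, and a binary desirability label
kto_record = {
    "prompt": "Explain quantum computing",
    "completion": "Quantum computing uses qubits...",
    "label": True,  # True = thumbs up, False = thumbs down
}
```

Note the key structural difference: KTO records carry a single response plus a label, so unpaired feedback (e.g. production thumbs up/down) can be used directly without constructing pairs.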
Usage
Use this principle when preparing data for:
- DPO (Direct Preference Optimization) training
- IPO (Identity Preference Optimization) training
- KTO (Kahneman-Tversky Optimization) training
- ORPO (Odds Ratio Preference Optimization) training
- SimPO (Simple Preference Optimization) training
Theoretical Basis
Preference data captures human judgments about response quality:
Data format:
```python
# Abstract preference data structure
{
    "prompt": "Explain quantum computing",
    "chosen": "Quantum computing uses qubits...",   # Preferred response
    "rejected": "Quantum computing is magic..."     # Dispreferred response
}
```
Key processing steps:
- Loading: Fetch paired preference data from HuggingFace or local files
- Formatting: Apply chat template to prompt/chosen/rejected
- Tokenization: Encode all three parts with proper special tokens
- Deduplication: Remove exact duplicate pairs
- Splitting: Divide into train/eval sets
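The deduplication and splitting steps above can be sketched in plain Python. This is a minimal stand-in for illustration; Axolotl's own implementation may differ in details such as the hashing scheme and split configuration.

```python
import random

def dedupe_and_split(records, eval_fraction=0.05, seed=42):
    """Remove exact duplicate (prompt, chosen, rejected) triples,
    then split into train/eval sets. Pure-Python sketch."""
    seen, unique = set(), []
    for r in records:
        key = (r["prompt"], r["chosen"], r["rejected"])
        if key not in seen:      # keep only the first occurrence
            seen.add(key)
            unique.append(r)
    rng = random.Random(seed)    # seeded shuffle for reproducible splits
    rng.shuffle(unique)
    n_eval = max(1, int(len(unique) * eval_fraction)) if unique else 0
    return unique[n_eval:], unique[:n_eval]

pairs = [
    {"prompt": "p1", "chosen": "a", "rejected": "b"},
    {"prompt": "p1", "chosen": "a", "rejected": "b"},  # exact duplicate
    {"prompt": "p2", "chosen": "c", "rejected": "d"},
]
train, eval_set = dedupe_and_split(pairs, eval_fraction=0.5)
```

Exact-match deduplication matters more for preference data than for SFT: duplicated pairs effectively up-weight those comparisons in the preference loss.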