Principle:Alibaba ROLL DPO Configuration
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Configuration |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A configuration principle for setting up Direct Preference Optimization (DPO) training with chosen/rejected response pairs and configurable loss variants.
Description
DPO Configuration manages the hyperparameters for preference-based alignment training. It extends the base configuration with DPO-specific parameters including the beta temperature parameter, IPO variant toggle, label smoothing for conservative DPO, and dataset keys for chosen/rejected response pairs. The configuration also specifies the two required clusters: actor_train (trainable policy) and reference (frozen reference model).
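The parameters described above can be sketched as a small configuration object. This is a minimal illustration, not ROLL's actual schema: the field names (`beta`, `use_ipo`, `label_smoothing`, `chosen_key`, `rejected_key`) and the validation rules are assumptions chosen to mirror the description.

```python
from dataclasses import dataclass

@dataclass
class DPOConfigSketch:
    """Hypothetical sketch of a DPO configuration; field names are
    illustrative assumptions, not ROLL's real attribute names."""
    beta: float = 0.1             # temperature on the policy/reference log-ratio margin
    use_ipo: bool = False         # toggle the IPO squared-loss variant
    label_smoothing: float = 0.0  # epsilon for conservative DPO (cDPO)
    chosen_key: str = "chosen"       # dataset field holding preferred responses
    rejected_key: str = "rejected"   # dataset field holding dispreferred responses
    # the two required clusters
    actor_train_cluster: str = "actor_train"  # trainable policy
    reference_cluster: str = "reference"      # frozen reference model

    def validate(self) -> None:
        # basic sanity checks on the DPO-specific hyperparameters
        if self.beta <= 0.0:
            raise ValueError("beta must be positive")
        if not 0.0 <= self.label_smoothing < 0.5:
            raise ValueError("label_smoothing must lie in [0, 0.5)")
```

A pipeline would construct this once, call `validate()`, and hand the chosen/rejected keys to the dataset loader and the cluster names to the worker scheduler.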
Usage
Use when setting up a DPO training pipeline for LLM alignment using preference data.
Theoretical Basis
DPO directly optimizes the policy against a frozen reference model, removing the separate reward-model stage. Over preference triples of prompt $x$, chosen response $y_w$, and rejected response $y_l$, the objective is:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

where $\sigma$ is the logistic function and $\beta$ controls how sharply the implicit reward tracks the preference data.
Key configuration parameters:
- beta: Temperature scaling the policy/reference log-ratio margin; larger values sharpen the implied preference distribution
- IPO variant: Replaces the log-sigmoid term with a squared loss that pulls the margin toward 1/(2β), mitigating overfitting to near-deterministic preferences
- Label smoothing: Conservative DPO (cDPO); mixes in the loss of the flipped preference label with weight ε to tolerate noisy annotations
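The three loss variants above can be expressed as one per-example function of the reward margin. This is a self-contained sketch in plain Python (real trainers operate on batched log-probability tensors); the function name and signature are illustrative.

```python
import math

def dpo_loss(margin: float, beta: float = 0.1,
             use_ipo: bool = False, label_smoothing: float = 0.0) -> float:
    """Per-example preference loss, where
    margin = log(pi(y_w|x)/pi_ref(y_w|x)) - log(pi(y_l|x)/pi_ref(y_l|x)).
    Illustrative sketch; not ROLL's actual implementation."""
    if use_ipo:
        # IPO: squared loss pulling the raw margin toward 1/(2*beta)
        return (margin - 1.0 / (2.0 * beta)) ** 2
    logits = beta * margin
    # numerically stable log sigmoid(logits) and log sigmoid(-logits)
    log_sig = -math.log1p(math.exp(-logits))
    log_sig_neg = -math.log1p(math.exp(logits))
    # conservative DPO: mix in the flipped-label loss with weight epsilon;
    # epsilon = 0 recovers standard sigmoid DPO
    eps = label_smoothing
    return -(1.0 - eps) * log_sig - eps * log_sig_neg
```

At zero margin the sigmoid loss is log 2; the IPO loss vanishes exactly when the margin equals 1/(2β); and label smoothing keeps the loss bounded away from zero even for confidently ranked pairs.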
Related Pages
Implemented By
Related Heuristics
No specific heuristics inform this principle.