Principle:Alibaba ROLL DPO Configuration
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Configuration |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A configuration principle for setting up Direct Preference Optimization (DPO) training with chosen/rejected response pairs and configurable loss variants.
Description
DPO Configuration manages the hyperparameters for preference-based alignment training. It extends the base configuration with DPO-specific parameters including the beta temperature parameter, IPO variant toggle, label smoothing for conservative DPO, and dataset keys for chosen/rejected response pairs. The configuration also specifies the two required clusters: actor_train (trainable policy) and reference (frozen reference model).
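The parameters described above can be sketched as a small configuration object. This is a minimal illustration, not ROLL's actual schema: the field names (`beta`, `use_ipo`, `label_smoothing`, `chosen_key`, `rejected_key`) and the validation rules are assumptions chosen to mirror the description.

```python
from dataclasses import dataclass

@dataclass
class DPOConfigSketch:
    """Hypothetical sketch of a DPO configuration; field names are
    illustrative assumptions, not ROLL's real attribute names."""
    beta: float = 0.1             # temperature on the policy/reference log-ratio margin
    use_ipo: bool = False         # toggle the IPO squared-loss variant
    label_smoothing: float = 0.0  # epsilon for conservative DPO (cDPO)
    chosen_key: str = "chosen"       # dataset field holding preferred responses
    rejected_key: str = "rejected"   # dataset field holding dispreferred responses
    # the two required clusters
    actor_train_cluster: str = "actor_train"  # trainable policy
    reference_cluster: str = "reference"      # frozen reference model

    def validate(self) -> None:
        # basic sanity checks on the DPO-specific hyperparameters
        if self.beta <= 0.0:
            raise ValueError("beta must be positive")
        if not 0.0 <= self.label_smoothing < 0.5:
            raise ValueError("label_smoothing must lie in [0, 0.5)")
```

A pipeline would construct this once, call `validate()`, and hand the chosen/rejected keys to the dataset loader and the cluster names to the worker scheduler.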
Usage
Use when setting up a DPO training pipeline for LLM alignment using preference data.
Theoretical Basis
DPO directly optimizes the policy against a frozen reference model, removing the separate reward-model stage. Over preference triples of prompt $x$, chosen response $y_w$, and rejected response $y_l$, the objective is:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

where $\sigma$ is the logistic function and $\beta$ controls how sharply the implicit reward tracks the preference data.
Key configuration parameters:
- beta: Temperature scaling the policy/reference log-ratio margin; larger values sharpen the implied preference distribution
- IPO variant: Replaces the log-sigmoid term with a squared loss that pulls the margin toward 1/(2β), mitigating overfitting to near-deterministic preferences
- Label smoothing: Conservative DPO (cDPO); mixes in the loss of the flipped preference label with weight ε to tolerate noisy annotations
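The three loss variants above can be expressed as one per-example function of the reward margin. This is a self-contained sketch in plain Python (real trainers operate on batched log-probability tensors); the function name and signature are illustrative.

```python
import math

def dpo_loss(margin: float, beta: float = 0.1,
             use_ipo: bool = False, label_smoothing: float = 0.0) -> float:
    """Per-example preference loss, where
    margin = log(pi(y_w|x)/pi_ref(y_w|x)) - log(pi(y_l|x)/pi_ref(y_l|x)).
    Illustrative sketch; not ROLL's actual implementation."""
    if use_ipo:
        # IPO: squared loss pulling the raw margin toward 1/(2*beta)
        return (margin - 1.0 / (2.0 * beta)) ** 2
    logits = beta * margin
    # numerically stable log sigmoid(logits) and log sigmoid(-logits)
    log_sig = -math.log1p(math.exp(-logits))
    log_sig_neg = -math.log1p(math.exp(logits))
    # conservative DPO: mix in the flipped-label loss with weight epsilon;
    # epsilon = 0 recovers standard sigmoid DPO
    eps = label_smoothing
    return -(1.0 - eps) * log_sig - eps * log_sig_neg
```

At zero margin the sigmoid loss is log 2; the IPO loss vanishes exactly when the margin equals 1/(2β); and label smoothing keeps the loss bounded away from zero even for confidently ranked pairs.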
Related Pages
Implemented By
Related Heuristics
No specific heuristics inform this principle.