
Principle:Alibaba ROLL DPO Configuration

From Leeroopedia


Knowledge Sources
Domains Alignment, Configuration
Last Updated 2026-02-07 20:00 GMT

Overview

A configuration principle for setting up Direct Preference Optimization training with chosen/rejected response pairs and configurable loss variants.

Description

DPO Configuration manages the hyperparameters for preference-based alignment training. It extends the base configuration with DPO-specific parameters including the beta temperature parameter, IPO variant toggle, label smoothing for conservative DPO, and dataset keys for chosen/rejected response pairs. The configuration also specifies the two required clusters: actor_train (trainable policy) and reference (frozen reference model).
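A minimal sketch of what such a configuration might look like. The field names (beta, use_ipo, label_smoothing, chosen_key, rejected_key, num_gpus) are illustrative assumptions, not ROLL's verbatim schema; consult the actual configuration reference for exact keys.

```yaml
# Hypothetical DPO training config sketch; key names are illustrative.
beta: 0.1                 # temperature controlling preference sharpness
use_ipo: false            # toggle the IPO squared-error loss variant
label_smoothing: 0.0      # >0 enables conservative DPO
chosen_key: chosen        # dataset key for preferred responses
rejected_key: rejected    # dataset key for dispreferred responses

actor_train:              # trainable policy cluster
  num_gpus: 8
reference:                # frozen reference-model cluster
  num_gpus: 8
```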

Usage

Use when setting up a DPO training pipeline for LLM alignment using preference data.

Theoretical Basis

DPO directly optimizes the policy without a separate reward model:

\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

Key configuration parameters:

  • beta: Temperature controlling preference sharpness
  • IPO variant: Replaces the log-sigmoid loss with a squared error loss on the log-ratio gap
  • Label smoothing: Conservative DPO with smoothed labels
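The interplay of these three parameters can be sketched as a per-pair loss function. This is a hedged illustration of the standard DPO/IPO/conservative-DPO math, not ROLL's actual implementation; the function name and signature are hypothetical, and real trainers operate on batched tensors rather than scalars.

```python
import math

def preference_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp,
                    beta=0.1, ipo=False, label_smoothing=0.0):
    """Scalar DPO-family loss for one chosen/rejected pair (hypothetical helper)."""
    # Log-ratios of policy vs. frozen reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = chosen_logratio - rejected_logratio
    if ipo:
        # IPO: squared error pulling the log-ratio gap toward 1/(2*beta).
        return (logits - 1.0 / (2.0 * beta)) ** 2
    # Sigmoid DPO; label_smoothing > 0 gives the conservative variant,
    # which mixes in the loss for the flipped preference label.
    log_sig = -math.log(1.0 + math.exp(-beta * logits))    # log sigmoid(beta*logits)
    log_sig_neg = -math.log(1.0 + math.exp(beta * logits)) # log sigmoid(-beta*logits)
    return -(1.0 - label_smoothing) * log_sig - label_smoothing * log_sig_neg
```

As expected, the loss shrinks as the policy's margin for the chosen response grows, and the IPO loss is exactly zero when the log-ratio gap equals 1/(2*beta).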

Related Pages

Implemented By

Related Heuristics

No specific heuristics inform this principle.
