# Principle: Alibaba ROLL DPO Loss Computation
| Field | Value |
|---|---|
| Knowledge Sources | |
| Domains | Alignment, Optimization |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview
A loss-computation principle implementing Direct Preference Optimization (DPO) and its variants (IPO, cDPO) for preference-based LLM alignment.
## Description
DPO Loss Computation implements the core training objective that optimizes a policy to prefer chosen responses over rejected ones. The loss compares log probability ratios between the policy and reference models for both chosen and rejected responses. Three variants are supported:
- Standard DPO: Sigmoid loss on the log-ratio difference
- IPO: Squared-error loss on the margin, for better calibration
- cDPO: Conservative DPO with label smoothing for noisy preferences
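The three variants above can be sketched as a single per-example function. This is an illustrative sketch, not the ROLL implementation: all names (`preference_losses`, the `*_logp` arguments) are hypothetical, real trainers operate on batched tensors rather than scalars, and the formulas assumed are the standard ones from the DPO, IPO, and cDPO literature.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def preference_losses(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
    label_smoothing: float = 0.1,
) -> dict:
    """Per-example DPO-family losses from summed sequence log-probs.

    Hypothetical sketch: a real implementation vectorizes this over a batch.
    """
    # Log-probability ratios of policy vs. reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Margin between chosen and rejected log-ratios.
    delta = chosen_ratio - rejected_ratio

    # Standard DPO: sigmoid (logistic) loss on the scaled margin.
    dpo = -math.log(sigmoid(beta * delta))
    # IPO: squared-error loss pulling the margin toward 1 / (2 * beta).
    ipo = (delta - 1.0 / (2.0 * beta)) ** 2
    # cDPO: label-smoothed DPO, robust to noisy preference labels.
    eps = label_smoothing
    cdpo = -(1.0 - eps) * math.log(sigmoid(beta * delta)) \
           - eps * math.log(sigmoid(-beta * delta))
    return {"dpo": dpo, "ipo": ipo, "cdpo": cdpo}
```

Note that cDPO with `label_smoothing=0` reduces exactly to standard DPO, and the IPO loss vanishes when the margin equals `1 / (2 * beta)`.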
## Usage
Use during the policy update step of DPO training, after reference log probabilities have been computed.
## Theoretical Basis

### Standard DPO Loss

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \Delta\right)\right],
\qquad
\Delta = \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
$$

Where:

- $\pi_\theta$ is the policy being trained and $\pi_{\mathrm{ref}}$ is the frozen reference model
- $y_w$ and $y_l$ are the chosen and rejected responses for prompt $x$
- $\beta$ scales the implicit KL penalty toward the reference model
- $\sigma$ is the logistic sigmoid

### IPO Loss

$$
\mathcal{L}_{\mathrm{IPO}} = \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\left(\Delta - \frac{1}{2\beta}\right)^{2}\right]
$$

### cDPO Loss (Label Smoothing)

$$
\mathcal{L}_{\mathrm{cDPO}} = \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[-(1-\varepsilon)\log\sigma(\beta\Delta) - \varepsilon\log\sigma(-\beta\Delta)\right]
$$

with label-smoothing parameter $\varepsilon \in [0, 0.5)$; setting $\varepsilon = 0$ recovers the standard DPO loss.
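One consequence of the standard DPO loss worth making explicit (it follows directly from differentiating the sigmoid loss; this page does not derive it): the per-example gradient with respect to the margin $\Delta$ is $-\beta\,\sigma(-\beta\Delta)$, so pairs the policy already ranks correctly receive vanishing gradient while misranked pairs receive weight approaching $\beta$. A minimal sketch with hypothetical names, verifiable by finite differences:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(delta: float, beta: float = 0.1) -> float:
    # Per-example DPO loss as a function of the chosen-minus-rejected
    # log-ratio margin delta.
    return -math.log(sigmoid(beta * delta))


def dpo_grad(delta: float, beta: float = 0.1) -> float:
    # d(loss)/d(delta) = -beta * sigmoid(-beta * delta): small in magnitude
    # for large positive delta (correctly ranked pairs), close to -beta for
    # large negative delta (misranked pairs).
    return -beta * sigmoid(-beta * delta)
```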
## Related Pages

### Implemented By

### Related Heuristics

The following heuristics inform this principle: