
Principle:Alibaba ROLL DPO Loss Computation

From Leeroopedia


Knowledge Sources
Domains: Alignment, Optimization
Last Updated: 2026-02-07 20:00 GMT

Overview

A loss-computation principle implementing Direct Preference Optimization (DPO) and its variants (IPO, cDPO) for preference-based LLM alignment.

Description

DPO Loss Computation implements the core training objective that optimizes a policy to prefer chosen responses over rejected ones. The loss compares log probability ratios between the policy and reference models for both chosen and rejected responses. Three variants are supported:

  • Standard DPO: Sigmoid loss on the log-ratio difference
  • IPO: Squared loss that regresses the preference margin toward 1/(2β), for better calibration
  • cDPO: Conservative DPO with label smoothing for noisy preferences

Usage

Use during the policy update step of DPO training, after reference log probabilities have been computed.

Theoretical Basis

Standard DPO Loss

L = −log σ(β (Δ_chosen − Δ_rejected)),  where Δ = log π_θ(y|x) − log π_ref(y|x)
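As a minimal numeric sketch of this objective for a single preference pair (hypothetical function and argument names, not ROLL's actual API; real implementations operate on batched tensors of summed token log-probabilities):

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log σ(x) = -log(1 + exp(-x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    # Δ_chosen and Δ_rejected: policy-vs-reference log ratios per response
    delta_chosen = policy_chosen_logp - ref_chosen_logp
    delta_rejected = policy_rejected_logp - ref_rejected_logp
    # L = -log σ(β (Δ_chosen - Δ_rejected))
    return -log_sigmoid(beta * (delta_chosen - delta_rejected))
```

When the policy matches the reference (both Δ terms zero) the loss is log 2 ≈ 0.693, and it decreases as the policy widens its margin in favor of the chosen response.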

IPO Loss

L = (Δ_chosen − Δ_rejected − 1/(2β))²
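A sketch of the IPO variant under the same hypothetical per-pair interface as above (not ROLL's actual API); note the loss is minimized when the log-ratio margin exactly equals 1/(2β), rather than being pushed toward infinity as in the sigmoid loss:

```python
def ipo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    delta_chosen = policy_chosen_logp - ref_chosen_logp
    delta_rejected = policy_rejected_logp - ref_rejected_logp
    # L = (Δ_chosen - Δ_rejected - 1/(2β))²: regress the margin to 1/(2β)
    return (delta_chosen - delta_rejected - 1.0 / (2.0 * beta)) ** 2
```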

cDPO Loss (Label Smoothing)

L = −(1 − α) log σ(βΔ) − α log σ(−βΔ),  where Δ = Δ_chosen − Δ_rejected and α is the label-smoothing coefficient
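The cDPO variant can be sketched the same way (hypothetical names again): the loss for the flipped preference label is mixed in with weight α, so the model is never pushed to assign the observed preference probability 1 under noisy labels. Setting α = 0 recovers the standard DPO loss.

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log σ(x)
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def cdpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
              ref_chosen_logp: float, ref_rejected_logp: float,
              beta: float = 0.1, label_smoothing: float = 0.1) -> float:
    # βΔ with Δ = Δ_chosen - Δ_rejected
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # L = -(1 - α) log σ(βΔ) - α log σ(-βΔ)
    return (-(1.0 - label_smoothing) * log_sigmoid(margin)
            - label_smoothing * log_sigmoid(-margin))
```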
