# Principle: Alibaba ROLL DPO Loss Computation
| Field | Value |
|---|---|
| Knowledge Sources | |
| Domains | Alignment, Optimization |
| Last Updated | 2026-02-07 20:00 GMT |
## Overview
A loss-computation principle implementing Direct Preference Optimization (DPO) and its variants (IPO, cDPO) for preference-based LLM alignment.
## Description
DPO Loss Computation implements the core training objective that optimizes a policy to prefer chosen responses over rejected ones. The loss compares log probability ratios between the policy and reference models for both chosen and rejected responses. Three variants are supported:
- Standard DPO: Sigmoid loss on the log-ratio difference
- IPO: Squared-error loss on the margin, for better calibration
- cDPO: Conservative DPO with label smoothing for noisy preferences
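The three variants above can be sketched as a single per-example function. This is an illustrative sketch, not the ROLL implementation: all names (`preference_losses`, the `*_logp` arguments) are hypothetical, real trainers operate on batched tensors rather than scalars, and the formulas assumed are the standard ones from the DPO, IPO, and cDPO literature.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def preference_losses(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
    label_smoothing: float = 0.1,
) -> dict:
    """Per-example DPO-family losses from summed sequence log-probs.

    Hypothetical sketch: a real implementation vectorizes this over a batch.
    """
    # Log-probability ratios of policy vs. reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Margin between chosen and rejected log-ratios.
    delta = chosen_ratio - rejected_ratio

    # Standard DPO: sigmoid (logistic) loss on the scaled margin.
    dpo = -math.log(sigmoid(beta * delta))
    # IPO: squared-error loss pulling the margin toward 1 / (2 * beta).
    ipo = (delta - 1.0 / (2.0 * beta)) ** 2
    # cDPO: label-smoothed DPO, robust to noisy preference labels.
    eps = label_smoothing
    cdpo = -(1.0 - eps) * math.log(sigmoid(beta * delta)) \
           - eps * math.log(sigmoid(-beta * delta))
    return {"dpo": dpo, "ipo": ipo, "cdpo": cdpo}
```

Note that cDPO with `label_smoothing=0` reduces exactly to standard DPO, and the IPO loss vanishes when the margin equals `1 / (2 * beta)`.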
## Usage
Use during the policy update step of DPO training, after reference log probabilities have been computed.
## Theoretical Basis

### Standard DPO Loss

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \Delta\right)\right],
\qquad
\Delta = \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
$$

Where:

- $\pi_\theta$ is the policy being trained and $\pi_{\mathrm{ref}}$ is the frozen reference model
- $y_w$ and $y_l$ are the chosen and rejected responses for prompt $x$
- $\beta$ scales the implicit KL penalty toward the reference model
- $\sigma$ is the logistic sigmoid

### IPO Loss

$$
\mathcal{L}_{\mathrm{IPO}} = \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\left(\Delta - \frac{1}{2\beta}\right)^{2}\right]
$$

### cDPO Loss (Label Smoothing)

$$
\mathcal{L}_{\mathrm{cDPO}} = \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[-(1-\varepsilon)\log\sigma(\beta\Delta) - \varepsilon\log\sigma(-\beta\Delta)\right]
$$

with label-smoothing parameter $\varepsilon \in [0, 0.5)$; setting $\varepsilon = 0$ recovers the standard DPO loss.
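One consequence of the standard DPO loss worth making explicit (it follows directly from differentiating the sigmoid loss; this page does not derive it): the per-example gradient with respect to the margin $\Delta$ is $-\beta\,\sigma(-\beta\Delta)$, so pairs the policy already ranks correctly receive vanishing gradient while misranked pairs receive weight approaching $\beta$. A minimal sketch with hypothetical names, verifiable by finite differences:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(delta: float, beta: float = 0.1) -> float:
    # Per-example DPO loss as a function of the chosen-minus-rejected
    # log-ratio margin delta.
    return -math.log(sigmoid(beta * delta))


def dpo_grad(delta: float, beta: float = 0.1) -> float:
    # d(loss)/d(delta) = -beta * sigmoid(-beta * delta): small in magnitude
    # for large positive delta (correctly ranked pairs), close to -beta for
    # large negative delta (misranked pairs).
    return -beta * sigmoid(-beta * delta)
```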
## Related Pages

### Implemented By

### Related Heuristics

The following heuristics inform this principle: