Principle:OpenRLHF OpenRLHF DPO Loss Computation
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Loss_Functions |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A family of loss functions that optimize language model policies directly from preference data by comparing log-probability ratios between policy and reference models.
Description
DPO Loss Computation implements the core mathematical operation of Direct Preference Optimization. It computes the implicit reward margin between chosen and rejected responses by comparing the log-probability ratios of the policy model against a frozen reference model. Minimizing the loss pushes the policy to assign the chosen response a higher probability than the rejected one, measured relative to the reference model rather than in absolute terms.
Three variants are supported: standard DPO, conservative DPO (with label smoothing), and IPO (Identity Preference Optimization with squared loss).
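The implicit reward margin described above can be illustrated for a single preference pair. The following is a minimal sketch, not OpenRLHF's code: the function name `implicit_reward_margin`, its argument names, and the sample log-probabilities are all hypothetical, and the inputs are assumed to be sequence-level log-probabilities (summed over response tokens).

```python
def implicit_reward_margin(policy_chosen_logp, policy_rejected_logp,
                           ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (chosen_reward, rejected_reward, margin) for one preference pair.

    Hypothetical helper for illustration; inputs are sequence log-probabilities.
    """
    # Implicit reward: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The margin is what the DPO-family losses operate on.
    return chosen_reward, rejected_reward, chosen_reward - rejected_reward


# Made-up numbers: the policy likes the chosen response slightly more than
# the reference does, and the rejected response slightly less.
chosen_r, rejected_r, margin = implicit_reward_margin(
    policy_chosen_logp=-12.0, policy_rejected_logp=-15.0,
    ref_chosen_logp=-13.0, ref_rejected_logp=-14.0, beta=0.1,
)
```

A positive margin means the policy already prefers the chosen response more strongly than the reference does; if the policy equals the reference, both implicit rewards and the margin are zero.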
Usage
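Used internally by DPOTrainer. The variant is selected via the args.ipo (boolean) and args.label_smoothing (float) parameters: setting args.ipo selects IPO; otherwise a positive args.label_smoothing gives conservative DPO, and a value of 0 gives standard DPO.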
Theoretical Basis
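Standard DPO loss:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta\, h_\theta(x, y_w, y_l)\right)\right]$$

Conservative DPO (cDPO): Adds label smoothing $\epsilon$:

$$\mathcal{L}_{\text{cDPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[(1 - \epsilon)\,\log \sigma\left(\beta\, h_\theta(x, y_w, y_l)\right) + \epsilon\,\log \sigma\left(-\beta\, h_\theta(x, y_w, y_l)\right)\right]$$

IPO: Replaces sigmoid with squared loss:

$$\mathcal{L}_{\text{IPO}} = \mathbb{E}_{(x, y_w, y_l)}\left[\left(h_\theta(x, y_w, y_l) - \frac{1}{2\beta}\right)^2\right]$$

where

$$h_\theta(x, y_w, y_l) = \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}$$

is the difference between the policy-to-reference log-ratios of the chosen response $y_w$ and the rejected response $y_l$ for prompt $x$, $\pi_\theta$ is the policy, $\pi_{\text{ref}}$ is the frozen reference model, $\sigma$ is the logistic sigmoid, $\beta$ is the temperature that scales the implicit reward, and $\epsilon$ is the label-smoothing coefficient. Setting $\epsilon = 0$ in cDPO recovers standard DPO.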
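As a rough illustration of how the three variants differ, the per-example losses can be written as one scalar function of the log-ratio difference between the chosen and rejected responses (policy log-ratio minus reference log-ratio). This is a sketch under stated assumptions rather than the OpenRLHF implementation: `preference_loss` and `log_sigmoid` are hypothetical names, the `ipo` and `label_smoothing` arguments only mirror the trainer flags by name, and a real trainer would operate on batched tensors and average over the batch.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x)) for a Python float."""
    # Branch on the sign so exp() never receives a large positive argument.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(h, beta=0.1, label_smoothing=0.0, ipo=False):
    """Per-example loss for one pair (hypothetical helper).

    h is (policy chosen-minus-rejected log-ratio) minus the same quantity
    under the reference model.
    """
    if ipo:
        # IPO: squared distance between h and the target 1 / (2 * beta).
        return (h - 1.0 / (2.0 * beta)) ** 2
    # Standard DPO when label_smoothing == 0, conservative DPO otherwise.
    return (-(1.0 - label_smoothing) * log_sigmoid(beta * h)
            - label_smoothing * log_sigmoid(-beta * h))


dpo = preference_loss(2.0, beta=0.1)
cdpo = preference_loss(2.0, beta=0.1, label_smoothing=0.1)
ipo = preference_loss(2.0, beta=0.1, ipo=True)
```

One design consequence visible here: the standard DPO term keeps decreasing as the difference grows, while the IPO term is minimized at a finite target of 1 / (2 * beta), which bounds how far the policy is pushed away from the reference.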