Principle:OpenRLHF OpenRLHF DPO Loss Computation

Knowledge Sources	Direct Preference Optimization; IPO: A General Theoretical Paradigm; cDPO: Conservative DPO
Domains	Alignment, Loss_Functions
Last Updated	2026-02-07 00:00 GMT

Overview

A family of loss functions that optimize language model policies directly from preference data by comparing log-probability ratios between policy and reference models.

Description

DPO Loss Computation implements the core mathematical operation of Direct Preference Optimization. For each preference pair it computes the implicit reward margin between the chosen and rejected responses: the policy-to-reference log-probability ratio of the chosen response minus that of the rejected response, where the reference model is frozen. Minimizing the loss pushes the policy to raise the chosen response's log-ratio above the rejected response's, rather than merely raising the chosen response's absolute probability.
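The margin described above can be illustrated with a minimal sketch in plain Python. The function name `implicit_reward_margin` and the four summed sequence log-probabilities are hypothetical values chosen for illustration; they are not taken from OpenRLHF.

```python
def implicit_reward_margin(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
) -> float:
    """Return the log-ratio margin between chosen and rejected responses.

    Each argument is assumed to be a summed sequence log-probability
    log pi(y | x) under the policy or the frozen reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The margin is positive when the policy favors the chosen response
    # more strongly than the reference does.
    return chosen_logratio - rejected_logratio


# Hypothetical log-probabilities for one preference pair.
margin = implicit_reward_margin(
    policy_chosen_logp=-12.0,
    policy_rejected_logp=-15.0,
    ref_chosen_logp=-13.0,
    ref_rejected_logp=-14.0,
)
# chosen log-ratio = 1.0, rejected log-ratio = -1.0, so margin = 2.0
```

A positive margin means the policy already separates the pair better than the reference; the loss variants differ only in how they penalize this margin.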

Three variants are supported: standard DPO, conservative DPO (with label smoothing), and IPO (Identity Preference Optimization with squared loss).

Usage

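Used internally by DPOTrainer. The variant is selected through two parameters: args.ipo (boolean) selects IPO, and args.label_smoothing (float) sets the cDPO smoothing coefficient. With IPO disabled and label smoothing at 0, the loss reduces to standard DPO.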

Theoretical Basis

Standard DPO loss: $L = -\log \sigma\left(\beta \left(\log \frac{\pi_{\theta}(y_{w} \mid x)}{\pi_{\mathrm{ref}}(y_{w} \mid x)} - \log \frac{\pi_{\theta}(y_{l} \mid x)}{\pi_{\mathrm{ref}}(y_{l} \mid x)}\right)\right)$

Conservative DPO (cDPO): Adds label smoothing $\epsilon$: $L = -(1 - \epsilon) \log \sigma(\beta \cdot \Delta) - \epsilon \log \sigma(-\beta \cdot \Delta)$

IPO: Replaces sigmoid with squared loss: $L = \left(\Delta - \frac{1}{2\beta}\right)^{2}$

where $\Delta = \log \frac{\pi_{\theta}(y_{w} \mid x)}{\pi_{\mathrm{ref}}(y_{w} \mid x)} - \log \frac{\pi_{\theta}(y_{l} \mid x)}{\pi_{\mathrm{ref}}(y_{l} \mid x)}$
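The three formulas can be sketched for a single preference pair as follows. This is an illustrative scalar version in plain Python, not the OpenRLHF implementation: the function names, the default `beta=0.1`, and the choice to let `ipo` take precedence over `label_smoothing` are assumptions made for the example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) for a scalar input."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(
    delta: float,
    beta: float = 0.1,            # assumed default, for illustration only
    label_smoothing: float = 0.0,
    ipo: bool = False,
) -> float:
    """Loss for one pair, given the log-ratio margin delta.

    ipo=True                    -> IPO squared loss
    ipo=False, smoothing > 0    -> conservative DPO (cDPO)
    ipo=False, smoothing == 0   -> standard DPO
    """
    if ipo:
        # IPO: regress the margin toward the target 1 / (2 * beta).
        return (delta - 1.0 / (2.0 * beta)) ** 2
    # cDPO: weight the "correct" label by (1 - eps) and the flipped label
    # by eps. With eps = 0 this is exactly the standard DPO loss.
    return (
        -(1.0 - label_smoothing) * log_sigmoid(beta * delta)
        - label_smoothing * log_sigmoid(-beta * delta)
    )


dpo = preference_loss(2.0)                         # standard DPO
cdpo = preference_loss(2.0, label_smoothing=0.1)   # conservative DPO
ipo = preference_loss(2.0, ipo=True)               # IPO
```

Two properties follow directly from the formulas: at $\Delta = 0$ the standard DPO loss equals $\log 2$, and the IPO loss is zero exactly when $\Delta = \frac{1}{2\beta}$, so IPO penalizes margins that grow past that target while the sigmoid-based losses keep decreasing.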

Related Pages

Implemented By

Implementation:OpenRLHF_OpenRLHF_DPOLoss
