Principle:Princeton nlp SimPO Preference Optimization

Knowledge Sources	SimPO DPO RLHF
Domains	Deep_Learning, NLP, Preference_Optimization
Last Updated	2026-02-08 04:30 GMT

Overview

A reference-free preference optimization algorithm that aligns language models using length-normalized average log probabilities as implicit rewards.

Description

SimPO (Simple Preference Optimization) is a preference alignment method that improves upon DPO (Direct Preference Optimization) by eliminating the reference model and length-normalizing the reward. In standard DPO, the implicit reward for a response is the log ratio between the policy's probability and that of a frozen reference model, scaled by β. SimPO instead uses the average per-token log probability of the response under the policy as the implicit reward. This design choice has two advantages: (1) it removes the memory and compute cost of keeping a reference model, and (2) it matches the training reward to the quantity that guides generation at inference, which is approximately the average log likelihood of the response. SimPO also introduces a target reward margin (gamma) that encourages the chosen response's reward to exceed the rejected response's reward by at least a set amount, pushing the model away from assigning nearly equal scores to both.

Usage

Use SimPO when fine-tuning a language model on preference data (chosen/rejected response pairs). It is preferred over DPO when: (1) memory is constrained, since no reference model has to be loaded, (2) the model tends to exploit response length, for example by producing longer outputs to gain reward, or (3) you want the training objective to track the likelihood used at inference more closely. SimPO supports both sigmoid and hinge loss variants, with optional SFT regularization.

Theoretical Basis

The SimPO loss function operates on length-normalized average log probabilities:

$\bar{r} (x, y) = \frac{1}{| y |} \log π_{θ} (y | x)$

Where $π_{θ} (y | x)$ is the policy model's probability of generating response y given prompt x, and $| y |$ is the response length in tokens.
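As an illustration of the length-normalized reward, the following minimal sketch averages per-token log probabilities. The function name `avg_logprob` and the toy per-token probabilities are assumptions made for this example and do not come from the SimPO codebase.

```python
import math


def avg_logprob(token_logprobs):
    """Length-normalized reward: mean of the per-token log probabilities."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


# Toy data (assumed): every token has probability 0.5 under the policy.
short_response = [math.log(0.5)] * 4
long_response = [math.log(0.5)] * 16

# The summed log probability penalizes the longer response...
print(sum(short_response), sum(long_response))
# ...while the length-normalized reward treats both responses the same.
print(avg_logprob(short_response), avg_logprob(long_response))
```

With equal per-token quality, the summed log probability drops as the response grows, whereas the average stays the same, which is the property the normalization is meant to provide.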

The SimPO objective (sigmoid variant) is:

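$ℒ_{SimPO} = - \log σ (β \cdot (\bar{r} (x, y_{w}) - \bar{r} (x, y_{l})) - γ)$

Where:

$y_{w}$ is the chosen (preferred) response
$y_{l}$ is the rejected (dispreferred) response
$β$ controls the sharpness of the preference signal (default: 2.0)
$γ = β \cdot \text{gamma\_beta\_ratio}$ is the target reward margin (default ratio: 0.25, giving $γ = 0.5$ at the default $β$)

Because $γ$ is defined as a multiple of $β$, the argument of the sigmoid can equivalently be written as $β \cdot (\bar{r} (x, y_{w}) - \bar{r} (x, y_{l}) - \text{gamma\_beta\_ratio})$.

The hinge loss variant replaces the sigmoid: $ℒ_{hinge} = \max (0, 1 - (β \cdot (\bar{r} (x, y_{w}) - \bar{r} (x, y_{l})) - γ))$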
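The two loss variants can be sketched in plain Python as follows. This is a minimal per-example illustration, not the trainer's implementation: it assumes the margin enters as β·(reward gap) − γ with γ = β·gamma_beta_ratio, and the function and parameter names (`simpo_loss`, `loss_type`, `log_sigmoid`) are chosen for this example.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(avg_logp_chosen, avg_logp_rejected,
               beta=2.0, gamma_beta_ratio=0.25, loss_type="sigmoid"):
    """Per-example SimPO loss from length-normalized log probabilities."""
    gamma = beta * gamma_beta_ratio  # target reward margin
    # Scaled reward gap minus the margin (assumed formulation, see lead-in).
    z = beta * (avg_logp_chosen - avg_logp_rejected) - gamma
    if loss_type == "sigmoid":
        return -log_sigmoid(z)
    if loss_type == "hinge":
        return max(0.0, 1.0 - z)
    raise ValueError(f"unknown loss_type: {loss_type!r}")


# A reward gap exactly equal to gamma_beta_ratio sits at the sigmoid midpoint.
print(simpo_loss(-1.0, -1.25))
# A wider gap lowers the loss; the hinge variant reaches zero once z >= 1.
print(simpo_loss(-1.0, -2.0), simpo_loss(-1.0, -2.0, loss_type="hinge"))
```

Under these assumptions, a pair whose reward gap only just meets the margin still incurs a loss of log 2 in the sigmoid variant, so the model keeps being pushed to widen the gap.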

Optional SFT regularization adds a cross-entropy loss on chosen responses: $ℒ_{total} = ℒ_{SimPO} + λ_{SFT} \cdot ℒ_{CE} (y_{w})$
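A minimal sketch of how the regularizer combines with the preference loss is shown below. It assumes the cross-entropy term is the token-averaged negative log likelihood of the chosen response; `total_loss` and `sft_weight` (standing in for $λ_{SFT}$) are hypothetical names used only for this example.

```python
def total_loss(preference_loss, chosen_token_logprobs, sft_weight=0.0):
    """Preference loss plus an optional SFT (cross-entropy) term on the chosen response."""
    # Assumption: cross-entropy is averaged over the chosen response's tokens.
    cross_entropy = -sum(chosen_token_logprobs) / len(chosen_token_logprobs)
    return preference_loss + sft_weight * cross_entropy


# With sft_weight = 0 the regularizer is disabled and the loss is unchanged.
print(total_loss(0.5, [-1.0, -3.0]))
# A positive weight adds pressure to keep the chosen response likely.
print(total_loss(0.5, [-1.0, -3.0], sft_weight=0.1))
```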

Key differences from DPO:

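No reference model: SimPO scores responses with the policy's own log probabilities rather than log probability ratios against a frozen reference model
Length normalization: averaging the log probability over response tokens discourages length exploitation
Reward margin: the gamma term requires the chosen reward to exceed the rejected reward by a target amount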
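To make the contrast concrete, the sketch below computes the value passed to the sigmoid under each method for one preference pair. The DPO expression is the standard difference of summed log probability ratios; the SimPO expression assumes the margin enters as β·(reward gap) − γ. The function names and all numbers are invented for illustration.

```python
def dpo_logit(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta):
    """DPO: difference of summed log probability ratios against a reference model."""
    return beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))


def simpo_logit(policy_chosen, policy_rejected, len_chosen, len_rejected, beta, gamma):
    """SimPO: difference of length-normalized log probabilities minus a margin."""
    gap = policy_chosen / len_chosen - policy_rejected / len_rejected
    return beta * gap - gamma


# Toy summed log probabilities: a 4-token chosen and a 12-token rejected response.
d = dpo_logit(-8.0, -12.0, ref_chosen=-9.0, ref_rejected=-11.0, beta=0.1)
s = simpo_logit(-8.0, -12.0, len_chosen=4, len_rejected=12, beta=2.0, gamma=0.5)
print(d, s)
```

In this toy pair the DPO value is positive while the SimPO value is negative: per token, the longer rejected response is the more likely one, so SimPO still sees a preference to correct. Note also that only the DPO function needs reference-model log probabilities.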

Related Pages

Implemented By

Implementation:Princeton_nlp_SimPO_SimPOTrainer
