Principle:Princeton nlp SimPO Preference Optimization

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, NLP, Preference_Optimization
Last Updated 2026-02-08 04:30 GMT

Overview

A reference-free preference optimization algorithm that aligns language models using length-normalized average log probabilities as implicit rewards.

Description

SimPO (Simple Preference Optimization) is a preference alignment method that improves upon DPO (Direct Preference Optimization) by eliminating the reference model and using length-normalized log probabilities. In standard DPO, the implicit reward for a response is the scaled log ratio between its probability under the policy model and its probability under a frozen reference model, and the loss compares that reward for the chosen and rejected responses. SimPO instead uses the average per-token log probability of the response under the policy alone as the implicit reward. This design choice has two advantages: (1) it removes the memory and compute cost of running a reference model during training, and (2) it aligns the training reward with the quantity that guides generation at inference, namely the average log-likelihood of the sequence. SimPO also introduces a target reward margin (gamma) that pushes the chosen reward to exceed the rejected reward by at least that margin, which discourages the model from assigning nearly equal scores to both.

Usage

Use SimPO when fine-tuning a language model on preference data (chosen/rejected response pairs). It is preferred over DPO when: (1) memory is constrained, since no reference model has to be loaded, (2) the model tends to exploit response length, for example by producing longer outputs to raise its reward, or (3) you want the training objective to match the length-normalized likelihood used at inference more closely. SimPO supports both sigmoid and hinge loss variants, with optional SFT regularization.

Theoretical Basis

The SimPO loss function operates on length-normalized average log probabilities:

$$\bar{r}(x, y) = \frac{1}{|y|} \log \pi_\theta(y \mid x)$$

Where $\pi_\theta(y \mid x)$ is the policy model's probability of generating response $y$ given prompt $x$, and $|y|$ is the response length in tokens.
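As a minimal sketch of this reward, the snippet below averages per-token log probabilities for a single response. The function name `average_log_prob` and the token probabilities are hypothetical and chosen only for illustration; a real trainer would obtain the per-token values from the policy model's logits and mask out prompt and padding tokens.

```python
import math

def average_log_prob(token_logprobs):
    """Return the mean per-token log-probability of a response."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    # log pi(y|x) is the sum of per-token log-probabilities (chain rule),
    # so dividing by the token count gives the length-normalized reward.
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token probabilities for a three-token response.
token_probs = [0.5, 0.25, 0.5]
token_logprobs = [math.log(p) for p in token_probs]

reward = average_log_prob(token_logprobs)
print(round(reward, 4))  # -0.9242
```

The reward is always non-positive, and it moves toward zero as the policy assigns higher probability to each token of the response.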

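The SimPO objective (sigmoid variant) is:

$$\mathcal{L}_{\text{SimPO}} = -\log \sigma\big(\beta\,(\bar{r}(x, y_w) - \bar{r}(x, y_l)) - \gamma\big)$$

Where:

  • $y_w$ is the chosen (preferred) response
  • $y_l$ is the rejected (dispreferred) response
  • $\beta$ controls the sharpness of the preference signal (default: 2.0)
  • $\gamma = \beta \cdot \texttt{gamma\_beta\_ratio}$ is the target reward margin (default ratio: 0.25)

The hinge loss variant replaces the sigmoid:

$$\mathcal{L}_{\text{hinge}} = \max\big(0,\; 1 - (\beta\,(\bar{r}(x, y_w) - \bar{r}(x, y_l)) - \gamma)\big)$$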
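The two loss variants can be sketched for a single preference pair as follows. This is an illustrative sketch, not the reference implementation: `simpo_loss` and `neg_log_sigmoid` are hypothetical names, the rewards are assumed to be precomputed average log probabilities, and the margin is assumed to enter as β times the reward gap minus γ, with γ equal to β times the ratio.

```python
import math

def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z)) = log(1 + exp(-z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def simpo_loss(r_chosen, r_rejected, beta=2.0, gamma_beta_ratio=0.25,
               loss_type="sigmoid"):
    """Loss for one pair of length-normalized rewards (assumed precomputed)."""
    gamma = beta * gamma_beta_ratio            # target reward margin
    logit = beta * (r_chosen - r_rejected) - gamma
    if loss_type == "sigmoid":
        return neg_log_sigmoid(logit)
    if loss_type == "hinge":
        return max(0.0, 1.0 - logit)
    raise ValueError(f"unknown loss_type: {loss_type!r}")

# Hypothetical average log-probabilities for a chosen and a rejected response.
r_w, r_l = -0.9, -1.4
sigmoid_loss = simpo_loss(r_w, r_l)
hinge_loss = simpo_loss(r_w, r_l, loss_type="hinge")
```

Under these assumptions and the default settings, the sigmoid loss equals log 2 when the reward gap is exactly 0.25, and it falls below log 2 only once the chosen reward exceeds the rejected reward by more than that ratio.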

Optional SFT regularization adds a cross-entropy loss on chosen responses:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SimPO}} + \lambda_{\text{SFT}} \cdot \text{CE}(y_w)$$
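A rough sketch of the combined objective for one pair is shown below. It assumes the cross-entropy term is the token-averaged negative log-likelihood of the chosen response, which may differ from the reduction a particular trainer uses; `total_loss` and `sft_weight` (standing in for the SFT weight λ) are hypothetical names.

```python
import math

def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def total_loss(chosen_logprobs, rejected_logprobs, beta=2.0,
               gamma_beta_ratio=0.25, sft_weight=0.0):
    """Sigmoid SimPO loss plus an optional SFT term on the chosen response."""
    r_w = sum(chosen_logprobs) / len(chosen_logprobs)
    r_l = sum(rejected_logprobs) / len(rejected_logprobs)
    preference = neg_log_sigmoid(beta * (r_w - r_l) - beta * gamma_beta_ratio)
    # Assumption: CE(y_w) is the mean negative log-likelihood per chosen token.
    ce_chosen = -r_w
    return preference + sft_weight * ce_chosen

# Hypothetical per-token log-probabilities.
chosen = [-0.5, -1.0, -0.3]
rejected = [-1.2, -1.6]
without_sft = total_loss(chosen, rejected)
with_sft = total_loss(chosen, rejected, sft_weight=0.5)
```

With `sft_weight` set to 0 the function reduces to the plain preference loss, so the regularizer can be switched on or off without changing the rest of the computation.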

Key differences from DPO:

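  • No reference model: SimPO scores a response with the policy's own log probabilities, not with log probability ratios against a reference model
  • Length normalization: Averaging the log probability over tokens keeps the reward from depending on response length alone
  • Reward margin: The gamma term asks for a minimum gap between the chosen and rejected rewards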
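The first two differences can be illustrated with a small numeric sketch. The helper names and log probabilities below are hypothetical; `dpo_log_ratio` computes the summed log ratio that DPO scales by β, and `simpo_reward` computes the token-averaged log probability described above.

```python
def dpo_log_ratio(policy_logprobs, reference_logprobs):
    """Summed policy log-prob minus summed reference log-prob."""
    return sum(policy_logprobs) - sum(reference_logprobs)

def simpo_reward(policy_logprobs):
    """Token-averaged policy log-prob; no reference model involved."""
    return sum(policy_logprobs) / len(policy_logprobs)

# Two hypothetical responses with identical per-token confidence
# but different lengths.
short_response = [-0.5] * 4
long_response = [-0.5] * 12

summed = (sum(short_response), sum(long_response))                        # (-2.0, -6.0)
averaged = (simpo_reward(short_response), simpo_reward(long_response))    # (-0.5, -0.5)

# The DPO-style quantity needs a second set of log-probs from a reference model.
reference_short = [-0.75] * 4
log_ratio = dpo_log_ratio(short_response, reference_short)                # 1.0
```

The summed log probability changes with length even though the per-token confidence is the same, while the averaged reward does not. The example illustrates the normalization only and makes no claim about how either method behaves during training.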

Related Pages

Implemented By

Uses Heuristic
