Principle:Princeton nlp SimPO Preference Optimization
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Preference_Optimization |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
A reference-free preference optimization algorithm that aligns language models using length-normalized average log probabilities as implicit rewards.
Description
SimPO (Simple Preference Optimization) is a preference alignment method that modifies DPO (Direct Preference Optimization) by removing the reference model and length-normalizing the reward. In standard DPO, the implicit reward is the scaled log ratio between the policy model's probability of a response and that of a frozen reference model. SimPO instead uses the policy's average per-token log probability of the response as the implicit reward. This design choice has two advantages: (1) it removes the memory and compute cost of maintaining a reference model, and (2) it aligns the training reward with the length-normalized likelihood used as the generation metric at inference. SimPO also introduces a target reward margin (gamma) that requires the chosen reward to exceed the rejected reward by at least a fixed gap, which discourages the model from assigning nearly equal scores to both.
Usage
Use SimPO when fine-tuning a language model on preference data (chosen/rejected response pairs). It is preferred over DPO when: (1) memory is constrained (no reference model needed), (2) the model tends to exploit response length, for example by producing longer outputs to gain reward, or (3) you want training and inference objectives to be better aligned. SimPO supports both sigmoid and hinge loss variants, with optional SFT regularization.
Theoretical Basis
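The SimPO loss function operates on length-normalized average log probabilities, which serve as the implicit reward:

$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x) = \frac{\beta}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, y_{<i})$$

Where $\pi_\theta(y \mid x)$ is the policy model's probability of generating response $y$ given prompt $x$, $|y|$ is the response length in tokens, and $\beta$ is a scaling constant.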
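The length normalization can be illustrated with a minimal sketch. The function name `simpo_reward` and the token log-probabilities are hypothetical and chosen only for illustration; a real training loop would obtain per-token log-probabilities from the policy model.

```python
import math

def simpo_reward(token_logps, beta=2.0):
    """Length-normalized implicit reward: (beta / |y|) * sum of token log-probs.

    token_logps: per-token log pi_theta(y_i | x, y_<i) for one response.
    """
    if not token_logps:
        raise ValueError("response must contain at least one token")
    return beta * sum(token_logps) / len(token_logps)

# Hypothetical responses with the same per-token probability (0.5)
# but different lengths.
short_response = [math.log(0.5)] * 4
long_response = [math.log(0.5)] * 40

# The summed log-probability penalizes the longer response ...
print(sum(short_response), sum(long_response))
# ... while the length-normalized reward treats both the same.
print(simpo_reward(short_response), simpo_reward(long_response))
```

In this toy case the summed log-probabilities differ by a factor of ten, whereas the normalized rewards agree (up to floating-point rounding), which is the property the reward definition relies on.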
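The SimPO objective (sigmoid variant) is:

$$\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma \right) \right]$$

Where:
- $y_w$ is the chosen (preferred) response
- $y_l$ is the rejected (dispreferred) response
- $\beta$ controls the sharpness of the preference signal (default: 2.0)
- $\gamma$ is the target reward margin (default ratio $\gamma / \beta$: 0.25)
- $\sigma$ is the logistic sigmoid function and $\mathcal{D}$ is the preference dataset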
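A per-example version of the sigmoid objective might look like the following sketch. It assumes the margin is configured as a ratio to beta (the parameter name `gamma_beta_ratio` is an assumption made for this example), and the token log-probabilities are made up for illustration.

```python
import math

def avg_logp(token_logps):
    """Average per-token log-probability of one response."""
    return sum(token_logps) / len(token_logps)

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def simpo_sigmoid_loss(chosen_logps, rejected_logps, beta=2.0, gamma_beta_ratio=0.25):
    """Sigmoid-variant SimPO loss for a single (chosen, rejected) pair."""
    gamma = gamma_beta_ratio * beta  # assumed convention: gamma = ratio * beta
    margin = beta * (avg_logp(chosen_logps) - avg_logp(rejected_logps)) - gamma
    return -log_sigmoid(margin)

# Hypothetical token log-probabilities under the policy.
chosen = [-0.2, -0.3]          # average -0.25
rejected = [-1.0, -1.5, -2.0]  # average -1.5

# margin = 2.0 * (-0.25 - (-1.5)) - 0.5 = 2.0
loss = simpo_sigmoid_loss(chosen, rejected)
print(loss)
```

Swapping the two responses makes the margin negative and the loss larger, which is the gradient signal that pushes the chosen reward above the rejected reward by at least gamma.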
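The hinge loss variant replaces the log-sigmoid with a hinge on the same margin:

$$\mathcal{L}_{\text{hinge}}(\pi_\theta) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \max\!\left( 0,\; 1 - \left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma \right) \right) \right]$$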
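As a rough sketch of the hinge variant for a single pair, the example below applies a unit hinge to the reward margin. The function and parameter names are hypothetical, and the margin is again assumed to be configured as a ratio to beta.

```python
def avg_logp(token_logps):
    """Average per-token log-probability of one response."""
    return sum(token_logps) / len(token_logps)

def simpo_hinge_loss(chosen_logps, rejected_logps, beta=2.0, gamma_beta_ratio=0.25):
    """Hinge-variant SimPO loss for a single (chosen, rejected) pair."""
    gamma = gamma_beta_ratio * beta  # assumed convention: gamma = ratio * beta
    margin = beta * (avg_logp(chosen_logps) - avg_logp(rejected_logps)) - gamma
    # Zero loss once the margin reaches 1; linear penalty below that.
    return max(0.0, 1.0 - margin)

# Hypothetical token log-probabilities under the policy.
chosen = [-0.2, -0.3]          # average -0.25
rejected = [-1.0, -1.5, -2.0]  # average -1.5

print(simpo_hinge_loss(chosen, rejected))   # margin 2.0, so the hinge is inactive
print(simpo_hinge_loss(rejected, chosen))   # margin -3.0, so the loss is 1 - (-3.0)
```

Unlike the sigmoid variant, the hinge stops producing gradient once a pair is separated by a sufficient margin, so already well-ranked pairs no longer influence the update.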
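Optional SFT regularization adds a cross-entropy loss on chosen responses, weighted by a coefficient $\lambda \ge 0$ and written here as the token-averaged negative log-likelihood:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SimPO}} + \lambda \, \mathcal{L}_{\text{SFT}}, \qquad \mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(x, y_w) \sim \mathcal{D}} \left[ \frac{1}{|y_w|} \log \pi_\theta(y_w \mid x) \right]$$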
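The combination of preference loss and SFT term can be sketched as below. This is an illustration under stated assumptions rather than a reference implementation: the weight name `sft_weight`, its value, and the token-averaged form of the cross-entropy are all assumptions made for the example.

```python
import math

def avg_logp(token_logps):
    """Average per-token log-probability of one response."""
    return sum(token_logps) / len(token_logps)

def simpo_sigmoid_loss(chosen_logps, rejected_logps, beta=2.0, gamma_beta_ratio=0.25):
    """Sigmoid-variant SimPO loss for a single pair (gamma = ratio * beta assumed)."""
    margin = beta * (avg_logp(chosen_logps) - avg_logp(rejected_logps)) - gamma_beta_ratio * beta
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def sft_loss(chosen_logps):
    """Token-averaged negative log-likelihood of the chosen response (assumed form)."""
    return -avg_logp(chosen_logps)

def simpo_with_sft(chosen_logps, rejected_logps, sft_weight=0.1, **simpo_kwargs):
    """Preference loss plus a weighted SFT term; sft_weight=0 recovers plain SimPO."""
    preference = simpo_sigmoid_loss(chosen_logps, rejected_logps, **simpo_kwargs)
    return preference + sft_weight * sft_loss(chosen_logps)

# Hypothetical token log-probabilities under the policy.
chosen = [-0.2, -0.3]
rejected = [-1.0, -1.5, -2.0]

print(simpo_with_sft(chosen, rejected, sft_weight=0.0))
print(simpo_with_sft(chosen, rejected, sft_weight=0.1))
```

The SFT term depends only on the chosen response, so it keeps the likelihood of preferred outputs from drifting downward while the preference term widens the gap between the two rewards.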
Key differences from DPO:
- No reference model: SimPO uses the policy's own log probabilities, not log probability ratios against a reference model
- Length normalization: averaging log probabilities over tokens discourages length exploitation
- Reward margin: the gamma term enforces a minimum gap between chosen and rejected rewards
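The differences listed above can be made concrete by comparing the two reward margins side by side. The sketch below assumes the standard DPO implicit reward (beta times the summed log-probability ratio against the reference model); the function names, the DPO beta of 0.1, and the token log-probabilities are hypothetical.

```python
def dpo_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO margin: needs summed log-probs from both the policy and a reference model."""
    chosen_ratio = sum(policy_chosen) - sum(ref_chosen)
    rejected_ratio = sum(policy_rejected) - sum(ref_rejected)
    return beta * (chosen_ratio - rejected_ratio)

def simpo_margin(policy_chosen, policy_rejected, beta=2.0, gamma=0.5):
    """SimPO margin: policy-only, length-normalized, minus the target margin gamma."""
    avg_chosen = sum(policy_chosen) / len(policy_chosen)
    avg_rejected = sum(policy_rejected) / len(policy_rejected)
    return beta * (avg_chosen - avg_rejected) - gamma

# Hypothetical token log-probabilities.
chosen = [-0.2, -0.3]
rejected = [-1.0, -1.5, -2.0]

# When the policy equals the reference (e.g. at the start of training),
# the DPO margin is zero regardless of the responses.
print(dpo_margin(chosen, rejected, chosen, rejected))
# The SimPO margin needs no reference log-probs and depends on the
# policy's own average log-probabilities.
print(simpo_margin(chosen, rejected))
```

Note that `simpo_margin` takes two fewer inputs than `dpo_margin`: the reference log-probabilities never have to be computed or stored, which is where the memory saving comes from.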