Principle: Hugging Face Alignment Handbook - Odds Ratio Preference Optimization (ORPO)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A single-stage alignment algorithm that combines supervised fine-tuning and preference optimization in one training pass, eliminating the need for a separate reference model.
Description
Odds Ratio Preference Optimization (ORPO) is an alignment method that unifies SFT and preference optimization into a single training objective. Unlike DPO, which requires a frozen reference model and a separate SFT stage, ORPO adds a preference-aware regularization term to the standard SFT loss using the odds ratio of chosen vs. rejected responses.
ORPO addresses two limitations of the SFT → DPO pipeline: (1) the need for two separate training stages and (2) the memory overhead of loading a reference model during DPO training. By combining both objectives, ORPO reduces training time and infrastructure requirements.
In the alignment-handbook, ORPO is used for training the largest model (Mixtral 8x22B, 141B parameters), where the single-stage approach avoids the memory cost of maintaining both a policy and reference model for DPO.
Usage
Use ORPO when:
- A single-stage alignment process is preferred over multi-stage SFT → DPO
- Memory constraints prevent loading both a policy and reference model (large models)
- Preference data is available but a separate SFT stage is not desired
- Simplicity of the training pipeline is a priority
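When these conditions hold, ORPO can be run with an off-the-shelf trainer. Below is a minimal sketch using TRL's `ORPOConfig`/`ORPOTrainer`; the model name, dataset, and hyperparameter values are illustrative placeholders, and exact argument names may vary across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

# Illustrative choices; any causal LM and any preference dataset with
# "prompt" / "chosen" / "rejected" columns should work.
model_name = "Qwen/Qwen2-0.5B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# beta weights the odds-ratio penalty relative to the SFT loss
config = ORPOConfig(output_dir="orpo-out", beta=0.1,
                    per_device_train_batch_size=2)
trainer = ORPOTrainer(model=model, args=config,
                      train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```

Note that, unlike `DPOTrainer`, no `ref_model` is passed: the whole alignment run is a single training stage.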
Theoretical Basis
ORPO combines the SFT loss with an odds ratio preference penalty:

$$\mathcal{L}_{ORPO} = \mathcal{L}_{SFT} + \beta \cdot \mathcal{L}_{OR}$$

where the odds ratio loss is:

$$\mathcal{L}_{OR} = -\log \sigma\!\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right)$$

and the odds of a sequence $y$ given a prompt $x$ are defined as:

$$\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$$
```python
# Abstract ORPO algorithm (NOT a real implementation)
for prompt, chosen, rejected in preference_data:
    # SFT component: standard language-modeling loss on the chosen response
    sft_loss = cross_entropy(model(chosen))

    # Preference component: odds ratio between chosen and rejected responses
    # (prob(y) denotes the model's length-normalized sequence probability)
    log_odds_chosen = log(prob(chosen) / (1 - prob(chosen)))
    log_odds_rejected = log(prob(rejected) / (1 - prob(rejected)))
    or_loss = -log_sigmoid(log_odds_chosen - log_odds_rejected)

    loss = sft_loss + beta * or_loss
    loss.backward()
```
Key insight: ORPO needs no reference model because the odds ratio itself penalizes the model for assigning high probability to rejected responses; the contrast between the chosen and rejected responses acts as the reference point.
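This can be checked numerically. The sketch below is a self-contained pure-Python toy (not the real implementation): it takes sequence-level probabilities as given scalars, whereas an actual ORPO implementation derives length-normalized probabilities from per-token log-probs.

```python
import math

def log_odds(p):
    """log p/(1-p): the log-odds of a sequence with probability p."""
    return math.log(p / (1.0 - p))

def orpo_penalty(p_chosen, p_rejected):
    """Odds-ratio loss: -log sigmoid(log-odds(chosen) - log-odds(rejected))."""
    z = log_odds(p_chosen) - log_odds(p_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-z)))

def orpo_loss(p_chosen, p_rejected, beta=0.1):
    """Single-stage objective: SFT NLL on the chosen response + beta * penalty."""
    sft = -math.log(p_chosen)
    return sft + beta * orpo_penalty(p_chosen, p_rejected)

# The penalty grows as the model assigns more probability to the rejected
# response, with no reference model anywhere in the computation:
print(orpo_penalty(0.6, 0.2))  # ≈ 0.1542
print(orpo_penalty(0.6, 0.5))  # ≈ 0.5108  (rejected more likely -> larger penalty)
print(orpo_loss(0.6, 0.2))     # ≈ 0.5262
```

With a fixed chosen-response probability, driving down the rejected-response probability is the only way to shrink the penalty, which is exactly the reference-free preference signal described above.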