
Principle:Huggingface Alignment handbook Odds Ratio Preference Optimization

From Leeroopedia


Knowledge Sources
Domains NLP, Deep_Learning, Reinforcement_Learning
Last Updated 2026-02-07 00:00 GMT

Overview

A single-stage alignment algorithm that combines supervised fine-tuning and preference optimization in one training pass, eliminating the need for a separate reference model.

Description

Odds Ratio Preference Optimization (ORPO) is an alignment method that unifies SFT and preference optimization into a single training objective. Unlike DPO, which requires a frozen reference model and a separate SFT stage, ORPO adds a preference-aware regularization term to the standard SFT loss using the odds ratio of chosen vs. rejected responses.

ORPO addresses two limitations of the SFT → DPO pipeline: (1) the need for two separate training stages and (2) the memory overhead of loading a reference model during DPO training. By combining both objectives, ORPO reduces training time and infrastructure requirements.

In the alignment-handbook, ORPO is used for training the largest model (Mixtral 8x22B, 141B parameters), where the single-stage approach avoids the memory cost of maintaining both a policy and reference model for DPO.

Usage

Use ORPO when:

  • A single-stage alignment process is preferred over multi-stage SFT → DPO
  • Memory constraints prevent loading both a policy and reference model (large models)
  • Preference data is available but a separate SFT stage is not desired
  • Simplicity of the training pipeline is a priority

Theoretical Basis

ORPO combines the SFT loss with an odds ratio preference penalty:

$\mathcal{L}_{ORPO} = \mathcal{L}_{SFT} + \beta \cdot \mathcal{L}_{OR}$

Where the odds ratio loss is:

$\mathcal{L}_{OR} = -\log \sigma\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right)$

And the odds of a sequence are defined as:

$\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$
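To make these definitions concrete, the odds and the odds-ratio penalty can be evaluated numerically. A minimal sketch in plain Python; the probability values are arbitrary illustrations, not outputs of any real model:

```python
import math

def odds(p):
    # odds of a sequence: P(y|x) / (1 - P(y|x))
    return p / (1.0 - p)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative sequence probabilities (arbitrary values)
p_chosen, p_rejected = 0.6, 0.2

# Odds-ratio loss: -log sigma(log(odds_chosen / odds_rejected))
log_odds_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
or_loss = -math.log(sigmoid(log_odds_ratio))

print(round(or_loss, 4))  # → 0.1542
```

When the chosen response is much more probable than the rejected one, the log odds ratio is large, the sigmoid saturates near 1, and the penalty approaches zero; when the model prefers the rejected response, the penalty grows.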

# Abstract ORPO training loop (pseudocode, NOT a real implementation)
for prompt, chosen, rejected in preference_data:
    # SFT component: standard next-token cross-entropy on the chosen response
    sft_loss = cross_entropy(model(prompt, chosen))

    # Preference component: odds ratio between chosen and rejected responses
    log_odds_chosen = log(prob(chosen | prompt) / (1 - prob(chosen | prompt)))
    log_odds_rejected = log(prob(rejected | prompt) / (1 - prob(rejected | prompt)))
    or_loss = -log_sigmoid(log_odds_chosen - log_odds_rejected)

    # Combined single-stage objective
    loss = sft_loss + beta * or_loss
    loss.backward()
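The pseudocode above can be turned into a numerically stable scalar computation. A hedged sketch, assuming `logp_chosen` and `logp_rejected` are (length-normalized) average token log-probabilities produced elsewhere, and `sft_nll` is the SFT cross-entropy on the chosen response; all three names are hypothetical inputs, not part of any real API:

```python
import math

def log_odds_from_logp(logp):
    # log(p / (1 - p)) computed stably from log p via log1p
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(logp_chosen, logp_rejected, sft_nll, beta=0.1):
    # Odds-ratio penalty: -log sigma(log odds_chosen - log odds_rejected)
    log_or = log_odds_from_logp(logp_chosen) - log_odds_from_logp(logp_rejected)
    or_penalty = math.log1p(math.exp(-log_or))  # equals -log sigmoid(log_or)
    # Single-stage objective: SFT loss plus weighted preference penalty
    return sft_nll + beta * or_penalty
```

Note that the penalty depends only on the policy's own probabilities for the two responses; no reference-model log-probabilities appear anywhere, which is why ORPO needs no frozen reference copy.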

Key insight: ORPO does not need a reference model because the odds ratio naturally penalizes the model for assigning high probability to rejected responses, without needing an external reference point.

Related Pages

Implemented By
