
Principle:Huggingface Alignment handbook Odds Ratio Preference Optimization

From Leeroopedia


Knowledge Sources
Domains NLP, Deep_Learning, Reinforcement_Learning
Last Updated 2026-02-07 00:00 GMT

Overview

A single-stage alignment algorithm that combines supervised fine-tuning and preference optimization in one training pass, eliminating the need for a separate reference model.

Description

Odds Ratio Preference Optimization (ORPO) is an alignment method that unifies SFT and preference optimization into a single training objective. Unlike DPO, which requires a frozen reference model and a separate SFT stage, ORPO adds a preference-aware regularization term to the standard SFT loss using the odds ratio of chosen vs. rejected responses.

ORPO addresses two limitations of the SFT → DPO pipeline: (1) the need for two separate training stages and (2) the memory overhead of loading a reference model during DPO training. By combining both objectives, ORPO reduces training time and infrastructure requirements.

In the alignment-handbook, ORPO is used for training the largest model (Mixtral 8x22B, 141B parameters), where the single-stage approach avoids the memory cost of maintaining both a policy and reference model for DPO.

Usage

Use ORPO when:

  • A single-stage alignment process is preferred over multi-stage SFT → DPO
  • Memory constraints prevent loading both a policy and reference model (large models)
  • Preference data is available but a separate SFT stage is not desired
  • Simplicity of the training pipeline is a priority

Theoretical Basis

ORPO combines the SFT loss with an odds ratio preference penalty:

$\mathcal{L}_{ORPO} = \mathcal{L}_{SFT} + \beta \cdot \mathcal{L}_{OR}$

Where the odds ratio loss is:

$\mathcal{L}_{OR} = -\log \sigma\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right)$

And the odds of a sequence are defined as:

$\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$
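To make these definitions concrete, the odds and the odds-ratio penalty can be evaluated numerically. A minimal sketch in plain Python; the probability values are arbitrary illustrations, not outputs of any real model:

```python
import math

def odds(p):
    # odds of a sequence: P(y|x) / (1 - P(y|x))
    return p / (1.0 - p)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative sequence probabilities (arbitrary values)
p_chosen, p_rejected = 0.6, 0.2

# Odds-ratio loss: -log sigma(log(odds_chosen / odds_rejected))
log_odds_ratio = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
or_loss = -math.log(sigmoid(log_odds_ratio))

print(round(or_loss, 4))  # → 0.1542
```

When the chosen response is much more probable than the rejected one, the log odds ratio is large, the sigmoid saturates near 1, and the penalty approaches zero; when the model prefers the rejected response, the penalty grows.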

# Abstract ORPO training loop (pseudocode, NOT a real implementation)
for prompt, chosen, rejected in preference_data:
    # SFT component: standard next-token cross-entropy on the chosen response
    sft_loss = cross_entropy(model(prompt, chosen))

    # Preference component: odds ratio between chosen and rejected responses
    log_odds_chosen = log(prob(chosen | prompt) / (1 - prob(chosen | prompt)))
    log_odds_rejected = log(prob(rejected | prompt) / (1 - prob(rejected | prompt)))
    or_loss = -log_sigmoid(log_odds_chosen - log_odds_rejected)

    # Combined single-stage objective
    loss = sft_loss + beta * or_loss
    loss.backward()
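The pseudocode above can be turned into a numerically stable scalar computation. A hedged sketch, assuming `logp_chosen` and `logp_rejected` are (length-normalized) average token log-probabilities produced elsewhere, and `sft_nll` is the SFT cross-entropy on the chosen response; all three names are hypothetical inputs, not part of any real API:

```python
import math

def log_odds_from_logp(logp):
    # log(p / (1 - p)) computed stably from log p via log1p
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(logp_chosen, logp_rejected, sft_nll, beta=0.1):
    # Odds-ratio penalty: -log sigma(log odds_chosen - log odds_rejected)
    log_or = log_odds_from_logp(logp_chosen) - log_odds_from_logp(logp_rejected)
    or_penalty = math.log1p(math.exp(-log_or))  # equals -log sigmoid(log_or)
    # Single-stage objective: SFT loss plus weighted preference penalty
    return sft_nll + beta * or_penalty
```

Note that the penalty depends only on the policy's own probabilities for the two responses; no reference-model log-probabilities appear anywhere, which is why ORPO needs no frozen reference copy.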

Key insight: ORPO does not need a reference model because the odds ratio naturally penalizes the model for assigning high probability to rejected responses, without needing an external reference point.

Related Pages

Implemented By
