Implementation:Huggingface Alignment handbook ORPOTrainer Usage

Knowledge Sources

  • Domains: NLP, Deep_Learning, Reinforcement_Learning
  • Last Updated: 2026-02-07 00:00 GMT

Overview

Concrete tool for single-stage preference alignment using TRL's ORPOTrainer, as configured by the alignment-handbook ORPO training script.

Description

ORPOTrainer is TRL's implementation of the Odds Ratio Preference Optimization (ORPO) algorithm. In the alignment-handbook, it is initialized in scripts/orpo.py with a single model (no reference model needed), a preference dataset, and training arguments. The script's structure closely mirrors the DPO script, minus the reference-model loading step.
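
As background (this summarizes the ORPO paper, Hong et al. 2024, not code in the handbook; TRL's beta plays the role of the paper's lambda), the loss is the ordinary SFT negative log-likelihood plus a log-odds-ratio penalty on each chosen/rejected pair. Both terms depend only on the policy, which is why no reference model appears:

\mathcal{L}_{\mathrm{ORPO}} = \mathbb{E}_{(x,\, y_w,\, y_l)}\left[\, \mathcal{L}_{\mathrm{SFT}} + \beta \cdot \mathcal{L}_{\mathrm{OR}} \,\right]
\mathcal{L}_{\mathrm{OR}} = -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right)
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}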

The alignment-handbook uses ORPO specifically for the Mixtral 8x22B (Zephyr-141B) recipe, demonstrating its suitability for very large models where maintaining a reference model would be prohibitively expensive.

Usage

Use this when running the ORPO alignment recipe, particularly for large models where single-stage training is preferred.

Code Reference

Source Location

  • Repository: alignment-handbook
  • File: scripts/orpo.py (lines 122-129 for ORPOTrainer init, lines 68-158 for full main function)

Signature

# From scripts/orpo.py:L122-129
trainer = ORPOTrainer(
    model,                          # AutoModelForCausalLM (no ref_model needed)
    args=training_args,             # ORPOConfig with beta, max_length, etc.
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=(
        dataset[script_args.dataset_test_split]
        if training_args.eval_strategy != "no"
        else None
    ),
    processing_class=tokenizer,     # PreTrainedTokenizer
    peft_config=get_peft_config(model_args),  # None or LoraConfig
)
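
For contrast, a minimal sketch of the analogous DPOTrainer call (a hypothetical simplification, assuming ref_model and dpo_training_args have been loaded separately); the only structural difference from the signature above is the extra frozen reference model:

from trl import DPOTrainer

dpo_trainer = DPOTrainer(
    model,
    ref_model,                      # frozen reference copy; ORPO omits this entirely
    args=dpo_training_args,         # DPOConfig rather than ORPOConfig
    train_dataset=dataset[script_args.dataset_train_split],
    processing_class=tokenizer,
)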

Import

from trl import ORPOTrainer, ModelConfig, TrlParser, get_peft_config
from alignment import ORPOConfig, ScriptArguments, get_dataset, get_model, get_tokenizer

I/O Contract

Inputs

| Name                  | Type                 | Required | Description                                                       |
|-----------------------|----------------------|----------|-------------------------------------------------------------------|
| model                 | AutoModelForCausalLM | Yes      | Pretrained model (base model, not an SFT checkpoint)              |
| args                  | ORPOConfig           | Yes      | Training hyperparameters, including beta, max_length, max_prompt_length |
| args.beta             | float                | Yes      | ORPO beta parameter (e.g., 0.05 for Zephyr-141B)                  |
| args.max_length       | int                  | Yes      | Maximum sequence length (e.g., 2048)                              |
| args.max_prompt_length | int                 | Yes      | Maximum prompt length (e.g., 1792)                                |
| train_dataset         | Dataset              | Yes      | Preference data with chosen and rejected columns                  |
| eval_dataset          | Dataset              | No       | Evaluation split                                                  |
| processing_class      | PreTrainedTokenizer  | Yes      | Tokenizer with pad_token set                                      |
| peft_config           | Optional[PeftConfig] | No       | LoRA config (None for full fine-tuning)                           |
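
To make the args rows concrete, a minimal sketch of an ORPOConfig using the Zephyr-141B values quoted in the table; the remaining fields (output_dir, batch size, learning rate) are illustrative placeholders, not the recipe's actual settings:

from alignment import ORPOConfig

training_args = ORPOConfig(
    output_dir="data/zephyr-orpo",      # hypothetical output path
    beta=0.05,                          # odds-ratio weight (lambda in the paper)
    max_length=2048,                    # prompt + completion token budget
    max_prompt_length=1792,             # prompt portion of max_length
    per_device_train_batch_size=1,      # illustrative, not the recipe value
    gradient_accumulation_steps=8,      # illustrative, not the recipe value
    learning_rate=5e-6,                 # illustrative, not the recipe value
    eval_strategy="no",                 # skip eval_dataset (see signature above)
)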

Outputs

| Name                        | Type        | Description                                              |
|-----------------------------|-------------|----------------------------------------------------------|
| trainer.train() return value | TrainOutput | Contains global_step, training_loss, metrics            |
| checkpoints                 | Files       | Saved to training_args.output_dir                        |
| metrics                     | Dict        | Training metrics (ORPO loss, SFT loss, odds-ratio metrics) |
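
A short sketch of consuming those outputs; log_metrics, save_metrics, and save_state are standard transformers.Trainer helpers, used here on the assumption the trainer was built as in the signature above:

train_result = trainer.train()
print(train_result.global_step, train_result.training_loss)

# Persist metrics and trainer state alongside the checkpoints in output_dir
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()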

Usage Examples

ORPO Training Pipeline

from alignment import ORPOConfig, ScriptArguments, get_dataset, get_model, get_tokenizer
from trl import ORPOTrainer, ModelConfig, TrlParser, get_peft_config

# 1. Parse config
parser = TrlParser((ScriptArguments, ORPOConfig, ModelConfig))
script_args, training_args, model_args = parser.parse_args_and_config()

# 2. Load model (single model, no reference needed), tokenizer, dataset
model = get_model(model_args, training_args)
tokenizer = get_tokenizer(model_args, training_args)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 3. Load preference dataset
dataset = get_dataset(script_args)
for split in dataset:
    if "messages" in dataset[split].column_names:
        dataset[split] = dataset[split].remove_columns("messages")

# 4. Initialize ORPO trainer (note: no ref_model)
trainer = ORPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset.get("test"),
    processing_class=tokenizer,
    peft_config=get_peft_config(model_args),
)

# 5. Train and save
trainer.train()
trainer.save_model(training_args.output_dir)
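
For reference, a hypothetical preference record of the shape the pipeline above expects. The exact schema depends on the dataset; ORPOTrainer keys on the chosen and rejected columns, and message lists like these are one common layout in the handbook's preference datasets:

example_row = {
    "prompt": "How do I reverse a list in Python?",  # hypothetical content
    "chosen": [                                      # preferred completion
        {"role": "user", "content": "How do I reverse a list in Python?"},
        {"role": "assistant", "content": "Use my_list[::-1] or my_list.reverse()."},
    ],
    "rejected": [                                    # dispreferred completion
        {"role": "user", "content": "How do I reverse a list in Python?"},
        {"role": "assistant", "content": "You cannot reverse lists in Python."},
    ],
}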

CLI Launch (Multi-Node FSDP)

# ORPO with FSDP for large MoE models
accelerate launch --config_file recipes/accelerate_configs/fsdp.yaml \
    scripts/orpo.py \
    --config recipes/zephyr-141b-A35b/orpo/config_full.yaml
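
The command above takes its process topology from the FSDP config file. When launching across machines explicitly, a hedged sketch using standard accelerate CLI flags (the 2x8-GPU topology and the $RANK/$MASTER_ADDR variables are assumptions, typically supplied by the job scheduler):

# Hypothetical 2-node x 8-GPU launch
accelerate launch --config_file recipes/accelerate_configs/fsdp.yaml \
    --num_machines 2 --num_processes 16 \
    --machine_rank $RANK \
    --main_process_ip $MASTER_ADDR --main_process_port 29500 \
    scripts/orpo.py \
    --config recipes/zephyr-141b-A35b/orpo/config_full.yaml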
