Implementation: Hugging Face Alignment Handbook ORPOTrainer Usage
| Knowledge Sources | Details |
|---|---|
| Domains | NLP, Deep_Learning, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for single-stage preference alignment using TRL's ORPOTrainer, as configured by the alignment-handbook ORPO training script.
Description
ORPOTrainer is TRL's implementation of the Odds Ratio Preference Optimization algorithm. In the alignment-handbook, it is initialized in scripts/orpo.py with a single model (no reference model needed), a preference dataset, and training arguments. The script structure closely mirrors that of the DPO script, minus the reference-model loading step.
The alignment-handbook uses ORPO specifically for the Mixtral 8x22B (Zephyr-141B) recipe, demonstrating its suitability for very large models where maintaining a reference model would be prohibitively expensive.
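For intuition, the loss ORPOTrainer minimizes combines the standard SFT negative log-likelihood on the chosen response with an odds-ratio penalty that pushes the odds of the chosen completion above those of the rejected one. Below is a minimal sketch of that penalty term, assuming length-normalized per-token log-probabilities as in the ORPO paper; the function name and inputs are illustrative, not TRL's internal API.
import torch
import torch.nn.functional as F

def odds_ratio_term(chosen_logps, rejected_logps, beta=0.05):
    """Sketch of the ORPO odds-ratio penalty (not TRL's exact code).

    chosen_logps / rejected_logps: length-normalized log-probs, shape (batch,).
    """
    # odds(y|x) = p / (1 - p), with p = exp(avg per-token log-prob), so the
    # log-odds ratio between chosen and rejected completions is:
    log_odds = (chosen_logps - rejected_logps) - (
        torch.log1p(-torch.exp(chosen_logps))
        - torch.log1p(-torch.exp(rejected_logps))
    )
    # beta (lambda in the paper) weights this term against the SFT loss:
    # total_loss ≈ nll_on_chosen + odds_ratio_term(...)
    return -beta * F.logsigmoid(log_odds)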
Usage
Use this when running the ORPO alignment recipe, particularly for large models where single-stage training is preferred.
Code Reference
Source Location
- Repository: alignment-handbook
- File: scripts/orpo.py (lines 122-129 for ORPOTrainer init, lines 68-158 for full main function)
Signature
# From scripts/orpo.py:L122-129
trainer = ORPOTrainer(
    model,                           # AutoModelForCausalLM (no ref_model needed)
    args=training_args,              # ORPOConfig with beta, max_length, etc.
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=(
        dataset[script_args.dataset_test_split]
        if training_args.eval_strategy != "no"
        else None
    ),
    processing_class=tokenizer,      # PreTrainedTokenizer
    peft_config=get_peft_config(model_args),  # None or LoraConfig
)
Import
from trl import ORPOTrainer, ModelConfig, TrlParser, get_peft_config
from alignment import ORPOConfig, ScriptArguments, get_dataset, get_model, get_tokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | AutoModelForCausalLM | Yes | Pretrained model (base model, not SFT checkpoint) |
| args | ORPOConfig | Yes | Training hyperparameters including beta, max_length, max_prompt_length |
| args.beta | float | Yes | ORPO beta parameter (e.g., 0.05 for Zephyr-141B) |
| args.max_length | int | Yes | Maximum sequence length (e.g., 2048) |
| args.max_prompt_length | int | Yes | Maximum prompt length (e.g., 1792) |
| train_dataset | Dataset | Yes | Preference data with chosen and rejected columns |
| eval_dataset | Dataset | No | Evaluation split |
| processing_class | PreTrainedTokenizer | Yes | Tokenizer with pad_token set |
| peft_config | Optional[PeftConfig] | No | LoRA config (None for full fine-tuning) |
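In the script these hyperparameters arrive from the recipe YAML via TrlParser; a hand-built equivalent using the values cited above might look as follows. The output_dir, batch size, learning rate, and epoch count are placeholders, not the recipe's actual values.
from alignment import ORPOConfig

# Sketch of a manually constructed config using the parameters listed above.
training_args = ORPOConfig(
    output_dir="data/orpo-run",      # placeholder path
    beta=0.05,                       # odds-ratio weight (e.g., Zephyr-141B)
    max_length=2048,                 # cap on prompt + completion tokens
    max_prompt_length=1792,          # cap on the prompt portion
    per_device_train_batch_size=1,   # placeholder value
    gradient_accumulation_steps=8,   # placeholder value
    learning_rate=5.0e-6,            # placeholder value
    num_train_epochs=1,              # placeholder value
)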
Outputs
| Name | Type | Description |
|---|---|---|
| trainer.train() returns | TrainOutput | Contains global_step, training_loss, metrics |
| checkpoints | Files | Saved to training_args.output_dir |
| metrics | Dict | Training metrics (ORPO loss, SFT loss, odds ratio metrics) |
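A short sketch of consuming these outputs with the standard bookkeeping methods ORPOTrainer inherits from transformers.Trainer; the exact metric keys present depend on the run.
# After trainer initialization (see Usage Examples below)
train_result = trainer.train()
metrics = train_result.metrics            # dict: train_loss, runtime, etc.
trainer.log_metrics("train", metrics)     # pretty-print metrics to the logs
trainer.save_metrics("train", metrics)    # write train_results.json to output_dir
trainer.save_state()                      # trainer_state.json with log history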
Usage Examples
ORPO Training Pipeline
from alignment import ORPOConfig, ScriptArguments, get_dataset, get_model, get_tokenizer
from trl import ORPOTrainer, ModelConfig, TrlParser, get_peft_config
# 1. Parse config
parser = TrlParser((ScriptArguments, ORPOConfig, ModelConfig))
script_args, training_args, model_args = parser.parse_args_and_config()
# 2. Load model (single model, no reference needed), tokenizer, dataset
model = get_model(model_args, training_args)
tokenizer = get_tokenizer(model_args, training_args)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# 3. Load preference dataset
dataset = get_dataset(script_args)
for split in dataset:
if "messages" in dataset[split].column_names:
dataset[split] = dataset[split].remove_columns("messages")
# 4. Initialize ORPO trainer (note: no ref_model)
trainer = ORPOTrainer(
    model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset.get("test"),
    processing_class=tokenizer,
    peft_config=get_peft_config(model_args),
)
# 5. Train and save
trainer.train()
trainer.save_model(training_args.output_dir)
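The train_dataset above must carry chosen and rejected columns. A toy example of the expected shape, assuming conversational records like the handbook's preference datasets; the content here is invented for illustration.
from datasets import Dataset

# Each row pairs a preferred ("chosen") and dispreferred ("rejected") conversation.
toy_prefs = Dataset.from_list([
    {
        "chosen": [
            {"role": "user", "content": "What is ORPO?"},
            {"role": "assistant", "content": "A single-stage preference-alignment method."},
        ],
        "rejected": [
            {"role": "user", "content": "What is ORPO?"},
            {"role": "assistant", "content": "I don't know."},
        ],
    }
])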
CLI Launch (Multi-Node FSDP)
# ORPO with FSDP for large MoE models
accelerate launch --config_file recipes/accelerate_configs/fsdp.yaml \
  scripts/orpo.py \
  --config recipes/zephyr-141b-A35b/orpo/config_full.yaml