Implementation:Hpcaitech ColossalAI ORPOTrainer
| Knowledge Sources | |
|---|---|
| Domains | RLHF, Preference_Learning, ORPO |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
orpo.py implements the ORPOTrainer class for Odds Ratio Preference Optimization, a simplified preference learning algorithm that combines supervised fine-tuning loss with an odds ratio preference objective without requiring a reference model.
Description
ORPOTrainer extends SLTrainer to implement the ORPO algorithm. Each training batch contains paired chosen and rejected responses (chosen_input_ids and reject_input_ids, each with an attention mask and a loss mask). The chosen and rejected inputs are concatenated and passed through the model in a single forward pass. The trainer computes log probabilities for both sets of responses with calc_masked_log_probs, then computes OddsRatioLoss from them. The final loss combines the standard next-token prediction (NLL) loss on the chosen responses with the odds ratio loss scaled by lambda (lam).
The trainer tracks the loss, chosen rewards (mean log probability of the chosen responses), rejected rewards, reward margin, log odds ratio, and reward accuracy (the fraction of examples whose log odds ratio is greater than 0). It supports gradient accumulation, periodic checkpointing via save_checkpoint, and logging to TensorBoard and Weights & Biases. The _eval method runs evaluation without gradients and writes results to text files.
Because ORPO does not require a separate reference model, the training setup is simpler than for DPO or KTO.
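The masked log-probability step described above can be illustrated with a small pure-Python sketch. It is not the body of calc_masked_log_probs, which is not reproduced on this page; the function name masked_token_log_probs and the shift-by-one alignment between logits and labels are assumptions made for illustration.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_token_log_probs(
    logits: List[List[float]],  # [seq_len][vocab] logits for one sequence
    input_ids: List[int],       # [seq_len] token IDs
    loss_mask: List[int],       # [seq_len] 1 = response token, 0 = prompt or padding
) -> List[float]:
    """Log-probability of each next token, zeroed where the mask is 0.

    Assumption: position t predicts token t + 1, so logits[t] is paired
    with input_ids[t + 1] and loss_mask[t + 1].
    """
    out = []
    for t in range(len(input_ids) - 1):
        logp = log_softmax(logits[t])[input_ids[t + 1]]
        out.append(logp * loss_mask[t + 1])
    return out


# Toy example: vocabulary of 2 tokens, sequence of 3 tokens.
logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
input_ids = [0, 1, 0]
loss_mask = [0, 0, 1]  # only the last token belongs to the response
token_logps = masked_token_log_probs(logits, input_ids, loss_mask)
```

Only the final position contributes a non-zero value here, because the loss mask marks it as the single response token.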
Usage
Use this trainer when training a language model from paired preference data (chosen vs rejected responses) without maintaining a separate reference model. It is suitable when you want to combine supervised fine-tuning with preference optimization in a single training objective.
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalChat/coati/trainer/orpo.py
- Lines: 1-330
Signature
class ORPOTrainer(SLTrainer):
    def __init__(
        self,
        actor: Any,
        booster: Booster,
        actor_optim: Optimizer,
        plugin: Plugin,
        actor_lr_scheduler: _LRScheduler,
        tokenizer: PreTrainedTokenizerBase,
        max_epochs: int = 1,
        lam: float = 0.1,
        apply_loss_mask: bool = True,
        accumulation_steps: int = 1,
        start_epoch: int = 0,
        save_interval: int = 0,
        save_dir: str = None,
        coordinator: DistCoordinator = None,
    ) -> None
Key Methods
def _before_fit(
    self,
    train_preference_dataloader: DataLoader = None,
    eval_preference_dataloader: DataLoader = None,
    log_dir: Optional[str] = None,
    use_wandb: bool = False,
)
def _train(self, epoch: int)
def _eval(self, epoch: int)
Import
from coati.trainer.orpo import ORPOTrainer
I/O Contract
Inputs (__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| actor | Any | Yes | The actor (policy) model to train |
| booster | Booster | Yes | ColossalAI Booster for distributed training |
| actor_optim | Optimizer | Yes | Optimizer for the actor model |
| plugin | Plugin | Yes | ColossalAI plugin for parallelism strategy |
| actor_lr_scheduler | _LRScheduler | Yes | Learning rate scheduler |
| tokenizer | PreTrainedTokenizerBase | Yes | Tokenizer for encoding |
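| max_epochs | int | No | Number of training epochs (default: 1) |
| lam | float | No | Lambda parameter weighting the odds ratio loss (default: 0.1) |
| apply_loss_mask | bool | No | Whether to apply loss masking (default: True) |
| accumulation_steps | int | No | Gradient accumulation steps (default: 1) |
| start_epoch | int | No | Epoch index at which training starts (default: 0) |
| save_interval | int | No | Checkpoint saving interval in steps (default: 0, disabled) |
| save_dir | str | No | Directory where checkpoints are saved (default: None) |
| coordinator | DistCoordinator | No | ColossalAI distributed coordinator (default: None) |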
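How accumulation_steps and save_interval might interact can be sketched without any framework code. The helper below is hypothetical and only models the schedule: it assumes one optimizer step per accumulation_steps batches and that save_interval counts optimizer steps, with 0 meaning disabled as stated in the table. Whether the trainer counts steps in exactly this way is not confirmed by this page.

```python
from typing import List, Tuple


def training_schedule(
    num_batches: int,
    accumulation_steps: int = 1,
    save_interval: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (batch indices that trigger an optimizer step,
    optimizer-step counts at which a checkpoint would be saved)."""
    if accumulation_steps < 1:
        raise ValueError("accumulation_steps must be at least 1")
    optimizer_step_batches: List[int] = []
    checkpoint_steps: List[int] = []
    num_optimizer_steps = 0
    for i in range(num_batches):
        # Gradients from batch i are accumulated here; the optimizer only
        # steps once every `accumulation_steps` batches.
        if (i + 1) % accumulation_steps == 0:
            num_optimizer_steps += 1
            optimizer_step_batches.append(i)
            # Assumption: save_interval == 0 disables checkpointing.
            if save_interval > 0 and num_optimizer_steps % save_interval == 0:
                checkpoint_steps.append(num_optimizer_steps)
    return optimizer_step_batches, checkpoint_steps


steps, saves = training_schedule(num_batches=12, accumulation_steps=4, save_interval=2)
```

With 12 batches and accumulation_steps=4, this schedule steps the optimizer after batches 3, 7, and 11 (zero-based) and saves once, at the second optimizer step.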
Training Batch Format
| Name | Type | Description |
|---|---|---|
| chosen_input_ids | torch.Tensor | Token IDs for chosen (preferred) responses |
| chosen_attention_mask | torch.Tensor | Attention mask for chosen responses |
| chosen_loss_mask | torch.Tensor | Loss mask for chosen responses |
| reject_input_ids | torch.Tensor | Token IDs for rejected responses |
| reject_attention_mask | torch.Tensor | Attention mask for rejected responses |
| reject_loss_mask | torch.Tensor | Loss mask for rejected responses |
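The batch layout in the table, and the concatenation of chosen and rejected inputs mentioned in the Description, can be sketched with plain nested lists standing in for torch.Tensor values. The helper concat_chosen_and_reject is hypothetical, and it assumes that chosen and rejected sequences are padded to the same length so they can be stacked along the batch dimension.

```python
from typing import Dict, List, Tuple

Rows = List[List[int]]  # stand-in for a [batch_size, seq_len] tensor

REQUIRED_KEYS = (
    "chosen_input_ids", "chosen_attention_mask", "chosen_loss_mask",
    "reject_input_ids", "reject_attention_mask", "reject_loss_mask",
)


def shape(rows: Rows) -> Tuple[int, int]:
    """(batch_size, seq_len) of a rectangular list of rows."""
    return (len(rows), len(rows[0]) if rows else 0)


def concat_chosen_and_reject(batch: Dict[str, Rows]) -> Dict[str, Rows]:
    """Validate the six batch fields and stack chosen rows before rejected rows."""
    missing = [key for key in REQUIRED_KEYS if key not in batch]
    if missing:
        raise KeyError(f"missing batch fields: {missing}")
    for side in ("chosen", "reject"):
        expected = shape(batch[f"{side}_input_ids"])
        for suffix in ("attention_mask", "loss_mask"):
            if shape(batch[f"{side}_{suffix}"]) != expected:
                raise ValueError(f"{side}_{suffix} does not match {side}_input_ids")
    # Assumption: both sides share one padded sequence length.
    if shape(batch["chosen_input_ids"])[1] != shape(batch["reject_input_ids"])[1]:
        raise ValueError("chosen and rejected sequences must share one padded length")
    return {
        name: batch[f"chosen_{name}"] + batch[f"reject_{name}"]
        for name in ("input_ids", "attention_mask", "loss_mask")
    }


batch = {
    "chosen_input_ids": [[1, 5, 6], [1, 7, 8]],
    "chosen_attention_mask": [[1, 1, 1], [1, 1, 1]],
    "chosen_loss_mask": [[0, 1, 1], [0, 1, 1]],
    "reject_input_ids": [[1, 5, 9], [1, 7, 0]],
    "reject_attention_mask": [[1, 1, 1], [1, 1, 0]],
    "reject_loss_mask": [[0, 1, 1], [0, 1, 0]],
}
merged = concat_chosen_and_reject(batch)
```

After the merge, the first half of each field holds the chosen rows and the second half the rejected rows, so a single forward pass covers both.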
Outputs
| Name | Type | Description |
|---|---|---|
| (none) | None | Training modifies the model in-place; metrics logged to TensorBoard/W&B |
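The logged metrics listed in the Description can be sketched from per-example values. The function preference_metrics and its dictionary keys are hypothetical names; the sketch assumes the reward margin is the mean difference between chosen and rejected rewards, and it ignores how the trainer aggregates values across devices.

```python
from statistics import fmean
from typing import Dict, List


def preference_metrics(
    chosen_rewards: List[float],   # mean log probability of each chosen response
    reject_rewards: List[float],   # mean log probability of each rejected response
    log_odds_ratios: List[float],  # per-example log odds ratio
) -> Dict[str, float]:
    """Summarize one batch of preference pairs."""
    return {
        "chosen_rewards": fmean(chosen_rewards),
        "rejected_rewards": fmean(reject_rewards),
        # Assumption: margin = mean(chosen - rejected).
        "margin": fmean(c - r for c, r in zip(chosen_rewards, reject_rewards)),
        # Fraction of examples whose log odds ratio is greater than 0.
        "accuracy": sum(1 for x in log_odds_ratios if x > 0) / len(log_odds_ratios),
    }


metrics = preference_metrics(
    chosen_rewards=[-0.5, -1.0, -0.25, -2.0],
    reject_rewards=[-1.5, -0.5, -1.25, -3.0],
    log_odds_ratios=[0.8, -0.3, 1.1, 0.4],
)
```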
Loss Formula
The ORPO loss combines two objectives:
loss = chosen_nll - lam * odds_ratio_loss
where chosen_nll is the standard next-token prediction loss on the chosen responses, odds_ratio_loss is the value OddsRatioLoss computes from the chosen and rejected log probabilities, and lam is the weight passed to __init__ (default: 0.1).
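A scalar sketch may help explain the minus sign in the formula. It assumes that odds_ratio_loss is the log-sigmoid of the log odds ratio, as in the ORPO objective, so the term is never positive and subtracting it penalizes pairs where the rejected response is more likely. The actual OddsRatioLoss implementation is not reproduced on this page, and the helper names below are hypothetical.

```python
import math
from typing import Tuple


def log_odds(logp: float) -> float:
    """log(p / (1 - p)) for a log-probability logp < 0."""
    return logp - math.log1p(-math.exp(logp))


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def orpo_loss(
    chosen_nll: float,
    chosen_logp: float,   # mean log probability of the chosen response
    reject_logp: float,   # mean log probability of the rejected response
    lam: float = 0.1,
) -> Tuple[float, float]:
    """Return (loss, log_odds_ratio) for a single preference pair."""
    log_odds_ratio = log_odds(chosen_logp) - log_odds(reject_logp)
    # Assumed form of odds_ratio_loss: always <= 0, larger is better.
    odds_ratio_term = log_sigmoid(log_odds_ratio)
    loss = chosen_nll - lam * odds_ratio_term
    return loss, log_odds_ratio


# The chosen response has probability 0.5 and the rejected one 0.2.
good_loss, good_ratio = orpo_loss(1.0, math.log(0.5), math.log(0.2))
# Swapping the two makes the preference term worse, so the loss rises.
bad_loss, bad_ratio = orpo_loss(1.0, math.log(0.2), math.log(0.5))
```

Under this assumption, a positive log odds ratio also corresponds to an example counted as correct in the reward accuracy metric.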
Usage Examples
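from coati.trainer.orpo import ORPOTrainer
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

# actor_model, optimizer, lr_scheduler, tokenizer, coordinator, train_dataloader,
# and eval_dataloader are assumed to be created earlier in the training script.

# Parallelism strategy and booster for distributed training.
plugin = HybridParallelPlugin(tp_size=1, pp_size=1, zero_stage=2)
booster = Booster(plugin=plugin)

trainer = ORPOTrainer(
    actor=actor_model,
    booster=booster,
    actor_optim=optimizer,
    plugin=plugin,
    actor_lr_scheduler=lr_scheduler,
    tokenizer=tokenizer,
    max_epochs=3,
    lam=0.1,                # weight of the odds ratio loss
    accumulation_steps=4,   # gradient accumulation steps
    save_interval=500,      # checkpoint interval in steps
    save_dir="./checkpoints/orpo",
    coordinator=coordinator,
)

# Train on paired preference data, evaluating on the held-out dataloader.
trainer.fit(
    train_preference_dataloader=train_dataloader,
    eval_preference_dataloader=eval_dataloader,
    log_dir="./logs",
    use_wandb=True,
)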