Principle: ContextualAI HALOs Preference Alignment
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
A family of algorithms that steer a language model's outputs toward human preferences by optimizing a feedback-driven loss, regularized toward a reference policy.
Description
Preference alignment is the second stage of the LLM alignment pipeline, applied after supervised fine-tuning. Rather than training on a single target response, alignment methods use feedback signals (paired preferences, binary labels, or scalar rewards) to adjust the model so that it produces outputs more consistent with human values.
The HALOs framework implements multiple alignment methods, each corresponding to a different loss function:
- DPO (Direct Preference Optimization) - Optimizes paired preferences via a contrastive sigmoid loss on log-probability ratios
- KTO (Kahneman-Tversky Optimization) - Uses binary (desirable/undesirable) feedback with prospect-theoretic loss
- GRPO (Group Relative Policy Optimization) - Uses group-level advantages from scored completions with clipped ratio objectives
- PPO (Proximal Policy Optimization) - Online RL with a learned value function and reward model
- CDPO, IPO, SimPO, SLiC - Variant paired preference methods with different loss formulations
All methods share a common architecture: a policy model being optimized and (optionally) a frozen reference model whose log probabilities serve as a regularization anchor to prevent the policy from diverging too far.
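This shared policy-vs-reference computation can be sketched in plain Python over scalar per-sequence log probabilities. This is an illustrative helper, not the HALOs API; the name `implicit_reward` and the default `beta` are assumptions:

```python
def implicit_reward(policy_logprob: float, ref_logprob: float, beta: float = 0.1) -> float:
    """Beta-scaled log-probability ratio between the policy being optimized
    and the frozen reference model. Zero when the two models agree, so the
    reference acts as a regularization anchor against divergence."""
    return beta * (policy_logprob - ref_logprob)

# Policy agrees with reference -> ratio is zero, no pressure to move.
implicit_reward(-12.0, -12.0)
```

Every loss below is built from this quantity: the methods differ only in how they turn one or more of these log-ratios into a training signal.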
Usage
Use preference alignment after SFT training, when you have a preference or feedback dataset. Choose the specific method based on your data format:
- Paired preferences (response A > response B) → DPO, CDPO, IPO, SimPO, SLiC
- Binary feedback (desirable/undesirable) → KTO
- Scored completions (reward per response) → GRPO
- Online reward model → PPO
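The routing above can be made concrete with hypothetical record shapes; the field names are illustrative, not the HALOs dataset schema:

```python
# Hypothetical record shapes for each feedback format (field names illustrative).
paired = {"prompt": "p", "chosen": "response A", "rejected": "response B"}          # paired preferences
binary = {"prompt": "p", "completion": "response", "label": True}                   # binary feedback
scored = {"prompt": "p", "completions": ["a", "b", "c"], "rewards": [1.0, 0.0, 0.5]}  # scored group

def method_for(record: dict) -> str:
    """Route a feedback record to an alignment method family by its fields."""
    if "chosen" in record and "rejected" in record:
        return "paired (DPO/CDPO/IPO/SimPO/SLiC)"
    if "label" in record:
        return "binary (KTO)"
    if "rewards" in record:
        return "scored (GRPO)"
    raise ValueError("unknown feedback format")
```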
Theoretical Basis
DPO Loss
$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$
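A minimal scalar sketch of the DPO loss, assuming the per-sequence log probabilities have already been summed over tokens (pure Python, not the HALOs implementation):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(pol_w: float, ref_w: float, pol_l: float, ref_l: float, beta: float = 0.1) -> float:
    """Contrastive sigmoid loss on the difference of beta-scaled log-ratios:
    winner's implicit reward minus loser's implicit reward."""
    margin = beta * (pol_w - ref_w) - beta * (pol_l - ref_l)
    return -math.log(sigmoid(margin))

# When policy and reference agree on both responses, the margin is 0
# and the loss is log(2) ~= 0.693.
dpo_loss(-10.0, -10.0, -12.0, -12.0)
```

Note the loss depends only on log-ratio differences, so a winner the policy already prefers (relative to the reference) drives the loss below log 2.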
KTO Loss
$$\mathcal{L}_{\mathrm{KTO}} = \mathbb{E}_{(x,\,y)\sim\mathcal{D}}\left[\lambda_y - v(x, y)\right]$$
For desirable outputs: $v(x, y) = \lambda_D\,\sigma\!\left(r_\theta(x, y) - z_{\mathrm{ref}}\right)$. For undesirable outputs: $v(x, y) = \lambda_U\,\sigma\!\left(z_{\mathrm{ref}} - r_\theta(x, y)\right)$.
Where $r_\theta(x, y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ is the implicit reward and $z_{\mathrm{ref}}$ is a reference point estimated from the policy-reference KL divergence.
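A scalar sketch of the prospect-theoretic value function, under the same log-probability conventions; here the reference point `z_ref` is passed in as a precomputed constant, whereas in practice it is estimated from a KL term over a batch:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(pol: float, ref: float, desirable: bool,
             z_ref: float = 0.0, beta: float = 0.1,
             lam_d: float = 1.0, lam_u: float = 1.0) -> float:
    """KTO loss for one example: desirable outputs are pushed above the
    reference point z_ref, undesirable outputs below it."""
    r = beta * (pol - ref)  # implicit reward (beta-scaled log-ratio)
    if desirable:
        return lam_d - lam_d * sigmoid(r - z_ref)
    return lam_u - lam_u * sigmoid(z_ref - r)
```

The sigmoid gives the characteristic diminishing sensitivity of prospect theory: pushing an already well-separated example further changes the loss very little.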
GRPO Loss
$$\mathcal{L}_{\mathrm{GRPO}} = -\,\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\!\left(\rho_{i,t}A_i,\;\mathrm{clip}\!\left(\rho_{i,t},\,1-\epsilon,\,1+\epsilon\right)A_i\right)\right]$$
Where $A_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}$ are group-normalized advantages over the $G$ scored completions for a prompt, and $\rho_{i,t} = \frac{\pi_\theta(y_{i,t}\mid x,\,y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid x,\,y_{i,<t})}$ is the per-token probability ratio.
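The two GRPO ingredients (group normalization and the clipped ratio objective) can be sketched separately in plain Python; the zero-variance guard is an assumption for the degenerate all-equal-rewards case:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within a group of completions for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard: zero variance -> zero advantages
    return [(r - mean) / std for r in rewards]

def grpo_token_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped per-token objective (to be maximized; the loss is its negation)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Taking the `min` with the clipped term caps how much credit a token can earn from a large probability ratio, the same pessimistic clipping PPO uses, but with advantages computed from group statistics instead of a learned value function.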
Humanline Clamping
The HALOs framework introduces humanline clamping, which restricts per-token log-probability ratios to a bounded range, preventing extreme probability shifts on individual tokens.
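The clamping step can be sketched as below; the bound value and the helper name are hypothetical, not the framework's actual parameters:

```python
def clamp_log_ratio(pol_token_logprob: float, ref_token_logprob: float,
                    bound: float = 2.0) -> float:
    """Humanline-style clamp (illustrative): restrict a per-token
    log-probability ratio to [-bound, bound] before it enters the loss,
    so no single token can produce an extreme probability shift."""
    log_ratio = pol_token_logprob - ref_token_logprob
    return max(-bound, min(bound, log_ratio))
```

Ratios already inside the bounded range pass through unchanged, so clamping only affects the outlier tokens it is meant to tame.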