
Principle:ContextualAI HALOs Preference Alignment

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, NLP, Reinforcement_Learning
Last Updated 2026-02-08 03:00 GMT

Overview

A family of algorithms that steer a language model's outputs toward human preferences by optimizing on feedback data relative to a reference policy.

Description

Preference alignment is the second stage of the LLM alignment pipeline, applied after supervised fine-tuning. Rather than training on a single target response, alignment methods use feedback signals (paired preferences, binary labels, or scalar rewards) to adjust the model so that it produces outputs more consistent with human values.

The HALOs framework implements multiple alignment methods, each corresponding to a different loss function:

  • DPO (Direct Preference Optimization) - Optimizes paired preferences via a contrastive sigmoid loss on log-probability ratios
  • KTO (Kahneman-Tversky Optimization) - Uses binary (desirable/undesirable) feedback with prospect-theoretic loss
  • GRPO (Group Relative Policy Optimization) - Uses group-level advantages from scored completions with clipped ratio objectives
  • PPO (Proximal Policy Optimization) - Online RL with a learned value function and reward model
  • CDPO, IPO, SimPO, SLiC - Variant paired preference methods with different loss formulations

All methods share a common architecture: a policy model being optimized and (optionally) a frozen reference model whose log probabilities serve as a regularization anchor, preventing the policy from drifting too far from its starting point.
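This shared structure can be sketched as a single helper (a minimal illustration, assuming per-token log-probabilities have already been gathered from both models; the function name is not part of the HALOs API):

```python
def sequence_log_ratio(policy_token_logprobs, ref_token_logprobs):
    """Sequence-level log-ratio: log pi_theta(y|x) - log pi_ref(y|x).

    Each argument is a list of per-token log-probabilities for the same
    response y. The reference model is frozen, so in a real training
    loop only the policy terms would carry gradients.
    """
    return sum(policy_token_logprobs) - sum(ref_token_logprobs)
```

This log-ratio is the quantity that the methods below regularize or optimize against.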

Usage

Use preference alignment after SFT training, when you have a preference or feedback dataset. Choose the specific method based on your data format:

  • Paired preferences (response A > response B) → DPO, CDPO, IPO, SimPO, SLiC
  • Binary feedback (desirable/undesirable) → KTO
  • Scored completions (reward per response) → GRPO
  • Online reward model → PPO
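The mapping above can be written as a small lookup table (the dictionary and function are illustrative, not HALOs configuration keys):

```python
# Candidate loss functions keyed by feedback-data format.
LOSSES_BY_FORMAT = {
    "paired_preferences": ["dpo", "cdpo", "ipo", "simpo", "slic"],
    "binary_feedback": ["kto"],
    "scored_completions": ["grpo"],
    "online_reward_model": ["ppo"],
}

def candidate_losses(data_format):
    """Return the alignment methods compatible with a data format."""
    if data_format not in LOSSES_BY_FORMAT:
        raise ValueError(f"unknown data format: {data_format!r}")
    return LOSSES_BY_FORMAT[data_format]
```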

Theoretical Basis

DPO Loss

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta\left[\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right)$$
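A minimal numeric sketch of this loss, assuming sequence log-probabilities are already available (written with `log1p` as a numerically stable form of `-log(sigmoid(z))`):

```python
import math

def dpo_loss(policy_w, ref_w, policy_l, ref_l, beta=0.1):
    """DPO loss for one preference pair.

    Arguments are sequence log-probabilities: policy_* from the model
    being trained, ref_* from the frozen reference; *_w is the chosen
    response y_w, *_l the rejected response y_l.
    """
    margin = (policy_w - ref_w) - (policy_l - ref_l)
    # -log(sigmoid(beta * margin)) == log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))
```

When the chosen and rejected log-ratios are equal the loss is log 2; it shrinks as the chosen response gains probability relative to the rejected one.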

KTO Loss

For desirable outputs:

$$\mathcal{L}_{\text{KTO}}^{+} = 1 - \sigma\!\left(\beta\left[r_\theta(x,y) - \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})\right]\right)$$

For undesirable outputs:

$$\mathcal{L}_{\text{KTO}}^{-} = 1 - \sigma\!\left(\beta\left[\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}}) - r_\theta(x,y)\right]\right)$$

Where $r_\theta(x,y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the implicit reward.
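A per-example sketch of the two cases, assuming the KL term is supplied as a precomputed scalar (in practice it is estimated at the batch level rather than per example):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(policy_logprob, ref_logprob, kl_estimate, desirable, beta=0.1):
    """KTO loss for one example with binary feedback.

    r is the implicit reward log pi_theta(y|x) - log pi_ref(y|x);
    kl_estimate stands in for the KL(pi_theta || pi_ref) reference point.
    """
    r = policy_logprob - ref_logprob
    if desirable:
        return 1.0 - sigmoid(beta * (r - kl_estimate))
    return 1.0 - sigmoid(beta * (kl_estimate - r))
```

At the reference point (r equal to the KL estimate) both branches give a loss of 0.5; desirable examples are pushed above it and undesirable ones below.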

GRPO Loss

$$\mathcal{L}_{\text{GRPO}} = -\frac{1}{G}\sum_{i=1}^{G} \hat{A}_i \,\min\!\left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\text{ref}}(y_i \mid x)},\ \mathrm{clip}\!\left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\text{ref}}(y_i \mid x)},\, 1-\epsilon,\, 1+\epsilon\right)\right)$$

Where A^i are group-normalized advantages.
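A sketch of the group computation, following the formula on this page (advantage outside the min; sequence log-probabilities and rewards assumed given):

```python
import math

def grpo_loss(policy_logprobs, ref_logprobs, rewards, eps=0.2):
    """GRPO loss over a group of G scored completions for one prompt.

    policy_logprobs / ref_logprobs are sequence log-probabilities of
    each completion y_i; rewards are the scalar scores. Advantages are
    normalized within the group to mean 0 and unit variance.
    """
    G = len(rewards)
    mean = sum(rewards) / G
    var = sum((r - mean) ** 2 for r in rewards) / G
    std = math.sqrt(var) if var > 0 else 1.0  # guard degenerate groups
    advantages = [(r - mean) / std for r in rewards]
    total = 0.0
    for lp, rlp, a in zip(policy_logprobs, ref_logprobs, advantages):
        ratio = math.exp(lp - rlp)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += a * min(ratio, clipped)
    return -total / G
```

Raising the probability of the higher-reward completion in a group lowers the loss, while the clip range bounds how much any one ratio can contribute.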

Humanline Clamping

The HALOs framework introduces humanline clamping, which restricts per-token log-probability ratios to a bounded range, preventing extreme probability shifts on individual tokens.
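A minimal sketch of the idea (the bound value and function name are illustrative; the framework's actual clamp range is not stated here):

```python
def clamped_sequence_log_ratio(policy_token_logprobs, ref_token_logprobs,
                               bound=2.0):
    """Sum per-token log-ratios after clamping each into [-bound, bound].

    Clamping per token keeps any single token's probability shift from
    dominating the sequence-level ratio used by the alignment loss.
    """
    total = 0.0
    for p, r in zip(policy_token_logprobs, ref_token_logprobs):
        delta = p - r
        total += max(-bound, min(bound, delta))
    return total
```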
