
Principle:ContextualAI HALOs Preference Alignment

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, NLP, Reinforcement_Learning
Last Updated 2026-02-08 03:00 GMT

Overview

A family of algorithms that steer a language model's outputs toward human preferences by optimizing on feedback data relative to a reference policy.

Description

Preference alignment is the second stage of the LLM alignment pipeline, applied after supervised fine-tuning. Rather than training on a single target response, alignment methods use feedback signals (paired preferences, binary labels, or scalar rewards) to adjust the model so that it produces outputs more consistent with human values.

The HALOs framework implements multiple alignment methods, each corresponding to a different loss function:

  • DPO (Direct Preference Optimization) - Optimizes paired preferences via a contrastive sigmoid loss on log-probability ratios
  • KTO (Kahneman-Tversky Optimization) - Uses binary (desirable/undesirable) feedback with prospect-theoretic loss
  • GRPO (Group Relative Policy Optimization) - Uses group-level advantages from scored completions with clipped ratio objectives
  • PPO (Proximal Policy Optimization) - Online RL with a learned value function and reward model
  • CDPO, IPO, SimPO, SLiC - Variant paired preference methods with different loss formulations

All methods share a common architecture: a policy model being optimized and (optionally) a frozen reference model whose log probabilities serve as a regularization anchor, preventing the policy from drifting too far from its starting point.
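This shared structure can be sketched as a single helper (a minimal illustration, assuming per-token log-probabilities have already been gathered from both models; the function name is not part of the HALOs API):

```python
def sequence_log_ratio(policy_token_logprobs, ref_token_logprobs):
    """Sequence-level log-ratio: log pi_theta(y|x) - log pi_ref(y|x).

    Each argument is a list of per-token log-probabilities for the same
    response y. The reference model is frozen, so in a real training
    loop only the policy terms would carry gradients.
    """
    return sum(policy_token_logprobs) - sum(ref_token_logprobs)
```

This log-ratio is the quantity that the methods below regularize or optimize against.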

Usage

Use preference alignment after SFT training, when you have a preference or feedback dataset. Choose the specific method based on your data format:

  • Paired preferences (response A > response B) → DPO, CDPO, IPO, SimPO, SLiC
  • Binary feedback (desirable/undesirable) → KTO
  • Scored completions (reward per response) → GRPO
  • Online reward model → PPO
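The mapping above can be written as a small lookup table (the dictionary and function are illustrative, not HALOs configuration keys):

```python
# Candidate loss functions keyed by feedback-data format.
LOSSES_BY_FORMAT = {
    "paired_preferences": ["dpo", "cdpo", "ipo", "simpo", "slic"],
    "binary_feedback": ["kto"],
    "scored_completions": ["grpo"],
    "online_reward_model": ["ppo"],
}

def candidate_losses(data_format):
    """Return the alignment methods compatible with a data format."""
    if data_format not in LOSSES_BY_FORMAT:
        raise ValueError(f"unknown data format: {data_format!r}")
    return LOSSES_BY_FORMAT[data_format]
```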

Theoretical Basis

DPO Loss

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta\left[\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right)$$
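A minimal numeric sketch of this loss, assuming sequence log-probabilities are already available (written with `log1p` as a numerically stable form of `-log(sigmoid(z))`):

```python
import math

def dpo_loss(policy_w, ref_w, policy_l, ref_l, beta=0.1):
    """DPO loss for one preference pair.

    Arguments are sequence log-probabilities: policy_* from the model
    being trained, ref_* from the frozen reference; *_w is the chosen
    response y_w, *_l the rejected response y_l.
    """
    margin = (policy_w - ref_w) - (policy_l - ref_l)
    # -log(sigmoid(beta * margin)) == log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))
```

When the chosen and rejected log-ratios are equal the loss is log 2; it shrinks as the chosen response gains probability relative to the rejected one.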

KTO Loss

For desirable outputs:

$$\mathcal{L}_{\text{KTO}}^{+} = 1 - \sigma\!\left(\beta\left[r_\theta(x,y) - \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})\right]\right)$$

For undesirable outputs:

$$\mathcal{L}_{\text{KTO}}^{-} = 1 - \sigma\!\left(\beta\left[\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}}) - r_\theta(x,y)\right]\right)$$

Where $r_\theta(x,y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the implicit reward.
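A per-example sketch of the two cases, assuming the KL term is supplied as a precomputed scalar (in practice it is estimated at the batch level rather than per example):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(policy_logprob, ref_logprob, kl_estimate, desirable, beta=0.1):
    """KTO loss for one example with binary feedback.

    r is the implicit reward log pi_theta(y|x) - log pi_ref(y|x);
    kl_estimate stands in for the KL(pi_theta || pi_ref) reference point.
    """
    r = policy_logprob - ref_logprob
    if desirable:
        return 1.0 - sigmoid(beta * (r - kl_estimate))
    return 1.0 - sigmoid(beta * (kl_estimate - r))
```

At the reference point (r equal to the KL estimate) both branches give a loss of 0.5; desirable examples are pushed above it and undesirable ones below.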

GRPO Loss

$$\mathcal{L}_{\text{GRPO}} = -\frac{1}{G}\sum_{i=1}^{G} \hat{A}_i \,\min\!\left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\text{ref}}(y_i \mid x)},\ \mathrm{clip}\!\left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\text{ref}}(y_i \mid x)},\, 1-\epsilon,\, 1+\epsilon\right)\right)$$

Where A^i are group-normalized advantages.
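A sketch of the group computation, following the formula on this page (advantage outside the min; sequence log-probabilities and rewards assumed given):

```python
import math

def grpo_loss(policy_logprobs, ref_logprobs, rewards, eps=0.2):
    """GRPO loss over a group of G scored completions for one prompt.

    policy_logprobs / ref_logprobs are sequence log-probabilities of
    each completion y_i; rewards are the scalar scores. Advantages are
    normalized within the group to mean 0 and unit variance.
    """
    G = len(rewards)
    mean = sum(rewards) / G
    var = sum((r - mean) ** 2 for r in rewards) / G
    std = math.sqrt(var) if var > 0 else 1.0  # guard degenerate groups
    advantages = [(r - mean) / std for r in rewards]
    total = 0.0
    for lp, rlp, a in zip(policy_logprobs, ref_logprobs, advantages):
        ratio = math.exp(lp - rlp)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += a * min(ratio, clipped)
    return -total / G
```

Raising the probability of the higher-reward completion in a group lowers the loss, while the clip range bounds how much any one ratio can contribute.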

Humanline Clamping

The HALOs framework introduces humanline clamping, which restricts per-token log-probability ratios to a bounded range, preventing extreme probability shifts on individual tokens.
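A minimal sketch of the idea (the bound value and function name are illustrative; the framework's actual clamp range is not stated here):

```python
def clamped_sequence_log_ratio(policy_token_logprobs, ref_token_logprobs,
                               bound=2.0):
    """Sum per-token log-ratios after clamping each into [-bound, bound].

    Clamping per token keeps any single token's probability shift from
    dominating the sequence-level ratio used by the alignment loss.
    """
    total = 0.0
    for p, r in zip(policy_token_logprobs, ref_token_logprobs):
        delta = p - r
        total += max(-bound, min(bound, delta))
    return total
```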
