Principle: Hugging Face Alignment Handbook APO-Zero Preference Alignment
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A preference optimization variant whose loss anchors the rejected response's log-probability ratio toward zero, decoupling the chosen and rejected terms for a more stable, bounded alignment signal than DPO's coupled margin.
Description
Anchored Preference Optimization Zero (APO-Zero) is a variant of DPO that modifies the loss function so that the rejected response's log-probability ratio is pushed toward zero (and the chosen response's above zero) rather than optimizing only the margin between the two. Decoupling the terms in this way yields a bounded loss and a more direct preference optimization signal.
In the alignment-handbook, APO-Zero is used in the SmolLM3 multi-stage pipeline as the final preference alignment stage after mid-training and SFT. It is configured by setting loss_type: apo_zero in the DPOTrainer configuration, combined with padding-free training and Liger kernel optimization for efficiency.
APO-Zero is particularly effective in the context of advanced post-training pipelines where multiple stages of training have already shaped the model's behavior, and a lighter-weight preference optimization is desired.
Usage
Use APO-Zero when:
- The model has already undergone significant training (mid-training + SFT) and only light preference tuning is needed
- A more stable, bounded preference optimization signal is desired
- Keeping a separate reference model in memory is too expensive (DPO-style trainers can also run reference-free)
- Long preference pairs make padding-free training necessary for memory efficiency
Theoretical Basis
APO-Zero modifies the standard DPO loss by decoupling the chosen and rejected terms and anchoring each to a zero log-ratio:
# Abstract APO-Zero vs DPO comparison (schematic, not a real implementation)
# Standard DPO loss (coupled margin between chosen and rejected):
# L = -log_sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
# where log_ratio = log(pi_theta(y|x) / pi_ref(y|x))
# APO-Zero loss (decoupled, bounded terms):
# L = (1 - sigmoid(beta * log_ratio_chosen)) + sigmoid(beta * log_ratio_rejected)
# Minimizing L pushes log_ratio_chosen above zero and log_ratio_rejected
# below zero: each term is anchored at a zero log-ratio instead of being
# compared against the other.
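The schematic above can be made concrete with a small, self-contained sketch. The toy log-ratio values are hypothetical, and `apo_zero_loss` follows the form used by TRL's `apo_zero` loss type as understood here, an assumption about the library rather than code copied from the handbook:

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(lr_chosen: float, lr_rejected: float, beta: float = 0.05) -> float:
    # Standard DPO: a coupled loss on the margin between the two log-ratios.
    return -math.log(sigmoid(beta * (lr_chosen - lr_rejected)))


def apo_zero_loss(lr_chosen: float, lr_rejected: float, beta: float = 0.05) -> float:
    # APO-Zero: decoupled terms, each anchored around a zero log-ratio.
    # Both summands lie in (0, 1), so the loss is bounded in (0, 2).
    return (1.0 - sigmoid(beta * lr_chosen)) + sigmoid(beta * lr_rejected)


# Toy log-ratios (hypothetical numbers for illustration): the chosen
# response is 2 nats more likely under the policy than the reference,
# the rejected response is 1 nat less likely.
print(f"DPO:      {dpo_loss(2.0, -1.0):.4f}")
print(f"APO-Zero: {apo_zero_loss(2.0, -1.0):.4f}")
```

Note that unlike the standard DPO loss, the APO-Zero loss cannot diverge: even for extreme log-ratios it stays between 0 and 2, which is one way to read the stability claim above.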
Key differences from standard DPO:
- No coupled comparison: Chosen and rejected are optimized semi-independently
- Anchoring: Rejected responses are pushed toward zero log-ratio rather than being compared relative to chosen
- Stability: The decoupled loss is more stable for models that have already been well-trained
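The decoupling can be checked numerically: adding the same constant to both log-ratios leaves the DPO loss untouched (only the margin matters), but moves the APO-Zero loss (each term is anchored at zero). A minimal sketch under the same schematic loss forms, with hypothetical toy values:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(lr_c: float, lr_r: float, beta: float = 0.05) -> float:
    # Coupled: depends only on the margin lr_c - lr_r.
    return -math.log(sigmoid(beta * (lr_c - lr_r)))


def apo_zero_loss(lr_c: float, lr_r: float, beta: float = 0.05) -> float:
    # Decoupled: depends on the absolute position of each log-ratio.
    return (1.0 - sigmoid(beta * lr_c)) + sigmoid(beta * lr_r)


# Shift both log-ratios by the same constant (+10): the margin is
# unchanged, but each term's position relative to zero moves.
base = (2.0, -1.0)
shifted = (2.0 + 10.0, -1.0 + 10.0)
print(dpo_loss(*base) - dpo_loss(*shifted))       # margin-only: no change
print(apo_zero_loss(*base) - apo_zero_loss(*shifted))  # anchored: changes
```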
Hyperparameters in the alignment-handbook's SmolLM3 APO-Zero config:
- beta: 0.05 (higher than the 0.01 used in the handbook's standard DPO recipes)
- max_length: 24576 (shorter than SFT's 65536 due to preference pair overhead)
- padding_free: True (memory optimization for long sequences)
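Putting the values above together, the relevant part of such a recipe might look like the following YAML fragment. This is an illustrative sketch in the handbook's recipe style, not the actual SmolLM3 config; in particular the `use_liger_kernel` key is an assumed name for the Liger kernel switch:

```yaml
# Sketch of an APO-Zero preference-alignment recipe fragment (illustrative)
loss_type: apo_zero     # select the APO-Zero variant of the DPO loss
beta: 0.05              # per the hyperparameters listed above
max_length: 24576       # preference pairs are shorter than SFT sequences
padding_free: true      # pack sequences without padding to save memory
use_liger_kernel: true  # assumed key for the Liger kernel optimization
```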