Principle: Hugging Face Alignment Handbook APO-Zero Preference Alignment
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A preference optimization variant whose loss anchors the rejected response's log-probability ratio toward zero, decoupling the chosen and rejected terms for a more stable, bounded alignment signal than DPO's coupled margin.
Description
Anchored Preference Optimization Zero (APO-Zero) is a variant of DPO that modifies the loss function so that the rejected response's log-probability ratio is pushed toward zero (and the chosen response's above zero) rather than optimizing only the margin between the two. Decoupling the terms in this way yields a bounded loss and a more direct preference optimization signal.
In the alignment-handbook, APO-Zero is used in the SmolLM3 multi-stage pipeline as the final preference alignment stage after mid-training and SFT. It is configured by setting loss_type: apo_zero in the DPOTrainer configuration, combined with padding-free training and Liger kernel optimization for efficiency.
APO-Zero is particularly effective in the context of advanced post-training pipelines where multiple stages of training have already shaped the model's behavior, and a lighter-weight preference optimization is desired.
Usage
Use APO-Zero when:
- The model has already undergone significant training (mid-training + SFT) and only light preference tuning is needed
- A more stable, bounded preference optimization signal is desired
- Keeping a separate reference model in memory is too expensive (DPO-style trainers can also run reference-free)
- Long preference pairs make padding-free training necessary for memory efficiency
Theoretical Basis
APO-Zero modifies the standard DPO loss by decoupling the chosen and rejected terms and anchoring each to a zero log-ratio:
# Abstract APO-Zero vs DPO comparison (schematic, not a real implementation)
# Standard DPO loss (coupled margin between chosen and rejected):
# L = -log_sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
# where log_ratio = log(pi_theta(y|x) / pi_ref(y|x))
# APO-Zero loss (decoupled, bounded terms):
# L = (1 - sigmoid(beta * log_ratio_chosen)) + sigmoid(beta * log_ratio_rejected)
# Minimizing L pushes log_ratio_chosen above zero and log_ratio_rejected
# below zero: each term is anchored at a zero log-ratio instead of being
# compared against the other.
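The schematic above can be made concrete with a small, self-contained sketch. The toy log-ratio values are hypothetical, and `apo_zero_loss` follows the form used by TRL's `apo_zero` loss type as understood here, an assumption about the library rather than code copied from the handbook:

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(lr_chosen: float, lr_rejected: float, beta: float = 0.05) -> float:
    # Standard DPO: a coupled loss on the margin between the two log-ratios.
    return -math.log(sigmoid(beta * (lr_chosen - lr_rejected)))


def apo_zero_loss(lr_chosen: float, lr_rejected: float, beta: float = 0.05) -> float:
    # APO-Zero: decoupled terms, each anchored around a zero log-ratio.
    # Both summands lie in (0, 1), so the loss is bounded in (0, 2).
    return (1.0 - sigmoid(beta * lr_chosen)) + sigmoid(beta * lr_rejected)


# Toy log-ratios (hypothetical numbers for illustration): the chosen
# response is 2 nats more likely under the policy than the reference,
# the rejected response is 1 nat less likely.
print(f"DPO:      {dpo_loss(2.0, -1.0):.4f}")
print(f"APO-Zero: {apo_zero_loss(2.0, -1.0):.4f}")
```

Note that unlike the standard DPO loss, the APO-Zero loss cannot diverge: even for extreme log-ratios it stays between 0 and 2, which is one way to read the stability claim above.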
Key differences from standard DPO:
- No coupled comparison: Chosen and rejected are optimized semi-independently
- Anchoring: Rejected responses are pushed toward zero log-ratio rather than being compared relative to chosen
- Stability: The decoupled loss is more stable for models that have already been well-trained
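The decoupling can be checked numerically: adding the same constant to both log-ratios leaves the DPO loss untouched (only the margin matters), but moves the APO-Zero loss (each term is anchored at zero). A minimal sketch under the same schematic loss forms, with hypothetical toy values:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(lr_c: float, lr_r: float, beta: float = 0.05) -> float:
    # Coupled: depends only on the margin lr_c - lr_r.
    return -math.log(sigmoid(beta * (lr_c - lr_r)))


def apo_zero_loss(lr_c: float, lr_r: float, beta: float = 0.05) -> float:
    # Decoupled: depends on the absolute position of each log-ratio.
    return (1.0 - sigmoid(beta * lr_c)) + sigmoid(beta * lr_r)


# Shift both log-ratios by the same constant (+10): the margin is
# unchanged, but each term's position relative to zero moves.
base = (2.0, -1.0)
shifted = (2.0 + 10.0, -1.0 + 10.0)
print(dpo_loss(*base) - dpo_loss(*shifted))       # margin-only: no change
print(apo_zero_loss(*base) - apo_zero_loss(*shifted))  # anchored: changes
```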
Hyperparameters in the alignment-handbook's SmolLM3 APO-Zero config:
- beta: 0.05 (higher than the 0.01 used in the handbook's standard DPO recipes)
- max_length: 24576 (shorter than SFT's 65536 due to preference pair overhead)
- padding_free: True (memory optimization for long sequences)
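Putting the values above together, the relevant part of such a recipe might look like the following YAML fragment. This is an illustrative sketch in the handbook's recipe style, not the actual SmolLM3 config; in particular the `use_liger_kernel` key is an assumed name for the Liger kernel switch:

```yaml
# Sketch of an APO-Zero preference-alignment recipe fragment (illustrative)
loss_type: apo_zero     # select the APO-Zero variant of the DPO loss
beta: 0.05              # per the hyperparameters listed above
max_length: 24576       # preference pairs are shorter than SFT sequences
padding_free: true      # pack sequences without padding to save memory
use_liger_kernel: true  # assumed key for the Liger kernel optimization
```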