
Principle:Huggingface Alignment handbook APO Zero Preference Alignment

From Leeroopedia


Knowledge Sources
Domains NLP, Deep_Learning, Reinforcement_Learning
Last Updated 2026-02-07 00:00 GMT

Overview

A preference optimization variant that anchors the rejected response's log-probability ratio to zero in the loss function, enabling more stable alignment without a separate reference model.

Description

Anchored Preference Optimization Zero (APO-Zero) is a variant of DPO that modifies the loss function to anchor the rejected response's log-probability ratio to zero rather than comparing it against the chosen response's ratio. This simplifies training by removing the need for a reference model while preserving the preference-optimization signal.

In the alignment-handbook, APO-Zero is used in the SmolLM3 multi-stage pipeline as the final preference alignment stage after mid-training and SFT. It is configured by setting loss_type: apo_zero in the DPOTrainer configuration, combined with padding-free training and Liger kernel optimization for efficiency.

APO-Zero is particularly effective in advanced post-training pipelines where multiple earlier stages have already shaped the model's behavior and a lighter-weight preference optimization step is desired.

Usage

Use APO-Zero when:

  • A reference model is not available or too expensive to maintain in memory
  • The model has already undergone significant training (mid-training + SFT)
  • A more stable preference optimization signal is desired
  • Training is combined with padding-free batching for memory efficiency

Theoretical Basis

APO-Zero modifies the standard DPO loss by anchoring the rejected term:

# Abstract APO-Zero vs DPO comparison (NOT real implementation)

# Standard DPO loss (coupled comparison):
# L = -log_sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
# where log_ratio = log(pi_theta(y|x) / pi_ref(y|x))

# APO-Zero loss (decoupled terms, matching TRL's loss_type="apo_zero"):
# L = (1 - sigmoid(beta * log_ratio_chosen)) + sigmoid(beta * log_ratio_rejected)
# The first term pushes the chosen log-ratio up; the second penalizes any
# positive rejected log-ratio, anchoring it at zero rather than comparing
# it against the chosen response
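The decoupling can be made concrete with a small numeric sketch. The snippet below is plain Python with no trainer dependencies; the sigmoid-based APO-Zero form follows TRL's `loss_type="apo_zero"`, while the log-ratio values themselves are made up for illustration. Note that DPO's coupled loss depends only on the margin between chosen and rejected, so shifting both log-ratios by the same amount leaves it unchanged, whereas APO-Zero's anchored loss responds to the absolute position of each log-ratio:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(beta: float, lr_chosen: float, lr_rejected: float) -> float:
    # Coupled: only the margin (chosen - rejected) matters
    return -math.log(sigmoid(beta * (lr_chosen - lr_rejected)))

def apo_zero_loss(beta: float, lr_chosen: float, lr_rejected: float) -> float:
    # Decoupled: each log-ratio is scored against the zero anchor
    return (1.0 - sigmoid(beta * lr_chosen)) + sigmoid(beta * lr_rejected)

beta = 0.05
# Two preference pairs with the SAME margin (2.0) but shifted log-ratios:
dpo_a = dpo_loss(beta, 3.0, 1.0)
dpo_b = dpo_loss(beta, 1.0, -1.0)
# DPO cannot tell these apart, so dpo_a == dpo_b.

apo_a = apo_zero_loss(beta, 3.0, 1.0)
apo_b = apo_zero_loss(beta, 1.0, -1.0)
# APO-Zero can: the pair whose rejected log-ratio sits above the zero
# anchor (apo_a) is penalized more than the one already below it (apo_b).
```

This illustrates why the anchored loss can remain informative late in a pipeline, when chosen and rejected log-ratios may both have drifted in the same direction.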

Key differences from standard DPO:

  • No coupled comparison: Chosen and rejected are optimized semi-independently
  • Anchoring: Rejected responses are pushed toward zero log-ratio rather than being compared relative to chosen
  • Stability: The decoupled loss is more stable for models that have already been well-trained

Hyperparameters in the alignment-handbook's SmolLM3 APO-Zero config:

  • beta: 0.05 (higher than standard DPO's 0.01, compensating for no reference model)
  • max_length: 24576 (shorter than SFT's 65536 due to preference pair overhead)
  • padding_free: True (memory optimization for long sequences)
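Pieced together from the values above, a recipe fragment for such a run might look like the following. This is an illustrative sketch, not a verbatim copy of the handbook's SmolLM3 config; the key names follow alignment-handbook / TRL `DPOConfig` conventions, and the Liger flag name in particular is an assumption:

```yaml
# Illustrative DPO recipe fragment (values from this page;
# NOT the verbatim alignment-handbook SmolLM3 config)
loss_type: apo_zero
beta: 0.05
max_length: 24576
padding_free: true
use_liger_kernel: true   # assumed flag name for the Liger kernel optimization
```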

Related Pages

Implemented By
