Implementation: Hugging Face Alignment Handbook DPOTrainer APO-Zero
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for APO-Zero preference alignment using TRL's DPOTrainer with loss_type: apo_zero, as configured by the alignment-handbook SmolLM3 recipe.
Description
DPOTrainer with loss_type="apo_zero" activates the Anchored Preference Optimization Zero loss variant. In the alignment-handbook's SmolLM3 pipeline, this is combined with padding_free=True and use_liger_kernel=True for maximum memory efficiency when training on long sequences.
This is the same DPOTrainer class used for standard DPO, configured specifically for the APO-Zero loss and these advanced training features.
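As a minimal sketch of what the APO-Zero objective does per preference pair (consistent with the Anchored Preference Optimization paper's "zero" variant; consult TRL's `dpo_trainer` source for the exact batched implementation), each log-ratio is anchored at zero rather than at the chosen/rejected difference used by DPO:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def apo_zero_loss(chosen_logratio: float, rejected_logratio: float,
                  beta: float = 0.05) -> float:
    """Illustrative APO-Zero loss for a single preference pair.

    chosen_logratio   = policy_logp(chosen)   - ref_logp(chosen)
    rejected_logratio = policy_logp(rejected) - ref_logp(rejected)

    Unlike DPO, which compares the two log-ratios against each other,
    APO-Zero anchors each one against zero independently.
    """
    loss_chosen = 1.0 - sigmoid(beta * chosen_logratio)  # push chosen log-ratio up
    loss_rejected = sigmoid(beta * rejected_logratio)    # push rejected log-ratio down
    return loss_chosen + loss_rejected

# At the anchor (both log-ratios zero) the loss is exactly 1.0.
print(apo_zero_loss(0.0, 0.0))  # → 1.0
```

When the policy has raised the chosen log-ratio and lowered the rejected one, the loss approaches zero, which is the behavior the anchoring is designed to produce.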
Usage
Use this for the preference alignment stage of advanced multi-stage post-training pipelines, particularly when reference-model-free optimization and padding-free training are desired.
Code Reference
Source Location
- Repository: alignment-handbook
- File: scripts/dpo.py (lines 122-130 for DPOTrainer init)
- Config: recipes/smollm3/dpo/apo.yaml (lines 1-66)
Signature
```python
# Same DPOTrainer class, configured with the APO-Zero loss
trainer = DPOTrainer(
    model,              # SFT checkpoint model
    ref_model,          # reference model (may be None for APO-Zero)
    args=training_args, # DPOConfig with loss_type="apo_zero"
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=...,
    processing_class=tokenizer,
    peft_config=get_peft_config(model_args),
)
```
Import
```python
from trl import DPOTrainer, ModelConfig, TrlParser, get_peft_config
from alignment import DPOConfig, ScriptArguments, get_dataset, get_model, get_tokenizer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | AutoModelForCausalLM | Yes | SFT checkpoint model (e.g., SmolLM3-3B-SFT) |
| ref_model | AutoModelForCausalLM | No | Reference model; may be None for APO-Zero (e.g., when training with PEFT) |
| args | DPOConfig | Yes | Training config with loss_type="apo_zero" |
| args.loss_type | str | Yes | Must be "apo_zero" for APO-Zero loss |
| args.beta | float | Yes | APO-Zero beta parameter (e.g., 0.05) |
| args.padding_free | bool | No | Enable padding-free training (True for SmolLM3) |
| args.use_liger_kernel | bool | No | Enable Liger kernel optimization (True for SmolLM3) |
| args.max_length | int | Yes | Max sequence length (e.g., 24576) |
| train_dataset | Dataset | Yes | Preference data with chosen/rejected columns and chat_template_kwargs |
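The required/optional split above can be sanity-checked before launch with a small helper (a sketch against the table's field names; `validate_apo_args` is a hypothetical function, not part of the handbook or TRL):

```python
# Required config fields per the I/O contract table above.
REQUIRED = {"loss_type", "beta", "max_length"}

def validate_apo_args(args: dict) -> list[str]:
    """Return a list of problems with an APO-Zero DPOConfig-style dict."""
    problems = [f"missing required field: {k}" for k in sorted(REQUIRED - args.keys())]
    if args.get("loss_type") not in (None, "apo_zero"):
        problems.append("loss_type must be 'apo_zero' for APO-Zero training")
    return problems

# A config missing max_length and using the wrong loss_type yields two problems.
print(validate_apo_args({"loss_type": "dpo", "beta": 0.05}))
```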
Outputs
| Name | Type | Description |
|---|---|---|
| trainer.train() returns | TrainOutput | Training metrics including APO-Zero loss |
| checkpoints | Files | Saved to training_args.output_dir (e.g., data/SmolLM3-DPO) |
Usage Examples
APO-Zero YAML Config (SmolLM3)
```yaml
# From recipes/smollm3/dpo/apo.yaml
model_name_or_path: HuggingFaceTB/SmolLM3-3B-checkpoints
model_revision: it-SFT
torch_dtype: bfloat16
attn_implementation: flash_attention_2
trust_remote_code: true

# APO-Zero specific settings
loss_type: apo_zero
beta: 0.05
padding_free: true
use_liger_kernel: true

# Training settings
max_length: 24576
learning_rate: 3.0e-7
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
gradient_checkpointing: true
output_dir: data/SmolLM3-DPO

# Dataset mixture (2 preference splits)
dataset_mixture:
  datasets:
    - id: HuggingFaceTB/smoltalk2
      config: ultrafeedback_think
      columns: [chosen, rejected, prompt, chat_template_kwargs]
      weight: 1.0
    - id: HuggingFaceTB/smoltalk2
      config: ultrafeedback_no_think
      columns: [chosen, rejected, prompt, chat_template_kwargs]
      weight: 1.0
  seed: 42
```
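The mixture semantics above (equal weights, fixed seed) can be illustrated with a hedged sketch; `mix_datasets` is a hypothetical helper for exposition, not the handbook's actual loader:

```python
import random

def mix_datasets(splits: dict, weights: dict, seed: int = 42) -> list:
    """Take a weighted fraction of each split, then shuffle deterministically."""
    mixed = []
    for name, rows in splits.items():
        k = int(len(rows) * weights.get(name, 1.0))  # weight 1.0 keeps the full split
        mixed.extend(rows[:k])
    random.Random(seed).shuffle(mixed)  # fixed seed -> reproducible ordering
    return mixed

think = [("ultrafeedback_think", i) for i in range(4)]
no_think = [("ultrafeedback_no_think", i) for i in range(4)]
mixed = mix_datasets({"think": think, "no_think": no_think},
                     {"think": 1.0, "no_think": 1.0}, seed=42)
print(len(mixed))  # → 8
```

With both weights at 1.0, every row from both splits is kept; the seed only fixes the interleaving order.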
CLI Launch
```shell
# APO-Zero DPO with DeepSpeed ZeRO-3 on 8 nodes
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    --num_machines 8 \
    scripts/dpo.py \
    --config recipes/smollm3/dpo/apo.yaml
```
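Assuming 8 GPUs per node (an assumption; the recipe and launch command do not pin the per-node GPU count), the global effective batch size under this config works out as:

```python
# Values from recipes/smollm3/dpo/apo.yaml and the launch command above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_machines = 8
gpus_per_node = 8  # assumption: typical 8-GPU node; not stated in the recipe

effective_batch = (per_device_train_batch_size
                   * gradient_accumulation_steps
                   * num_machines
                   * gpus_per_node)
print(effective_batch)  # → 256
```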