Implementation: Hugging Face Alignment Handbook DPOTrainer APO-Zero
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for APO-Zero preference alignment using TRL's DPOTrainer with loss_type: apo_zero, as configured by the alignment-handbook SmolLM3 recipe.
Description
DPOTrainer with loss_type="apo_zero" activates the Anchored Preference Optimization Zero loss variant. In the alignment-handbook's SmolLM3 pipeline, this is combined with padding_free=True and use_liger_kernel=True for maximum memory efficiency when training on long sequences.
This is the same DPOTrainer class used for standard DPO, configured specifically for the APO-Zero loss and these advanced training features.
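As a minimal sketch of what the APO-Zero objective does per preference pair (consistent with the Anchored Preference Optimization paper's "zero" variant; consult TRL's `dpo_trainer` source for the exact batched implementation), each log-ratio is anchored at zero rather than at the chosen/rejected difference used by DPO:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def apo_zero_loss(chosen_logratio: float, rejected_logratio: float,
                  beta: float = 0.05) -> float:
    """Illustrative APO-Zero loss for a single preference pair.

    chosen_logratio   = policy_logp(chosen)   - ref_logp(chosen)
    rejected_logratio = policy_logp(rejected) - ref_logp(rejected)

    Unlike DPO, which compares the two log-ratios against each other,
    APO-Zero anchors each one against zero independently.
    """
    loss_chosen = 1.0 - sigmoid(beta * chosen_logratio)  # push chosen log-ratio up
    loss_rejected = sigmoid(beta * rejected_logratio)    # push rejected log-ratio down
    return loss_chosen + loss_rejected

# At the anchor (both log-ratios zero) the loss is exactly 1.0.
print(apo_zero_loss(0.0, 0.0))  # → 1.0
```

When the policy has raised the chosen log-ratio and lowered the rejected one, the loss approaches zero, which is the behavior the anchoring is designed to produce.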
Usage
Use this for the preference alignment stage of advanced multi-stage post-training pipelines, particularly when reference-model-free optimization and padding-free training are desired.
Code Reference
Source Location
- Repository: alignment-handbook
- File: scripts/dpo.py (lines 122-130 for DPOTrainer init)
- Config: recipes/smollm3/dpo/apo.yaml (lines 1-66)
Signature
```python
# Same DPOTrainer class, configured with the APO-Zero loss
trainer = DPOTrainer(
    model,              # SFT checkpoint model
    ref_model,          # reference model (may be None for APO-Zero)
    args=training_args, # DPOConfig with loss_type="apo_zero"
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=...,
    processing_class=tokenizer,
    peft_config=get_peft_config(model_args),
)
```
Import
```python
from trl import DPOTrainer, ModelConfig, TrlParser, get_peft_config
from alignment import DPOConfig, ScriptArguments, get_dataset, get_model, get_tokenizer
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | AutoModelForCausalLM | Yes | SFT checkpoint model (e.g., SmolLM3-3B-SFT) |
| ref_model | AutoModelForCausalLM | No | Reference model; may be None for APO-Zero (e.g., when training with PEFT) |
| args | DPOConfig | Yes | Training config with loss_type="apo_zero" |
| args.loss_type | str | Yes | Must be "apo_zero" for APO-Zero loss |
| args.beta | float | Yes | APO-Zero beta parameter (e.g., 0.05) |
| args.padding_free | bool | No | Enable padding-free training (True for SmolLM3) |
| args.use_liger_kernel | bool | No | Enable Liger kernel optimization (True for SmolLM3) |
| args.max_length | int | Yes | Max sequence length (e.g., 24576) |
| train_dataset | Dataset | Yes | Preference data with chosen/rejected columns and chat_template_kwargs |
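The required/optional split above can be sanity-checked before launch with a small helper (a sketch against the table's field names; `validate_apo_args` is a hypothetical function, not part of the handbook or TRL):

```python
# Required config fields per the I/O contract table above.
REQUIRED = {"loss_type", "beta", "max_length"}

def validate_apo_args(args: dict) -> list[str]:
    """Return a list of problems with an APO-Zero DPOConfig-style dict."""
    problems = [f"missing required field: {k}" for k in sorted(REQUIRED - args.keys())]
    if args.get("loss_type") not in (None, "apo_zero"):
        problems.append("loss_type must be 'apo_zero' for APO-Zero training")
    return problems

# A config missing max_length and using the wrong loss_type yields two problems.
print(validate_apo_args({"loss_type": "dpo", "beta": 0.05}))
```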
Outputs
| Name | Type | Description |
|---|---|---|
| trainer.train() returns | TrainOutput | Training metrics including APO-Zero loss |
| checkpoints | Files | Saved to training_args.output_dir (e.g., data/SmolLM3-DPO) |
Usage Examples
APO-Zero YAML Config (SmolLM3)
```yaml
# From recipes/smollm3/dpo/apo.yaml
model_name_or_path: HuggingFaceTB/SmolLM3-3B-checkpoints
model_revision: it-SFT
torch_dtype: bfloat16
attn_implementation: flash_attention_2
trust_remote_code: true

# APO-Zero specific settings
loss_type: apo_zero
beta: 0.05
padding_free: true
use_liger_kernel: true

# Training settings
max_length: 24576
learning_rate: 3.0e-7
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
gradient_checkpointing: true
output_dir: data/SmolLM3-DPO

# Dataset mixture (2 preference splits)
dataset_mixture:
  datasets:
    - id: HuggingFaceTB/smoltalk2
      config: ultrafeedback_think
      columns: [chosen, rejected, prompt, chat_template_kwargs]
      weight: 1.0
    - id: HuggingFaceTB/smoltalk2
      config: ultrafeedback_no_think
      columns: [chosen, rejected, prompt, chat_template_kwargs]
      weight: 1.0
  seed: 42
```
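The mixture semantics above (equal weights, fixed seed) can be illustrated with a hedged sketch; `mix_datasets` is a hypothetical helper for exposition, not the handbook's actual loader:

```python
import random

def mix_datasets(splits: dict, weights: dict, seed: int = 42) -> list:
    """Take a weighted fraction of each split, then shuffle deterministically."""
    mixed = []
    for name, rows in splits.items():
        k = int(len(rows) * weights.get(name, 1.0))  # weight 1.0 keeps the full split
        mixed.extend(rows[:k])
    random.Random(seed).shuffle(mixed)  # fixed seed -> reproducible ordering
    return mixed

think = [("ultrafeedback_think", i) for i in range(4)]
no_think = [("ultrafeedback_no_think", i) for i in range(4)]
mixed = mix_datasets({"think": think, "no_think": no_think},
                     {"think": 1.0, "no_think": 1.0}, seed=42)
print(len(mixed))  # → 8
```

With both weights at 1.0, every row from both splits is kept; the seed only fixes the interleaving order.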
CLI Launch
```shell
# APO-Zero DPO with DeepSpeed ZeRO-3 on 8 nodes
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    --num_machines 8 \
    scripts/dpo.py \
    --config recipes/smollm3/dpo/apo.yaml
```
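Assuming 8 GPUs per node (an assumption; the recipe and launch command do not pin the per-node GPU count), the global effective batch size under this config works out as:

```python
# Values from recipes/smollm3/dpo/apo.yaml and the launch command above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_machines = 8
gpus_per_node = 8  # assumption: typical 8-GPU node; not stated in the recipe

effective_batch = (per_device_train_batch_size
                   * gradient_accumulation_steps
                   * num_machines
                   * gpus_per_node)
print(effective_batch)  # → 256
```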