Implementation: Hugging Face Alignment Handbook DPOTrainer Usage
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for preference-based alignment using TRL's DPOTrainer, as configured by the alignment-handbook DPO training script.
Description
DPOTrainer is TRL's implementation of the Direct Preference Optimization algorithm. In the alignment-handbook, it is initialized in scripts/dpo.py with both a policy model and a reference model (both loaded via get_model), along with a preference dataset containing chosen and rejected columns.
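The preference data itself is not shown in the script, so here is a minimal sketch of what one record can look like in the conversational format used by datasets such as UltraFeedback-binarized (all field values below are illustrative):
# Hypothetical preference record: "chosen" holds the preferred conversation,
# "rejected" the dispreferred one; TRL's DPOTrainer accepts this conversational format.
preference_example = {
    "chosen": [
        {"role": "user", "content": "What does the DPO beta parameter control?"},
        {"role": "assistant", "content": "It scales how strongly the policy is pulled away from the reference model."},
    ],
    "rejected": [
        {"role": "user", "content": "What does the DPO beta parameter control?"},
        {"role": "assistant", "content": "It sets the learning rate."},
    ],
}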
The alignment-handbook's DPO script adds:
- Dual model loading (policy + reference, both from the same SFT checkpoint)
- DDP bias buffer hack for torch distributed compatibility
- Removal of messages column if present (DPO uses chosen/rejected instead)
- Pad token fallback to EOS token
Usage
Use this when running the DPO stage of the alignment pipeline, after SFT training has produced a checkpoint. The SFT checkpoint serves as both the initial policy and the frozen reference model.
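A minimal sketch of the dual-load pattern, using the plain transformers API instead of the handbook's get_model helper (the checkpoint name is illustrative):
from transformers import AutoModelForCausalLM, AutoTokenizer

sft_checkpoint = "alignment-handbook/zephyr-7b-sft-full"  # illustrative SFT checkpoint
# The same weights are loaded twice: one copy is updated as the policy,
# the other stays frozen as the DPO reference.
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
ref_model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)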
Code Reference
Source Location
- Repository: alignment-handbook
- File: scripts/dpo.py (lines 122-130 for DPOTrainer init, lines 67-159 for full main function)
Signature
# From scripts/dpo.py:L122-130
trainer = DPOTrainer(
    model,                   # Policy model (AutoModelForCausalLM)
    ref_model,               # Frozen reference model (AutoModelForCausalLM)
    args=training_args,      # DPOConfig with beta, max_length, etc.
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=(
        dataset[script_args.dataset_test_split]
        if training_args.eval_strategy != "no"
        else None
    ),
    processing_class=tokenizer,               # PreTrainedTokenizer
    peft_config=get_peft_config(model_args),  # None or LoraConfig
)
Import
from trl import DPOTrainer, ModelConfig, TrlParser, get_peft_config
from alignment import DPOConfig, ScriptArguments, get_dataset, get_model, get_tokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | AutoModelForCausalLM | Yes | Policy model to be optimized (from SFT checkpoint) |
| ref_model | AutoModelForCausalLM | Yes | Frozen reference model (same checkpoint, not updated) |
| args | DPOConfig | Yes | Training hyperparameters including beta, max_length, max_prompt_length |
| args.beta | float | Yes | DPO temperature parameter (e.g., 0.01 for Zephyr, 0.05 for SmolLM3) |
| train_dataset | Dataset | Yes | Preference data with chosen and rejected columns |
| eval_dataset | Dataset | No | Evaluation split (None if eval_strategy="no") |
| processing_class | PreTrainedTokenizer | Yes | Tokenizer with pad_token set |
| peft_config | Optional[PeftConfig] | No | LoRA config (None for full fine-tuning) |
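A sketch of setting the key hyperparameters directly, assuming the handbook's DPOConfig exposes the same core fields as trl.DPOConfig (values are illustrative, loosely following the Zephyr recipe):
from trl import DPOConfig

# Illustrative values; real runs read these from a recipe YAML via TrlParser
training_args = DPOConfig(
    output_dir="data/zephyr-7b-dpo",
    beta=0.01,                # DPO temperature
    max_length=1024,          # prompt + completion token budget
    max_prompt_length=512,
    learning_rate=5.0e-7,
    num_train_epochs=1,
    per_device_train_batch_size=2,
)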
Outputs
| Name | Type | Description |
|---|---|---|
| trainer.train() returns | TrainOutput | Contains global_step, training_loss, metrics |
| checkpoints | Files | Saved to training_args.output_dir |
| metrics | Dict | Training and evaluation metrics (DPO loss, rewards/chosen, rewards/rejected) |
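Continuing from the Signature snippet above, a short sketch of reading these outputs via the standard transformers Trainer API (not specific to the handbook script):
train_result = trainer.train()
print(train_result.global_step)    # total optimizer steps taken
print(train_result.training_loss)  # mean training loss
print(train_result.metrics)        # dict with train_runtime, train_samples_per_second, ...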
Usage Examples
Full DPO Training Pipeline
from alignment import DPOConfig, ScriptArguments, get_dataset, get_model, get_tokenizer
from trl import DPOTrainer, ModelConfig, TrlParser, get_peft_config
# 1. Parse config
parser = TrlParser((ScriptArguments, DPOConfig, ModelConfig))
script_args, training_args, model_args = parser.parse_args_and_config()
# 2. Load model (twice: policy + reference), tokenizer, dataset
model = get_model(model_args, training_args)
ref_model = get_model(model_args, training_args)
tokenizer = get_tokenizer(model_args, training_args)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# 3. Load preference dataset
dataset = get_dataset(script_args)
for split in dataset:
    if "messages" in dataset[split].column_names:
        dataset[split] = dataset[split].remove_columns("messages")
# 4. Initialize DPO trainer
trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset.get("test"),
    processing_class=tokenizer,
    peft_config=get_peft_config(model_args),
)
# 5. Train and save
train_result = trainer.train()
trainer.save_model(training_args.output_dir)
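A possible continuation after step 5, using the standard transformers Trainer metric-logging helpers rather than anything handbook-specific:
# 6. Log and persist training metrics and trainer state
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()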
CLI Launch
# Full DPO training with ZeRO-3
accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
scripts/dpo.py \
--config recipes/zephyr-7b-beta/dpo/config_full.yaml
# QLoRA DPO on single GPU
accelerate launch --config_file recipes/accelerate_configs/ddp.yaml \
scripts/dpo.py \
--config recipes/zephyr-7b-beta/dpo/config_qlora.yaml