Heuristic: PacktPublishing LLM Engineers Handbook DPO Training Configuration
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Finetuning, Alignment |
| Last Updated | 2026-02-08 08:00 GMT |
Overview
DPO-specific training configuration using learning rate 2e-6, beta 0.5, and half-sequence-length constraints for preference-based alignment after SFT.
Description
This heuristic captures the differences between DPO (Direct Preference Optimization) and SFT training configurations. DPO requires a dramatically lower learning rate (2e-6 vs 3e-4 for SFT, a 150x reduction), a beta parameter of 0.5 controlling the KL divergence penalty, and halved sequence lengths for both prompts and completions. These choices prevent the model from diverging too far from the SFT checkpoint while learning human preference alignment.
Usage
Use this heuristic when running the DPO phase of fine-tuning, which always follows an SFT phase. DPO is triggered by setting `finetuning_type="dpo"` in the pipeline configuration. The DPO trainer requires a preference dataset with chosen/rejected answer pairs rather than instruction/answer pairs.
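The difference in dataset shape can be sketched as plain records; the field names below (`prompt`/`chosen`/`rejected` for DPO, `instruction`/`output` for SFT) are illustrative assumptions, not necessarily the pipeline's exact column names:

```python
# Hypothetical preference-pair record for DPO: one prompt, two ranked answers.
preference_example = {
    "prompt": "Explain what an LLM Twin is.",
    "chosen": "An LLM Twin is an AI character that mimics your writing style...",
    "rejected": "LLM stands for Large Language Model.",
}

# Hypothetical SFT record for comparison: a single instruction/answer pair.
sft_example = {
    "instruction": "Explain what an LLM Twin is.",
    "output": "An LLM Twin is an AI character that mimics your writing style...",
}
```

The DPO trainer consumes both completions per prompt, which is why it needs ranked pairs rather than single gold answers.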
The Insight (Rule of Thumb)
- Action: Use a 150x smaller learning rate for DPO compared to SFT, set beta to 0.5, and halve the max sequence length.
- Value:
- `learning_rate` = 2e-6 (vs 3e-4 for SFT)
- `beta` = 0.5 (KL divergence penalty strength)
- `max_length` = max_seq_length // 2 = 1024 (vs 2048 for SFT)
- `max_prompt_length` = max_seq_length // 2 = 1024
- `test_size` = 0.05 (5% held out for evaluation)
- `eval_steps` = 0.2 (evaluate every 20% of training)
- Trade-off: Lower learning rate means slower convergence but prevents catastrophic forgetting of SFT-learned capabilities. Beta of 0.5 is a moderate penalty; lower values allow more deviation from the reference policy.
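The values above can be collected into a single sketch; variable names here are illustrative, not the book's exact API:

```python
# Sketch of the DPO-vs-SFT hyperparameter deltas described in the Insight.
max_seq_length = 2048  # the SFT sequence length, assumed from the Value list

sft_lr = 3e-4
dpo_config = {
    "learning_rate": 2e-6,                     # 150x smaller than SFT's 3e-4
    "beta": 0.5,                               # KL divergence penalty strength
    "max_length": max_seq_length // 2,         # 1024, vs 2048 for SFT
    "max_prompt_length": max_seq_length // 2,  # 1024
    "test_size": 0.05,                         # 5% held out for evaluation
    "eval_steps": 0.2,                         # evaluate every 20% of training
}

lr_ratio = round(sft_lr / dpo_config["learning_rate"])  # 150
```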
Reasoning
DPO fine-tunes on preference pairs (chosen vs rejected), and the model is already well-initialized from SFT. A large learning rate would quickly destroy the SFT-learned patterns. The 150x reduction (2e-6 vs 3e-4) is standard practice for post-SFT alignment. The halved sequence length accounts for DPO needing to process both chosen and rejected completions, effectively doubling memory usage per sample. Beta=0.5 is a balanced choice: too low (0.1) makes the model ignore the reference distribution, too high (1.0) makes learning too conservative.
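Beta's role is easiest to see in the DPO loss formula itself. A minimal pure-Python sketch (scalar log-probabilities stand in for the per-sequence sums a real trainer would compute):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.5):
    """Per-example DPO loss from summed log-probabilities.

    pi_* are log-probs under the policy being trained; ref_* are
    log-probs under the frozen SFT reference model.
    """
    # How much the policy prefers each completion relative to the reference
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    # beta scales the implicit KL penalty: a small beta lets the policy
    # drift far from the reference; a large beta keeps it conservative.
    margin = beta * (chosen_logratio - rejected_logratio)
    # Negative log-sigmoid of the scaled preference margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization the policy equals the reference, every log-ratio is zero, and the loss is log 2 ≈ 0.693; training drives it toward zero by raising the chosen completion's log-ratio above the rejected one's.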
DPO training arguments from `llm_engineering/model/finetuning/finetune.py:175-194`:

```python
max_length=max_seq_length // 2,
max_prompt_length=max_seq_length // 2,
learning_rate=learning_rate,
num_train_epochs=num_train_epochs,
per_device_train_batch_size=per_device_train_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
per_device_eval_batch_size=per_device_train_batch_size,
warmup_steps=10,
eval_steps=0.2,
logging_steps=1,
```
DPO-specific learning rate from `llm_engineering/model/finetuning/finetune.py:309`:

```python
learning_rate=2e-6,
```
DPO beta and test split from `llm_engineering/model/finetuning/finetune.py:66-67,163`:

```python
beta: float = 0.5
dataset = dataset.train_test_split(test_size=0.05)
```
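The fractional `test_size` and `eval_steps` values translate into concrete counts as follows; the sample and step totals here are assumed for illustration (the Hugging Face `Trainer` interprets an `eval_steps` value below 1 as a ratio of total training steps):

```python
# Illustrative arithmetic for test_size=0.05 and eval_steps=0.2.
total_samples = 10_000   # assumed preference-pair dataset size
test_size = 0.05
eval_steps_ratio = 0.2

test_samples = int(total_samples * test_size)    # 500 pairs held out
train_samples = total_samples - test_samples     # 9500 pairs for training
total_steps = 1_000                              # assumed total optimizer steps
eval_every = int(total_steps * eval_steps_ratio) # evaluate every 200 steps
```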