Principle:PacktPublishing LLM Engineers Handbook Supervised Finetuning
| Field | Value |
|---|---|
| Principle Name | Supervised Finetuning |
| Category | Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) |
| Workflow | LLM_Finetuning |
| Repo | PacktPublishing/LLM-Engineers-Handbook |
| Implemented by | Implementation:PacktPublishing_LLM_Engineers_Handbook_SFTTrainer_Train |
Overview
Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) are two complementary training stages used to transform a pre-trained language model into an instruction-following, preference-aligned assistant. SFT teaches the model what to generate; DPO teaches it which generation is better.
Theory
Stage 1: Supervised Fine-Tuning (SFT)
SFT trains the model to follow instructions by optimizing the standard cross-entropy loss on instruction-response pairs. Given an instruction x and target response y = (y_1, y_2, ..., y_T), the model learns to maximize the probability of generating each token conditioned on the instruction and preceding tokens:
L_SFT = -SUM_{t=1}^{T} log P(y_t | y_{<t}, x)
This is the same objective as standard language modeling, but applied specifically to instruction-response formatted data. The model learns to:
- Understand instruction formats (e.g., Alpaca template).
- Generate relevant, coherent responses.
- Follow the stylistic patterns present in the training data.
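The SFT objective above can be sketched numerically. The toy function below is illustrative (not the repo's implementation): it takes the probabilities the model assigns to each target token and returns the summed negative log-likelihood.

```python
import math

def sft_loss(token_probs):
    """Cross-entropy SFT loss: the negative sum of log-probabilities
    the model assigns to each target token y_t, conditioned on the
    instruction x and the preceding tokens y_{<t}."""
    return -sum(math.log(p) for p in token_probs)

# Toy example: per-token probabilities for a 3-token target response.
# A confident model (probabilities near 1) yields a loss near 0.
loss = sft_loss([0.9, 0.8, 0.95])
```

Note that the loss is 0 only when every target token gets probability 1; any uncertainty on any token increases it.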
Stage 2: Direct Preference Optimization (DPO)
After SFT, DPO further aligns the model with human preferences. Instead of training a separate reward model (as in RLHF), DPO directly optimizes the policy using pairs of chosen (preferred) and rejected (non-preferred) responses.
The DPO loss function is:
L_DPO = -log sigma(beta * (log pi_theta(y_w|x) - log pi_ref(y_w|x)
- log pi_theta(y_l|x) + log pi_ref(y_l|x)))
Where:
- pi_theta: The policy model being trained.
- pi_ref: The reference model (typically the SFT checkpoint, kept frozen).
- y_w: The chosen (winning) response.
- y_l: The rejected (losing) response.
- beta: Temperature parameter controlling how strongly preference differences are enforced relative to staying close to the reference model (lower beta permits larger deviations).
- sigma: The sigmoid function.
DPO effectively increases the model's probability of generating chosen responses while decreasing the probability of rejected responses, relative to the reference model.
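As a sketch, the per-pair DPO loss can be computed directly from the four log-probabilities in the formula above; the function below is illustrative, not the repo's implementation. When the policy matches the reference the loss is log 2, and it falls as the policy favors the chosen response more than the reference does.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """DPO loss for a single preference pair.
    Arguments are log-probabilities of the chosen (w) and rejected (l)
    responses under the policy (pi_theta) and the frozen reference
    model (pi_ref)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(sigmoid(beta * margin))

# Policy identical to reference: margin = 0, loss = log 2 ~ 0.693.
baseline = dpo_loss(-5.0, -5.0, -7.0, -7.0)
# Policy upweights the chosen response: loss drops below log 2.
improved = dpo_loss(-4.0, -5.0, -7.0, -7.0)
```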
Two-Stage Training Pipeline
Pre-trained Model
|
v
[SFT Stage] -- Train on instruction-response pairs
| Loss: Cross-entropy
v
SFT Model (capable but unaligned)
|
v
[DPO Stage] -- Train on preference pairs (chosen vs. rejected)
| Loss: DPO preference loss
v
Aligned Model (capable AND preference-aligned)
The two-stage approach:
- SFT first teaches the model to be capable -- it can follow instructions and generate coherent responses.
- DPO second teaches the model to be aligned -- it prefers higher-quality, safer, more helpful responses.
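A minimal sketch of this two-stage pipeline using the HuggingFace TRL library might look as follows. The model ID and dataset names are placeholders, the hyperparameters are illustrative, and exact trainer argument names vary across TRL versions.

```python
# Sketch only: assumes trl, transformers, and datasets are installed.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig

model_id = "meta-llama/Llama-3.1-8B"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Stage 1: SFT on instruction-response pairs (cross-entropy loss).
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="sft-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # effective batch size 16
        bf16=True,
        optim="adamw_8bit",
    ),
    train_dataset=load_dataset("your/instruction-dataset", split="train"),
)
sft_trainer.train()

# Stage 2: DPO on preference pairs (chosen vs. rejected columns).
# ref_model=None tells TRL to clone and freeze the SFT checkpoint
# as the reference model pi_ref.
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    ref_model=None,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),
    train_dataset=load_dataset("your/preference-dataset", split="train"),
    processing_class=tokenizer,
)
dpo_trainer.train()
```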
Training Efficiency Techniques
Both stages use several techniques for efficient training on limited hardware:
- Gradient Accumulation: Simulates larger batch sizes by accumulating gradients over multiple forward/backward passes before updating weights. With `gradient_accumulation_steps=8` and `per_device_train_batch_size=2`, the effective batch size is 16.
- Mixed Precision Training: Uses BF16 (on supported hardware) or FP16 to reduce memory usage and speed up computation.
- 8-bit Optimizer: `adamw_8bit` stores optimizer states in 8 bits instead of 32, reducing optimizer state memory by roughly 75% compared to standard AdamW.
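Gradient accumulation produces exactly the same update as one full-batch step when each micro-batch's mean gradient is scaled by 1/accum_steps (and the batch divides evenly into micro-batches). The toy example below, with hypothetical function names and a simple quadratic per-example loss, illustrates the equivalence.

```python
def grad(w, x):
    # Gradient of the per-example loss (w - x)^2 with respect to w.
    return 2.0 * (w - x)

def full_batch_step(w, batch, lr=0.1):
    """One SGD step using the mean gradient over the whole batch."""
    g = sum(grad(w, x) for x in batch) / len(batch)
    return w - lr * g

def accumulated_step(w, batch, accum_steps, lr=0.1):
    """One SGD step with gradient accumulation: split the batch into
    accum_steps micro-batches, sum each micro-batch's mean gradient
    scaled by 1/accum_steps, then apply a single weight update."""
    micro = len(batch) // accum_steps
    g = 0.0
    for i in range(accum_steps):
        chunk = batch[i * micro:(i + 1) * micro]
        g += sum(grad(w, x) for x in chunk) / len(chunk) / accum_steps
    return w - lr * g
```

The accumulated update touches only one micro-batch's activations at a time, which is why it fits on limited hardware while matching the full-batch result.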
Mathematical Basis
SFT Loss
L_SFT = -SUM_{t=1}^{T} log P_theta(y_t | y_{<t}, x)
DPO Loss
L_DPO = -E_{(x, y_w, y_l) ~ D} [log sigma(beta * (log (pi_theta(y_w|x) / pi_ref(y_w|x))
- log (pi_theta(y_l|x) / pi_ref(y_l|x))))]
When to Use
- When training an LLM to follow instructions (SFT).
- When aligning a model with quality preferences without a separate reward model (DPO).
- When a two-stage training pipeline is desired for progressive capability and alignment.
When Not to Use
- When the model only needs to continue pre-training on raw text (use standard language modeling).
- When a full RLHF pipeline with a reward model is preferred (use PPO instead of DPO).
- When the pre-trained model already demonstrates adequate instruction-following capability.
Related Papers
- InstructGPT: Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback.
- DPO: Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
- TRL: HuggingFace Transformer Reinforcement Learning library.