Principle:PacktPublishing LLM Engineers Handbook Supervised Finetuning
| Field | Value |
|---|---|
| Principle Name | Supervised Finetuning |
| Category | Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) |
| Workflow | LLM_Finetuning |
| Repo | PacktPublishing/LLM-Engineers-Handbook |
| Implemented by | Implementation:PacktPublishing_LLM_Engineers_Handbook_SFTTrainer_Train |
Overview
Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) are two complementary training stages used to transform a pre-trained language model into an instruction-following, preference-aligned assistant. SFT teaches the model what to generate; DPO teaches it which generation is better.
Theory
Stage 1: Supervised Fine-Tuning (SFT)
SFT trains the model to follow instructions by optimizing the standard cross-entropy loss on instruction-response pairs. Given an instruction x and target response y = (y_1, y_2, ..., y_T), the model learns to maximize the probability of generating each token conditioned on the instruction and preceding tokens:
L_SFT = -SUM_{t=1}^{T} log P(y_t | y_{<t}, x)
This is the same objective as standard language modeling, but applied specifically to instruction-response formatted data. The model learns to:
- Understand instruction formats (e.g., Alpaca template).
- Generate relevant, coherent responses.
- Follow the stylistic patterns present in the training data.
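The SFT objective above can be sketched numerically. The toy function below is illustrative (not the repo's implementation): it takes the probabilities the model assigns to each target token and returns the summed negative log-likelihood.

```python
import math

def sft_loss(token_probs):
    """Cross-entropy SFT loss: the negative sum of log-probabilities
    the model assigns to each target token y_t, conditioned on the
    instruction x and the preceding tokens y_{<t}."""
    return -sum(math.log(p) for p in token_probs)

# Toy example: per-token probabilities for a 3-token target response.
# A confident model (probabilities near 1) yields a loss near 0.
loss = sft_loss([0.9, 0.8, 0.95])
```

Note that the loss is 0 only when every target token gets probability 1; any uncertainty on any token increases it.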
Stage 2: Direct Preference Optimization (DPO)
After SFT, DPO further aligns the model with human preferences. Instead of training a separate reward model (as in RLHF), DPO directly optimizes the policy using pairs of chosen (preferred) and rejected (non-preferred) responses.
The DPO loss function is:
L_DPO = -log sigma(beta * (log pi_theta(y_w|x) - log pi_ref(y_w|x)
- log pi_theta(y_l|x) + log pi_ref(y_l|x)))
Where:
- pi_theta: The policy model being trained.
- pi_ref: The reference model (typically the SFT checkpoint, kept frozen).
- y_w: The chosen (winning) response.
- y_l: The rejected (losing) response.
- beta: Temperature parameter controlling how strongly preference differences are enforced relative to staying close to the reference model (lower beta permits larger deviations).
- sigma: The sigmoid function.
DPO effectively increases the model's probability of generating chosen responses while decreasing the probability of rejected responses, relative to the reference model.
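As a sketch, the per-pair DPO loss can be computed directly from the four log-probabilities in the formula above; the function below is illustrative, not the repo's implementation. When the policy matches the reference the loss is log 2, and it falls as the policy favors the chosen response more than the reference does.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """DPO loss for a single preference pair.
    Arguments are log-probabilities of the chosen (w) and rejected (l)
    responses under the policy (pi_theta) and the frozen reference
    model (pi_ref)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(sigmoid(beta * margin))

# Policy identical to reference: margin = 0, loss = log 2 ~ 0.693.
baseline = dpo_loss(-5.0, -5.0, -7.0, -7.0)
# Policy upweights the chosen response: loss drops below log 2.
improved = dpo_loss(-4.0, -5.0, -7.0, -7.0)
```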
Two-Stage Training Pipeline
Pre-trained Model
|
v
[SFT Stage] -- Train on instruction-response pairs
| Loss: Cross-entropy
v
SFT Model (capable but unaligned)
|
v
[DPO Stage] -- Train on preference pairs (chosen vs. rejected)
| Loss: DPO preference loss
v
Aligned Model (capable AND preference-aligned)
The two-stage approach:
- SFT first teaches the model to be capable -- it can follow instructions and generate coherent responses.
- DPO second teaches the model to be aligned -- it prefers higher-quality, safer, more helpful responses.
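A minimal sketch of this two-stage pipeline using the HuggingFace TRL library might look as follows. The model ID and dataset names are placeholders, the hyperparameters are illustrative, and exact trainer argument names vary across TRL versions.

```python
# Sketch only: assumes trl, transformers, and datasets are installed.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig

model_id = "meta-llama/Llama-3.1-8B"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Stage 1: SFT on instruction-response pairs (cross-entropy loss).
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="sft-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # effective batch size 16
        bf16=True,
        optim="adamw_8bit",
    ),
    train_dataset=load_dataset("your/instruction-dataset", split="train"),
)
sft_trainer.train()

# Stage 2: DPO on preference pairs (chosen vs. rejected columns).
# ref_model=None tells TRL to clone and freeze the SFT checkpoint
# as the reference model pi_ref.
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    ref_model=None,
    args=DPOConfig(output_dir="dpo-out", beta=0.1),
    train_dataset=load_dataset("your/preference-dataset", split="train"),
    processing_class=tokenizer,
)
dpo_trainer.train()
```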
Training Efficiency Techniques
Both stages use several techniques for efficient training on limited hardware:
- Gradient Accumulation: Simulates larger batch sizes by accumulating gradients over multiple forward/backward passes before updating weights. With `gradient_accumulation_steps=8` and `per_device_train_batch_size=2`, the effective batch size is 16.
- Mixed Precision Training: Uses BF16 (on supported hardware) or FP16 to reduce memory usage and speed up computation.
- 8-bit Optimizer: `adamw_8bit` stores optimizer states in 8 bits instead of 32, reducing optimizer state memory by roughly 75% compared to standard AdamW.
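Gradient accumulation produces exactly the same update as one full-batch step when each micro-batch's mean gradient is scaled by 1/accum_steps (and the batch divides evenly into micro-batches). The toy example below, with hypothetical function names and a simple quadratic per-example loss, illustrates the equivalence.

```python
def grad(w, x):
    # Gradient of the per-example loss (w - x)^2 with respect to w.
    return 2.0 * (w - x)

def full_batch_step(w, batch, lr=0.1):
    """One SGD step using the mean gradient over the whole batch."""
    g = sum(grad(w, x) for x in batch) / len(batch)
    return w - lr * g

def accumulated_step(w, batch, accum_steps, lr=0.1):
    """One SGD step with gradient accumulation: split the batch into
    accum_steps micro-batches, sum each micro-batch's mean gradient
    scaled by 1/accum_steps, then apply a single weight update."""
    micro = len(batch) // accum_steps
    g = 0.0
    for i in range(accum_steps):
        chunk = batch[i * micro:(i + 1) * micro]
        g += sum(grad(w, x) for x in chunk) / len(chunk) / accum_steps
    return w - lr * g
```

The accumulated update touches only one micro-batch's activations at a time, which is why it fits on limited hardware while matching the full-batch result.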
Mathematical Basis
SFT Loss
L_SFT = -SUM_{t=1}^{T} log P_theta(y_t | y_{<t}, x)
DPO Loss
L_DPO = -E_{(x, y_w, y_l) ~ D} [log sigma(beta * (log (pi_theta(y_w|x) / pi_ref(y_w|x))
- log (pi_theta(y_l|x) / pi_ref(y_l|x))))]
When to Use
- When training an LLM to follow instructions (SFT).
- When aligning a model with quality preferences without a separate reward model (DPO).
- When a two-stage training pipeline is desired for progressive capability and alignment.
When Not to Use
- When the model only needs to continue pre-training on raw text (use standard language modeling).
- When a full RLHF pipeline with a reward model is preferred (use PPO instead of DPO).
- When the pre-trained model already demonstrates adequate instruction-following capability.
Related Papers
- InstructGPT: Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback.
- DPO: Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
- TRL: HuggingFace Transformer Reinforcement Learning library.