
Principle:PacktPublishing LLM Engineers Handbook Supervised Finetuning

From Leeroopedia


Principle Name: Supervised Finetuning
Category: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)
Workflow: LLM_Finetuning
Repo: PacktPublishing/LLM-Engineers-Handbook
Implemented by: Implementation:PacktPublishing_LLM_Engineers_Handbook_SFTTrainer_Train

Overview

Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) are two complementary training stages used to transform a pre-trained language model into an instruction-following, preference-aligned assistant. SFT teaches the model what to generate; DPO teaches it which generation is better.

Theory

Stage 1: Supervised Fine-Tuning (SFT)

SFT trains the model to follow instructions by optimizing the standard cross-entropy loss on instruction-response pairs. Given an instruction x and target response y = (y_1, y_2, ..., y_T), the model learns to maximize the probability of generating each token conditioned on the instruction and preceding tokens:

L_SFT = -SUM_{t=1}^{T} log P(y_t | y_{<t}, x)

This is the same objective as standard language modeling, but applied specifically to instruction-response formatted data. The model learns to:

  • Understand instruction formats (e.g., Alpaca template).
  • Generate relevant, coherent responses.
  • Follow the stylistic patterns present in the training data.
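The cross-entropy objective above is just the negative sum of per-token log-probabilities of the target response. A minimal sketch in plain Python (the token probabilities are made up for illustration):

```python
import math

def sft_loss(token_logprobs):
    """Negative log-likelihood of a target response.

    token_logprobs: one log P(y_t | y_<t, x) per response token,
    as a language model's forward pass would produce.
    """
    return -sum(token_logprobs)

# Toy example: a 3-token response whose tokens the model assigns
# probabilities 0.5, 0.25, and 0.8. The loss is -log(0.5 * 0.25 * 0.8).
probs = [0.5, 0.25, 0.8]
loss = sft_loss([math.log(p) for p in probs])
```

Minimizing this loss pushes each token's conditional probability toward 1, which is exactly the standard language-modeling objective restricted to the response tokens.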

Stage 2: Direct Preference Optimization (DPO)

After SFT, DPO further aligns the model with human preferences. Instead of training a separate reward model (as in RLHF), DPO directly optimizes the policy using pairs of chosen (preferred) and rejected (non-preferred) responses.

The DPO loss function is:

L_DPO = -log sigma(beta * (log pi_theta(y_w|x) - log pi_ref(y_w|x)
                          - log pi_theta(y_l|x) + log pi_ref(y_l|x)))

Where:

  • pi_theta: The policy model being trained.
  • pi_ref: The reference model (typically the SFT checkpoint, kept frozen).
  • y_w: The chosen (winning) response.
  • y_l: The rejected (losing) response.
  • beta: Temperature parameter that scales the implicit reward. Higher values penalize deviation from the reference model more strongly, keeping the policy close to pi_ref; lower values allow larger preference-driven shifts. Values around 0.1 are common.
  • sigma: The sigmoid function.

DPO effectively increases the model's probability of generating chosen responses while decreasing the probability of rejected responses, relative to the reference model.
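The per-pair loss can be computed directly from sequence log-probabilities. A minimal sketch in plain Python (the log-probability values below are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w_theta, logp_w_ref, logp_l_theta, logp_l_ref, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_w_theta - logp_w_ref) - (logp_l_theta - logp_l_ref)
    return -math.log(sigmoid(beta * margin))

# Sanity check: if the policy equals the reference, the margin is zero
# and the loss is -log(0.5) = log(2) regardless of beta.
loss_at_init = dpo_loss(-12.0, -12.0, -15.0, -15.0)
```

As the policy raises the chosen response's probability and lowers the rejected one's (relative to the reference), the margin grows and the loss falls below log(2).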

Two-Stage Training Pipeline

Pre-trained Model
       |
       v
  [SFT Stage]  -- Train on instruction-response pairs
       |           Loss: Cross-entropy
       v
  SFT Model (capable but unaligned)
       |
       v
  [DPO Stage]  -- Train on preference pairs (chosen vs. rejected)
       |           Loss: DPO preference loss
       v
  Aligned Model (capable AND preference-aligned)

The two-stage approach:

  1. SFT first teaches the model to be capable -- it can follow instructions and generate coherent responses.
  2. DPO second teaches the model to be aligned -- it prefers higher-quality, safer, more helpful responses.
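The two stages above can be sketched with Hugging Face TRL, whose SFTTrainer the implementation page references. The base model name, dataset identifiers, and output paths below are placeholders, and details such as LoRA adapters and chat templating are omitted:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

# Stage 1: SFT on instruction-response pairs.
# "your-org/instruction-dataset" is a placeholder identifier.
sft_dataset = load_dataset("your-org/instruction-dataset", split="train")
sft_trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",     # placeholder base model
    train_dataset=sft_dataset,
    args=SFTConfig(
        output_dir="sft-model",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,   # effective batch size 16
        bf16=True,
        optim="adamw_8bit",
    ),
)
sft_trainer.train()

# Stage 2: DPO on preference pairs (prompt / chosen / rejected columns).
pref_dataset = load_dataset("your-org/preference-dataset", split="train")
dpo_trainer = DPOTrainer(
    model="sft-model",   # resume from the SFT checkpoint
    ref_model=None,      # TRL clones the policy as the frozen reference
    train_dataset=pref_dataset,
    args=DPOConfig(
        output_dir="dpo-model",
        beta=0.1,        # preference temperature
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
    ),
)
dpo_trainer.train()
```

Passing ref_model=None lets TRL create the frozen reference from the initial policy weights, which matches the convention of using the SFT checkpoint as pi_ref.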

Training Efficiency Techniques

Both stages use several techniques for efficient training on limited hardware:

  • Gradient Accumulation: Simulates larger batch sizes by accumulating gradients over multiple forward/backward passes before updating weights. With gradient_accumulation_steps=8 and per_device_train_batch_size=2, the effective batch size is 16.
  • Mixed Precision Training: Uses BF16 (on supported hardware) or FP16 to reduce memory usage and speed up computation.
  • 8-bit Optimizer: adamw_8bit stores the AdamW optimizer states in 8-bit precision instead of 32-bit, reducing optimizer state memory by roughly 75% compared to standard AdamW.
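Gradient accumulation can be illustrated with a toy one-parameter model: eight micro-batches of two examples each are processed before a single weight update, so the update is identical to one full-batch step of size 16. All numbers are illustrative.

```python
# Toy sketch: fit a scalar w to 16 targets under loss 0.5 * (w - t)^2,
# using plain Python in place of a deep-learning framework.
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
targets = [float(t) for t in range(16)]  # one effective batch of 16 examples

w = 0.0
lr = 0.1
accumulated_grad = 0.0
for step in range(gradient_accumulation_steps):
    start = step * per_device_train_batch_size
    micro_batch = targets[start:start + per_device_train_batch_size]
    # Gradient of the loss w.r.t. w for this micro-batch, scaled so the
    # running total equals an average over the full effective batch.
    accumulated_grad += sum(w - t for t in micro_batch) / len(targets)

# One optimizer update after 8 forward/backward passes -- the same step
# a single full batch of 16 examples would have produced.
w -= lr * accumulated_grad
```

Because the weights are frozen between micro-batches, accumulating the scaled gradients is mathematically equivalent to a single large-batch step; only peak activation memory differs.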

Mathematical Basis

SFT Loss

L_SFT = -SUM_{t=1}^{T} log P_theta(y_t | y_{<t}, x)

DPO Loss

L_DPO = -E_{(x, y_w, y_l) ~ D} [log sigma(beta * (log (pi_theta(y_w|x) / pi_ref(y_w|x))
                                                   - log (pi_theta(y_l|x) / pi_ref(y_l|x))))]

When to Use

  • When training an LLM to follow instructions (SFT).
  • When aligning a model with quality preferences without a separate reward model (DPO).
  • When a two-stage training pipeline is desired for progressive capability and alignment.

When Not to Use

  • When the model only needs to continue pre-training on raw text (use standard language modeling).
  • When a full RLHF pipeline with a reward model is preferred (use PPO instead of DPO).
  • When the pre-trained model already demonstrates adequate instruction-following capability.
