Principle:Huggingface Open r1 Supervised Fine Tuning
Overview
A training methodology that adapts pretrained language models to follow instructions or generate specific response patterns by training on curated input-output pairs.
Description
Supervised Fine-Tuning (SFT) is the process of continuing to train a pretrained language model on a labeled dataset of instruction-response pairs. Unlike pretraining (which uses unsupervised next-token prediction on raw text), SFT uses curated examples to steer the model toward desired behavior.
SFT is commonly used for:
- Distilling reasoning capabilities from teacher models (e.g., DeepSeek-R1) into smaller student models.
- Teaching models to follow chat formats by providing structured conversation examples.
- Adapting general models to specific domains using domain-relevant instruction-response pairs.
The training loop handles gradient accumulation, distributed training across multiple GPUs, checkpoint saving/resuming, and integration with logging frameworks.
Usage
Use when you have a dataset of instruction-response pairs and want to teach a model to produce similar responses. Preferred when you have high-quality labeled data and want deterministic training behavior (as opposed to RL-based methods like GRPO).
Theoretical Basis
The core of SFT is cross-entropy loss computed over the target tokens. Given an input sequence x and a target sequence y, the model is trained to minimize:
L = -sum(log P(y_t | y_<t, x))
where P(y_t | y_<t, x) is the model's predicted probability of the correct next token at each position in the target.
Gradient accumulation allows training with effectively larger batch sizes than GPU memory permits. Instead of updating weights after every micro-batch, gradients are accumulated over multiple forward-backward passes before a single optimizer step.
Distributed training (via DeepSpeed ZeRO or FSDP) partitions model states across multiple GPUs to handle models that do not fit in a single GPU's memory.
The pseudocode for the SFT training loop is:
for batch in train_dataset:
loss = cross_entropy(model(batch.input), batch.target)
loss.backward()
if step % gradient_accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
if step % save_steps == 0:
save_checkpoint()
| Concept | Description |
|---|---|
| Cross-entropy loss | Standard classification loss applied token-by-token over the target sequence. Only target tokens contribute to the loss; input/prompt tokens are masked. |
| Gradient accumulation | Accumulates gradients over gradient_accumulation_steps micro-batches before performing a weight update, simulating a larger effective batch size.
|
| Distributed training | Splits model parameters, optimizer states, and/or gradients across GPUs using DeepSpeed ZeRO stages or PyTorch FSDP. |
| Checkpoint saving | Periodically saves model weights, optimizer state, and scheduler state so training can resume from the last checkpoint. |