Principle:Huggingface Open r1 Supervised Fine Tuning

Overview

A training methodology that adapts pretrained language models to follow instructions or generate specific response patterns by training on curated input-output pairs.

Description

Supervised Fine-Tuning (SFT) is the process of continuing to train a pretrained language model on a labeled dataset of instruction-response pairs. Unlike pretraining (which uses unsupervised next-token prediction on raw text), SFT uses curated examples to steer the model toward desired behavior.

SFT is commonly used for:

Distilling reasoning capabilities from teacher models (e.g., DeepSeek-R1) into smaller student models.
Teaching models to follow chat formats by providing structured conversation examples.
Adapting general models to specific domains using domain-relevant instruction-response pairs.

The training loop handles gradient accumulation, distributed training across multiple GPUs, checkpoint saving/resuming, and integration with logging frameworks.

Usage

Use when you have a dataset of instruction-response pairs and want to teach a model to produce similar responses. Preferred when you have high-quality labeled data and want deterministic training behavior (as opposed to RL-based methods like GRPO).

Theoretical Basis

The core of SFT is cross-entropy loss computed over the target tokens. Given an input sequence x and a target sequence y, the model is trained to minimize:

L = -sum(log P(y_t | y_<t, x))

where P(y_t | y_<t, x) is the model's predicted probability of the correct next token at each position in the target.

Gradient accumulation allows training with effectively larger batch sizes than GPU memory permits. Instead of updating weights after every micro-batch, gradients are accumulated over multiple forward-backward passes before a single optimizer step.

Distributed training (via DeepSpeed ZeRO or FSDP) partitions model states across multiple GPUs to handle models that do not fit in a single GPU's memory.

The pseudocode for the SFT training loop is:

for batch in train_dataset:
    loss = cross_entropy(model(batch.input), batch.target)
    loss.backward()
    if step % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    if step % save_steps == 0:
        save_checkpoint()

Concept	Description
Cross-entropy loss	Standard classification loss applied token-by-token over the target sequence. Only target tokens contribute to the loss; input/prompt tokens are masked.
Gradient accumulation	Accumulates gradients over `gradient_accumulation_steps` micro-batches before performing a weight update, simulating a larger effective batch size.
Distributed training	Splits model parameters, optimizer states, and/or gradients across GPUs using DeepSpeed ZeRO stages or PyTorch FSDP.
Checkpoint saving	Periodically saves model weights, optimizer state, and scheduler state so training can resume from the last checkpoint.

Related Pages

Implementation

Implementation:Huggingface_Open_r1_SFTTrainer_Usage

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment