
Principle:Huggingface Alignment handbook Supervised Finetuning

From Leeroopedia


Knowledge Sources
Domains NLP, Deep_Learning, Training
Last Updated 2026-02-07 00:00 GMT

Overview

A training technique that adapts a pretrained language model to follow instructions by training on curated demonstration data with a standard cross-entropy language modeling objective.

Description

Supervised Fine-Tuning (SFT) is the first stage of the RLHF alignment pipeline. It takes a pretrained base model and trains it on high-quality instruction-response pairs to teach the model to follow human instructions. The training uses standard next-token prediction (causal language modeling) on formatted conversation data.

SFT addresses the gap between a pretrained model's capability (predicting next tokens in web text) and the desired behavior (following user instructions helpfully and safely). By training on curated demonstrations, the model learns the expected input-output format and develops instruction-following ability.
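Learning the expected input-output format means rendering each conversation into a single training string before tokenization. The sketch below uses an illustrative ChatML-style layout; the actual template is model-specific (the handbook applies the tokenizer's Jinja2 chat template, as noted under the training features below), and the delimiter tokens here are assumptions, not the handbook's exact format.

```python
def format_conversation(messages):
    """Render a list of {role, content} dicts into one training string.

    The <|role|> markers and </s> terminator are illustrative placeholders;
    real chat templates vary by model family.
    """
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}</s>")
    return "\n".join(parts)

demo = [
    {"role": "user", "content": "What is SFT?"},
    {"role": "assistant", "content": "Supervised fine-tuning on demonstrations."},
]
text = format_conversation(demo)
```

Once rendered this way, the string is tokenized and trained on with ordinary next-token prediction, so the model absorbs both the content and the turn structure.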

In the alignment-handbook, SFT serves as the foundation for subsequent preference optimization stages (DPO, ORPO). The SFT checkpoint becomes the starting point for preference learning.

Usage

Use supervised fine-tuning when:

  • Adapting a base pretrained model to follow conversational instructions
  • Creating the first stage of a multi-stage alignment pipeline (SFT → DPO)
  • Fine-tuning on domain-specific instruction data
  • The training data consists of demonstration conversations with clear input-output pairs

Theoretical Basis

SFT minimizes the standard cross-entropy loss over the training data:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$

Where $x_t$ is the token at position $t$ and $\theta$ are the model parameters. When assistant_only_loss is enabled, the loss is computed only over assistant response tokens, not the prompt/user tokens:

# Abstract SFT algorithm (NOT a real implementation)
for batch in training_data:
    tokens = tokenize(format_conversation(batch))
    if assistant_only_loss:
        loss_mask = create_assistant_mask(tokens)  # 1 on assistant tokens, 0 elsewhere
    else:
        loss_mask = ones_like(tokens)
    logits = model(tokens)
    # Targets are the inputs shifted by one position (next-token prediction)
    loss = cross_entropy(logits[:-1], tokens[1:], mask=loss_mask[1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
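The masked loss in the pseudocode above can be made concrete with a small, self-contained sketch: compute the per-token negative log-likelihood from raw logits via a softmax, then average only over positions where the mask is 1 (the assistant tokens). The function name and toy values are for illustration only.

```python
import math

def masked_cross_entropy(logits, targets, mask):
    """Mean negative log-likelihood over positions where mask == 1.

    logits:  per-position lists of unnormalized scores over the vocabulary
    targets: per-position target token ids
    mask:    1 to include a position in the loss, 0 to skip it (e.g. prompt tokens)
    """
    total, count = 0.0, 0
    for scores, target, m in zip(logits, targets, mask):
        if m == 0:
            continue
        # log softmax: score of the target minus log of the partition function
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += -(scores[target] - log_z)
        count += 1
    return total / max(count, 1)

# Toy example: 2 positions, vocabulary of size 2, loss only on position 1
loss = masked_cross_entropy(
    logits=[[2.0, 0.0], [0.0, 2.0]],
    targets=[0, 1],
    mask=[0, 1],
)
```

Skipping masked positions entirely (rather than multiplying their loss by zero and still counting them) keeps the mean loss comparable across batches with different prompt lengths.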

Key training features in the alignment-handbook:

  • Sequence packing: Multiple short conversations are packed into a single sequence to maximize GPU utilization
  • Gradient checkpointing: Trades compute for memory by recomputing activations during backward pass
  • Chat template formatting: Conversations are formatted using Jinja2 templates before tokenization
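The packing idea can be illustrated with a simplified greedy scheme that groups whole conversations into fixed-length bins. Note this is a sketch only: TRL-style packing typically concatenates tokenized examples and slices the stream into equal-length chunks rather than bin-packing whole conversations, so treat the function below as a conceptual stand-in.

```python
def pack_sequences(lengths, max_len):
    """Greedily group sequences (given by token length) into bins of at most
    max_len tokens. Returns a list of bins, each a list of sequence indices.

    Simplified illustration of why packing helps: short conversations share a
    sequence instead of each being padded to max_len.
    """
    bins, current, used = [], [], 0
    for i, n in enumerate(lengths):
        if used + n > max_len and current:
            bins.append(current)       # close the full bin
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        bins.append(current)
    return bins

# Five conversations of varying length packed into 512-token sequences
packed = pack_sequences([100, 300, 200, 350, 50], max_len=512)
```

With padding alone, the five conversations above would occupy five 512-token sequences; packed, they fit in three, which is the GPU-utilization win the handbook's packing option targets.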

Related Pages

Implemented By
