Principle:Hiyouga LLaMA Factory Supervised Fine Tuning

Knowledge Sources	Hiyouga_LLaMA_Factory; Training language models to follow instructions with human feedback
Domains	Natural Language Processing, Transfer Learning, Language Model Alignment
Last Updated	2026-02-06 19:00 GMT

Overview

A post-pretraining paradigm that adapts a pretrained language model to follow instructions and perform specific tasks by training it on curated input-output demonstration pairs.

Description

Supervised Fine-Tuning (SFT) is the process of continuing the training of a pretrained language model on a labeled dataset of prompt-completion pairs. Unlike pretraining, which learns general language representations from raw text via next-token prediction, SFT focuses on teaching the model to produce desired outputs for given inputs, typically instruction-response pairs.

SFT occupies a central position in the modern LLM alignment pipeline. After a foundation model is pretrained on large corpora, SFT is typically the first alignment stage, followed optionally by preference optimization methods such as RLHF, DPO, or KTO. SFT transforms a general-purpose text completion model into an instruction-following assistant.

The key challenges in SFT include:

Data quality: The quality and diversity of demonstration examples directly determine model capability.
Catastrophic forgetting: Excessive fine-tuning can degrade the general knowledge acquired during pretraining.
Loss masking: Only the response portion of each example should contribute to the training loss; the prompt tokens are masked with a special ignore index.
Multi-turn handling: Real conversations involve multiple user-assistant turns, requiring careful label construction across turn boundaries.

Usage

Use SFT when you want to:

Adapt a pretrained model to follow instructions in a specific format or domain.
Teach a model to perform a particular task (e.g., summarization, translation, code generation).
Create a base aligned model before applying preference optimization (DPO, KTO, PPO).
Fine-tune with full-parameter updates, partially frozen layers, or parameter-efficient methods such as LoRA.

SFT is appropriate when you have high-quality labeled examples of the desired model behavior and want direct, demonstration-driven learning rather than reward-based optimization.

Theoretical Basis

Training Objective

SFT minimizes the standard cross-entropy loss (equivalently, negative log-likelihood) over the response tokens only:

$ℒ_{SFT} = - \sum_{t \in ℛ} \log P_{θ} (y_{t} ∣ y_{< t}, x)$

where $x$ is the prompt (input tokens), $y$ is the response (output tokens), $ℛ$ is the set of response token positions, and $θ$ are the model parameters. Prompt tokens are excluded from the loss computation by assigning them a special IGNORE_INDEX label (typically -100), ensuring the model is only penalized for its predictions on the response portion.
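The objective can be illustrated with a minimal pure-Python sketch. It assumes one list of vocabulary logits per position and that any causal shift between inputs and labels has already been applied; the helper names `log_softmax` and `sft_loss` are hypothetical and not taken from LLaMA-Factory.

```python
import math

IGNORE_INDEX = -100  # label value marking positions excluded from the loss


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def sft_loss(logits_per_position, labels):
    """Sum of -log P(label) over positions whose label is not IGNORE_INDEX.

    logits_per_position[t] holds the vocabulary logits that predict labels[t];
    the causal shift between inputs and labels is assumed to be done already.
    """
    total = 0.0
    for logits, label in zip(logits_per_position, labels):
        if label == IGNORE_INDEX:
            continue  # prompt position: contributes nothing to the loss
        total -= log_softmax(logits)[label]
    return total


# Toy vocabulary of 3 tokens; the first two positions belong to the prompt.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 0]
loss = sft_loss(logits, labels)
print(round(loss, 4))
```

The sketch sums over response positions exactly as in the formula. Training frameworks commonly average over the non-ignored positions instead, which rescales the loss but leaves its minimizer unchanged.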

Label Masking

For a sequence of length $T = m + n$ consisting of prompt tokens $[x_{1}, \dots, x_{m}]$ followed by response tokens $[y_{1}, \dots, y_{n}]$, the label vector is constructed as:

$\text{labels}_{t} = \begin{cases} \text{IGNORE\_INDEX} & \text{if } t \leq m \\ y_{t - m} & \text{if } m < t \leq T \end{cases}$

In multi-turn conversations, this masking may apply independently to each turn's prompt segment, optionally masking earlier conversation history to give higher priority to later turns.
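A minimal sketch of multi-turn label construction follows. It assumes each turn is already tokenized into a `(prompt_ids, response_ids)` pair, and it interprets "masking earlier conversation history" as supervising only the final response. The function `build_labels` and its `mask_history` flag are hypothetical names used for illustration.

```python
IGNORE_INDEX = -100  # label value marking positions excluded from the loss


def build_labels(turns, mask_history=False):
    """Concatenate (prompt_ids, response_ids) turns into input_ids and labels.

    Each turn's prompt segment is always masked. With mask_history=True
    (an assumed interpretation), the responses of every turn except the last
    are masked as well, so only the final response is supervised.
    """
    input_ids, labels = [], []
    last = len(turns) - 1
    for i, (prompt_ids, response_ids) in enumerate(turns):
        input_ids += prompt_ids + response_ids
        labels += [IGNORE_INDEX] * len(prompt_ids)
        if mask_history and i < last:
            labels += [IGNORE_INDEX] * len(response_ids)
        else:
            labels += response_ids
    return input_ids, labels


# Two turns with made-up token IDs.
turns = [([11, 12], [21, 22, 23]), ([13], [24, 25])]
input_ids, labels = build_labels(turns)
_, last_only = build_labels(turns, mask_history=True)
```

In both modes `labels` stays the same length as `input_ids`, so the two lists can be batched together without further alignment.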

Evaluation Metrics

SFT models are evaluated using two complementary approaches:

Token-level accuracy: The fraction of response positions where the model's top-1 prediction matches the ground truth label, computed by comparing $\arg \max P_{θ} (y_{t} ∣ y_{< t}, x)$ against $y_{t}$.
Generation-quality metrics: When using predict-with-generate mode, the model generates text autoregressively and is evaluated against reference text using ROUGE (ROUGE-1, ROUGE-2, ROUGE-L) and BLEU-4 scores.
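Token-level accuracy can be sketched in a few lines of plain Python. The sketch assumes per-position logits that are already aligned with the labels and skips masked positions; `token_accuracy` is a hypothetical helper, not a LLaMA-Factory function.

```python
IGNORE_INDEX = -100  # label value marking positions excluded from the metric


def token_accuracy(logits_per_position, labels):
    """Fraction of non-masked positions whose argmax equals the label."""
    correct = counted = 0
    for logits, label in zip(logits_per_position, labels):
        if label == IGNORE_INDEX:
            continue  # prompt position: not scored
        predicted = max(range(len(logits)), key=logits.__getitem__)
        correct += predicted == label
        counted += 1
    return correct / counted if counted else 0.0


# Toy vocabulary of 2 tokens; the first position is a masked prompt token.
logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [IGNORE_INDEX, 0, 1, 1]
accuracy = token_accuracy(logits, labels)
```

Because this metric is computed under teacher forcing (each prediction conditions on the ground-truth prefix), it can look better than free-running generation quality, which is why it is paired with ROUGE and BLEU.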

Sequence Packing

To maximize GPU utilization, multiple shorter examples can be packed into a single sequence up to the maximum context length. A greedy knapsack algorithm assigns examples to bins such that the total length per bin does not exceed the cutoff, minimizing wasted padding tokens. During packing, block-diagonal attention masks ensure that tokens from different examples do not attend to each other.
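The packing idea can be sketched as follows. This uses first-fit-decreasing bin packing as a stand-in for the greedy knapsack step, which may differ from the heuristic LLaMA-Factory actually uses, and it assumes causal attention inside each packed example. Both function names are hypothetical.

```python
def greedy_pack(lengths, cutoff):
    """First-fit-decreasing packing: returns bins of example indices."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, free = [], []  # free[b] is the remaining room in bins[b]
    for i in order:
        if lengths[i] > cutoff:
            continue  # cannot fit anywhere; a real pipeline would truncate or drop
        for b, room in enumerate(free):
            if lengths[i] <= room:
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            bins.append([i])
            free.append(cutoff - lengths[i])
    return bins


def block_diagonal_causal_mask(segment_lengths):
    """mask[q][k] is True when query position q may attend to key position k."""
    segment_of = [s for s, n in enumerate(segment_lengths) for _ in range(n)]
    total = len(segment_of)
    return [
        [segment_of[q] == segment_of[k] and k <= q for k in range(total)]
        for q in range(total)
    ]


lengths = [5, 3, 4, 2, 6]
bins = greedy_pack(lengths, cutoff=8)
mask = block_diagonal_causal_mask([2, 3])  # two examples packed together
```

With these toy lengths and a cutoff of 8, two of the three bins are filled exactly and only the last bin needs padding. The mask is lower-triangular within each example and False everywhere across example boundaries.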
