Principle: Unsloth Supervised Fine-Tuning (unslothai)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A training paradigm that optimizes a language model on curated input-output pairs using standard cross-entropy loss, teaching the model to generate desired responses given specific instructions or prompts.
Description
Supervised Fine-Tuning (SFT) is the most common method for adapting pretrained language models to specific tasks. The model is trained to minimize the negative log-likelihood of target tokens given the input context. In the instruction-tuning variant, the model learns to follow instructions by training on (instruction, response) pairs.
Key aspects of SFT in the Unsloth context:
- Fused Cross-Entropy: Unsloth replaces the standard cross-entropy computation with a chunked, fused Triton kernel that avoids materializing the full logits tensor, reducing peak memory by up to 50%.
- Padding-Free Batching: Sequences are packed without padding tokens to maximize GPU utilization, with position IDs and attention masks adjusted to prevent cross-contamination between packed sequences.
- Separate Embedding Learning Rate: When embed_tokens and lm_head are included in modules_to_save, a separate (typically lower) learning rate is applied to prevent catastrophic forgetting of the embedding layer.
- TRL Compatibility: Unsloth patches TRL's SFTTrainer to use its optimized training loop while maintaining full API compatibility.
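The padding-free batching idea above can be sketched in plain Python (a simplified illustration, not Unsloth's actual implementation; the function name is ours): sequences are concatenated into one row, position IDs restart at every boundary, and a sequence-ID vector records where each original sequence lives so a block-diagonal attention mask can be built from it.

```python
def pack_sequences(sequences):
    """Pack variable-length sequences into one row without padding.

    Returns the concatenated token IDs, per-token position IDs that
    restart at each sequence boundary, and a sequence-ID vector from
    which a block-diagonal attention mask can be derived.
    """
    input_ids, position_ids, sequence_ids = [], [], []
    for seq_idx, seq in enumerate(sequences):
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))       # positions restart at 0
        sequence_ids.extend([seq_idx] * len(seq))  # marks sequence membership
    return input_ids, position_ids, sequence_ids

# Two sequences packed into a single row, with no pad tokens
ids, pos, seq = pack_sequences([[101, 7, 8], [101, 9]])
```

Attention is then restricted so a token only attends to tokens with the same sequence ID, which prevents cross-contamination between packed sequences.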
Usage
Use this principle after model loading and LoRA injection, when training on conversational or instruction datasets. It is the standard training method for QLoRA fine-tuning. In reinforcement-learning workflows (GRPO/PPO), SFT serves as an optional warmup step before the RL phase.
Theoretical Basis
The SFT objective minimizes cross-entropy loss over response tokens:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid x, y_{<t}\right)$$

where $x$ is the instruction/prompt, $y$ is the target response, and $\theta$ are the (LoRA) parameters being optimized.
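As a concrete check of the objective, here is a toy numeric example (the per-token probabilities are made up, not real model outputs): only response tokens contribute, and the loss is the mean negative log-probability assigned to each target token.

```python
import math

# Hypothetical p(y_t | x, y_<t) for three response tokens.
# Instruction tokens are masked out (label -100) and contribute nothing.
token_probs = [0.9, 0.5, 0.25]

loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
# A confidently-correct token (0.9) contributes little; an unlikely
# one (0.25) dominates the average.
```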
```python
# Abstract SFT training loop
for batch in dataloader:
    input_ids = batch["input_ids"]
    labels = batch["labels"]  # -100 for instruction tokens (masked)
    outputs = model(input_ids)
    # Shift so each position predicts the next token (causal LM)
    logits = outputs.logits[:, :-1, :]
    targets = labels[:, 1:]
    loss = cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
The key optimization in Unsloth is the chunked cross-entropy, which processes logits in chunks rather than materializing the full `[batch, seq_len, vocab_size]` tensor.
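The chunking idea can be illustrated with a scalar pure-Python sketch (not the fused Triton kernel): the loss is accumulated a few positions at a time, so log-probabilities never need to exist for every position simultaneously, and the result matches the unchunked computation exactly.

```python
import math

def softmax_ce(logits_row, target):
    """Cross-entropy at one position: -log softmax(logits)[target]."""
    m = max(logits_row)  # max-subtraction for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_z - logits_row[target]

def chunked_cross_entropy(logits, labels, chunk_size=2, ignore_index=-100):
    """Mean CE over non-masked positions, visiting `chunk_size`
    positions at a time instead of the whole sequence at once."""
    total, count = 0.0, 0
    for start in range(0, len(logits), chunk_size):
        for row, label in zip(logits[start:start + chunk_size],
                              labels[start:start + chunk_size]):
            if label == ignore_index:
                continue  # masked instruction token
            total += softmax_ce(row, label)
            count += 1
    return total / count
```

Because the per-position losses are independent, any chunk size yields the same value; the real kernel exploits this to keep only one chunk of logits live in memory at a time.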