Principle: Unsloth Supervised Fine-Tuning (unslothai)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A training paradigm that optimizes a language model on curated input-output pairs using standard cross-entropy loss, teaching the model to generate desired responses given specific instructions or prompts.
Description
Supervised Fine-Tuning (SFT) is the most common method for adapting pretrained language models to specific tasks. The model is trained to minimize the negative log-likelihood of target tokens given the input context. In the instruction-tuning variant, the model learns to follow instructions by training on (instruction, response) pairs.
Key aspects of SFT in the Unsloth context:
- Fused Cross-Entropy: Unsloth replaces the standard cross-entropy computation with a chunked, fused Triton kernel that avoids materializing the full logits tensor, reducing peak memory by up to 50%.
- Padding-Free Batching: Sequences are packed without padding tokens to maximize GPU utilization, with position IDs and attention masks adjusted to prevent cross-contamination between packed sequences.
- Separate Embedding Learning Rate: When embed_tokens and lm_head are included in modules_to_save, a separate (typically lower) learning rate is applied to prevent catastrophic forgetting of the embedding layer.
- TRL Compatibility: Unsloth patches TRL's SFTTrainer to use its optimized training loop while maintaining full API compatibility.
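The padding-free batching idea above can be sketched in plain Python (a simplified illustration, not Unsloth's actual implementation; the function name is ours): sequences are concatenated into one row, position IDs restart at every boundary, and a sequence-ID vector records where each original sequence lives so a block-diagonal attention mask can be built from it.

```python
def pack_sequences(sequences):
    """Pack variable-length sequences into one row without padding.

    Returns the concatenated token IDs, per-token position IDs that
    restart at each sequence boundary, and a sequence-ID vector from
    which a block-diagonal attention mask can be derived.
    """
    input_ids, position_ids, sequence_ids = [], [], []
    for seq_idx, seq in enumerate(sequences):
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))       # positions restart at 0
        sequence_ids.extend([seq_idx] * len(seq))  # marks sequence membership
    return input_ids, position_ids, sequence_ids

# Two sequences packed into a single row, with no pad tokens
ids, pos, seq = pack_sequences([[101, 7, 8], [101, 9]])
```

Attention is then restricted so a token only attends to tokens with the same sequence ID, which prevents cross-contamination between packed sequences.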
Usage
Use this principle after model loading and LoRA injection, when training on conversational or instruction datasets. It is the standard training method for QLoRA fine-tuning. In reinforcement-learning workflows (GRPO/PPO), SFT serves as an optional warmup step before the RL phase.
Theoretical Basis
The SFT objective minimizes cross-entropy loss over response tokens:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid x, y_{<t}\right)$$

where $x$ is the instruction/prompt, $y$ is the target response, and $\theta$ are the (LoRA) parameters being optimized.
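As a concrete check of the objective, here is a toy numeric example (the per-token probabilities are made up, not real model outputs): only response tokens contribute, and the loss is the mean negative log-probability assigned to each target token.

```python
import math

# Hypothetical p(y_t | x, y_<t) for three response tokens.
# Instruction tokens are masked out (label -100) and contribute nothing.
token_probs = [0.9, 0.5, 0.25]

loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
# A confidently-correct token (0.9) contributes little; an unlikely
# one (0.25) dominates the average.
```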
```python
# Abstract SFT training loop
for batch in dataloader:
    input_ids = batch["input_ids"]
    labels = batch["labels"]  # -100 for instruction tokens (masked)
    outputs = model(input_ids)
    # Shift so each position predicts the next token (causal LM)
    logits = outputs.logits[:, :-1, :]
    targets = labels[:, 1:]
    loss = cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
The key optimization in Unsloth is the chunked cross-entropy, which processes logits in chunks rather than materializing the full `[batch, seq_len, vocab_size]` tensor.
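The chunking idea can be illustrated with a scalar pure-Python sketch (not the fused Triton kernel): the loss is accumulated a few positions at a time, so log-probabilities never need to exist for every position simultaneously, and the result matches the unchunked computation exactly.

```python
import math

def softmax_ce(logits_row, target):
    """Cross-entropy at one position: -log softmax(logits)[target]."""
    m = max(logits_row)  # max-subtraction for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_z - logits_row[target]

def chunked_cross_entropy(logits, labels, chunk_size=2, ignore_index=-100):
    """Mean CE over non-masked positions, visiting `chunk_size`
    positions at a time instead of the whole sequence at once."""
    total, count = 0.0, 0
    for start in range(0, len(logits), chunk_size):
        for row, label in zip(logits[start:start + chunk_size],
                              labels[start:start + chunk_size]):
            if label == ignore_index:
                continue  # masked instruction token
            total += softmax_ce(row, label)
            count += 1
    return total / count
```

Because the per-position losses are independent, any chunk size yields the same value; the real kernel exploits this to keep only one chunk of logits live in memory at a time.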