# Principle: Hugging Face Alignment Handbook Multi-Task SFT Training
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Training |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
An advanced supervised fine-tuning approach that trains on a diverse mixture of many task-specific datasets with assistant-only loss masking and thinking mode support.
## Description
Multi-Task SFT Training extends standard SFT by training on a large number of diverse datasets simultaneously, using weighted mixing to control the contribution of each task. In the alignment-handbook's SmolLM3 pipeline, the SFT stage trains on 25 different dataset splits covering:
- General instruction following (conversations, creative writing)
- Mathematical reasoning (with and without chain-of-thought)
- Code generation and debugging
- Structured output (JSON, function calling)
- Long-context tasks (summarization, document analysis)
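Weighted mixing can be pictured as sampling a split for each training example in proportion to its weight. The split names and weights below are hypothetical stand-ins for illustration, not the actual SmolLM3 mixture:

```python
import random

# Hypothetical split names and weights (illustrative, not the real mixture).
mixture = [
    ("conversations_think", 0.3),
    ("magpie_ultra_think", 0.15),
    ("math_no_think", 0.5),
    ("json_mode_no_think", 0.05),
]

def sample_split(mixture, rng):
    """Pick a dataset split with probability proportional to its weight."""
    names, weights = zip(*mixture)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name, _ in mixture}
for _ in range(10_000):
    counts[sample_split(mixture, rng)] += 1
# Splits with larger weights contribute proportionally more examples,
# e.g. math_no_think (0.5) should appear roughly ten times as often
# as json_mode_no_think (0.05).
```

In practice the mixing is handled by the training framework's dataset configuration rather than hand-rolled sampling, but the proportional behavior is the same.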
Key advanced features beyond standard SFT:
- Assistant-only loss: Loss is computed only on assistant response tokens, not on user prompts or system messages, improving training signal quality
- Thinking modes: Datasets are annotated with think or no_think suffixes, and a custom chat template handles <|thinking|> tokens for chain-of-thought reasoning
- First-Fit-Decreasing packing: An advanced packing strategy (packing_strategy: ffd) that minimizes wasted padding by fitting sequences efficiently into fixed-length bins
- Very long context: a 65,536-token maximum sequence length to handle extended reasoning chains
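The First-Fit-Decreasing strategy can be sketched as a plain bin-packing routine over sequence lengths: sort descending, then place each sequence into the first bin with room, opening a new bin when none fits. The `ffd_pack` helper and toy lengths below are illustrative, not the handbook's actual packer:

```python
def ffd_pack(lengths, capacity):
    """First-Fit-Decreasing: sort sequence lengths descending, then place each
    into the first bin with enough remaining space, opening a new bin if none fits."""
    bins = []       # each bin is a list of sequence lengths
    remaining = []  # free space left in each bin
    for length in sorted(lengths, reverse=True):
        for i, free in enumerate(remaining):
            if length <= free:
                bins[i].append(length)
                remaining[i] -= length
                break
        else:
            bins.append([length])
            remaining.append(capacity - length)
    return bins

# Pack toy sequence lengths into bins of capacity 16 tokens.
packed = ffd_pack([9, 7, 6, 5, 4, 3, 2], capacity=16)
# → [[9, 7], [6, 5, 4], [3, 2]]: 36 tokens in 3 bins, only 12 padding tokens wasted.
```

Compared with naive sequential packing, placing the longest sequences first leaves small gaps that short sequences can later fill, which is why FFD reduces padding waste.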
## Usage
Use multi-task SFT when:
- Training a general-purpose assistant model with diverse capabilities
- Many task-specific datasets need to be combined with different weights
- Chain-of-thought reasoning modes need to be supported (think/no_think)
- The model needs to handle very long contexts (65k+ tokens)
- Assistant-only loss masking is desired for cleaner training signal
## Theoretical Basis
Multi-task SFT extends the standard SFT loss with selective masking:
```python
# Abstract multi-task SFT flow (NOT the real implementation)
# Key ideas: assistant-only loss + thinking modes

# 1. Load 25 dataset splits with varying weights
dataset = mixture([
    ("conversations_think", 0.3),
    ("magpie_ultra_think", 0.15),
    ("math_no_think", 0.5),
    # ... 22 more splits
])

# 2. Apply the chat template with thinking-mode support
#    The template checks chat_template_kwargs for an enable_thinking flag;
#    when enabled, assistant reasoning is wrapped in <|thinking|>...</|thinking|> tags.

# 3. Train with assistant-only loss
#    Loss is computed only on tokens generated by the assistant role;
#    user prompts, system messages, and special tokens are masked out.

# 4. Pack with FFD for efficient GPU utilization
#    First-Fit-Decreasing bin-packing minimizes wasted padding.
```
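Step 3, assistant-only loss, is typically implemented by setting the label at every non-assistant position to -100, the index that cross-entropy loss ignores. The token ids and role annotations in this sketch are made up for illustration:

```python
IGNORE_INDEX = -100  # label value that cross-entropy loss skips

# Toy tokenized conversation: (token_id, role) pairs (illustrative only).
tokens = [
    (101, "system"), (102, "system"),
    (201, "user"), (202, "user"), (203, "user"),
    (301, "assistant"), (302, "assistant"), (303, "assistant"),
]

def mask_labels(tokens):
    """Keep labels only at assistant-token positions; mask everything else
    with IGNORE_INDEX so loss is computed on assistant responses alone."""
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in tokens]

labels = mask_labels(tokens)
# → [-100, -100, -100, -100, -100, 301, 302, 303]
```

The model still attends to the masked prompt tokens as context; masking only removes them from the loss, so gradients reflect the quality of assistant responses rather than the model's ability to reproduce user text.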
The weight distribution across the 25 splits controls the model's capability balance: higher weights on reasoning datasets produce stronger reasoning, while higher weights on conversation datasets produce more natural dialogue.
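As a minimal sanity check of how weights translate into example proportions (split names and weights here are hypothetical):

```python
# Hypothetical split weights, grouped by capability (illustrative numbers).
weights = {
    "math_no_think": 0.5,        # reasoning
    "conversations_think": 0.3,  # dialogue
    "magpie_ultra_think": 0.15,  # dialogue
    "json_mode_no_think": 0.05,  # structured output
}

total = sum(weights.values())
proportions = {name: w / total for name, w in weights.items()}
# Raising math_no_think relative to the dialogue splits shifts more of the
# training signal toward reasoning at the expense of conversational style.
```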