# Principle: Hugging Face Alignment Handbook Multi-Task SFT Training
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Training |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
An advanced supervised fine-tuning approach that trains on a diverse mixture of many task-specific datasets with assistant-only loss masking and thinking mode support.
## Description
Multi-Task SFT Training extends standard SFT by training on a large number of diverse datasets simultaneously, using weighted mixing to control the contribution of each task. In the alignment-handbook's SmolLM3 pipeline, the SFT stage trains on 25 different dataset splits covering:
- General instruction following (conversations, creative writing)
- Mathematical reasoning (with and without chain-of-thought)
- Code generation and debugging
- Structured output (JSON, function calling)
- Long-context tasks (summarization, document analysis)
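Weighted mixing can be pictured as sampling a split for each training example in proportion to its weight. The split names and weights below are hypothetical stand-ins for illustration, not the actual SmolLM3 mixture:

```python
import random

# Hypothetical split names and weights (illustrative, not the real mixture).
mixture = [
    ("conversations_think", 0.3),
    ("magpie_ultra_think", 0.15),
    ("math_no_think", 0.5),
    ("json_mode_no_think", 0.05),
]

def sample_split(mixture, rng):
    """Pick a dataset split with probability proportional to its weight."""
    names, weights = zip(*mixture)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name, _ in mixture}
for _ in range(10_000):
    counts[sample_split(mixture, rng)] += 1
# Splits with larger weights contribute proportionally more examples,
# e.g. math_no_think (0.5) should appear roughly ten times as often
# as json_mode_no_think (0.05).
```

In practice the mixing is handled by the training framework's dataset configuration rather than hand-rolled sampling, but the proportional behavior is the same.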
Key advanced features beyond standard SFT:
- Assistant-only loss: Loss is computed only on assistant response tokens, not on user prompts or system messages, improving training signal quality
- Thinking modes: Datasets are annotated with think or no_think suffixes, and a custom chat template handles <|thinking|> tokens for chain-of-thought reasoning
- First-Fit-Decreasing packing: An advanced packing strategy (packing_strategy: ffd) that minimizes wasted padding by fitting sequences efficiently into fixed-length bins
- Very long context: a 65,536-token maximum sequence length to handle extended reasoning chains
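The First-Fit-Decreasing strategy can be sketched as a plain bin-packing routine over sequence lengths: sort descending, then place each sequence into the first bin with room, opening a new bin when none fits. The `ffd_pack` helper and toy lengths below are illustrative, not the handbook's actual packer:

```python
def ffd_pack(lengths, capacity):
    """First-Fit-Decreasing: sort sequence lengths descending, then place each
    into the first bin with enough remaining space, opening a new bin if none fits."""
    bins = []       # each bin is a list of sequence lengths
    remaining = []  # free space left in each bin
    for length in sorted(lengths, reverse=True):
        for i, free in enumerate(remaining):
            if length <= free:
                bins[i].append(length)
                remaining[i] -= length
                break
        else:
            bins.append([length])
            remaining.append(capacity - length)
    return bins

# Pack toy sequence lengths into bins of capacity 16 tokens.
packed = ffd_pack([9, 7, 6, 5, 4, 3, 2], capacity=16)
# → [[9, 7], [6, 5, 4], [3, 2]]: 36 tokens in 3 bins, only 12 padding tokens wasted.
```

Compared with naive sequential packing, placing the longest sequences first leaves small gaps that short sequences can later fill, which is why FFD reduces padding waste.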
## Usage
Use multi-task SFT when:
- Training a general-purpose assistant model with diverse capabilities
- Many task-specific datasets need to be combined with different weights
- Chain-of-thought reasoning modes need to be supported (think/no_think)
- The model needs to handle very long contexts (65k+ tokens)
- Assistant-only loss masking is desired for cleaner training signal
## Theoretical Basis
Multi-task SFT extends the standard SFT loss with selective masking:
```python
# Abstract multi-task SFT flow (NOT the real implementation)
# Key ideas: assistant-only loss + thinking modes

# 1. Load 25 dataset splits with varying weights
dataset = mixture([
    ("conversations_think", 0.3),
    ("magpie_ultra_think", 0.15),
    ("math_no_think", 0.5),
    # ... 22 more splits
])

# 2. Apply the chat template with thinking-mode support
#    The template checks chat_template_kwargs for an enable_thinking flag;
#    when enabled, assistant reasoning is wrapped in <|thinking|>...</|thinking|> tags.

# 3. Train with assistant-only loss
#    Loss is computed only on tokens generated by the assistant role;
#    user prompts, system messages, and special tokens are masked out.

# 4. Pack with FFD for efficient GPU utilization
#    First-Fit-Decreasing bin-packing minimizes wasted padding.
```
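Step 3, assistant-only loss, is typically implemented by setting the label at every non-assistant position to -100, the index that cross-entropy loss ignores. The token ids and role annotations in this sketch are made up for illustration:

```python
IGNORE_INDEX = -100  # label value that cross-entropy loss skips

# Toy tokenized conversation: (token_id, role) pairs (illustrative only).
tokens = [
    (101, "system"), (102, "system"),
    (201, "user"), (202, "user"), (203, "user"),
    (301, "assistant"), (302, "assistant"), (303, "assistant"),
]

def mask_labels(tokens):
    """Keep labels only at assistant-token positions; mask everything else
    with IGNORE_INDEX so loss is computed on assistant responses alone."""
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in tokens]

labels = mask_labels(tokens)
# → [-100, -100, -100, -100, -100, 301, 302, 303]
```

The model still attends to the masked prompt tokens as context; masking only removes them from the loss, so gradients reflect the quality of assistant responses rather than the model's ability to reproduce user text.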
The weight distribution across the 25 splits controls the model's capability balance: higher weights on reasoning datasets produce stronger reasoning, while higher weights on conversation datasets produce more natural dialogue.
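As a minimal sanity check of how weights translate into example proportions (split names and weights here are hypothetical):

```python
# Hypothetical split weights, grouped by capability (illustrative numbers).
weights = {
    "math_no_think": 0.5,        # reasoning
    "conversations_think": 0.3,  # dialogue
    "magpie_ultra_think": 0.15,  # dialogue
    "json_mode_no_think": 0.05,  # structured output
}

total = sum(weights.values())
proportions = {name: w / total for name, w in weights.items()}
# Raising math_no_think relative to the dialogue splits shifts more of the
# training signal toward reasoning at the expense of conversational style.
```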