
Principle:Huggingface Alignment handbook Multi Task SFT Training

From Leeroopedia


Knowledge Sources
Domains NLP, Deep_Learning, Training
Last Updated 2026-02-07 00:00 GMT

Overview

An advanced supervised fine-tuning approach that trains on a diverse mixture of many task-specific datasets with assistant-only loss masking and thinking mode support.

Description

Multi-Task SFT Training extends standard SFT by training on a large number of diverse datasets simultaneously, using weighted mixing to control the contribution of each task. In the alignment-handbook's SmolLM3 pipeline, the SFT stage trains on 25 different dataset splits covering:

  • General instruction following (conversations, creative writing)
  • Mathematical reasoning (with and without chain-of-thought)
  • Code generation and debugging
  • Structured output (JSON, function calling)
  • Long-context tasks (summarization, document analysis)
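The weighted mixing over these splits can be sketched as proportional sampling over split names. This is an illustrative sketch: the `mixture` helper and the three-split weights below are hypothetical (the real configuration lists 25 splits).

```python
import random

def mixture(splits, n_samples, seed=0):
    """Draw a training order by sampling split names in proportion
    to their weights (weights need not sum to 1)."""
    rng = random.Random(seed)
    names = [name for name, _ in splits]
    weights = [weight for _, weight in splits]
    return rng.choices(names, weights=weights, k=n_samples)

# Hypothetical three-split mixture; a higher weight means a split
# contributes proportionally more examples to the training stream.
sampled = mixture([("math", 0.5), ("code", 0.3), ("chat", 0.2)], n_samples=1000)
```

With these weights, roughly half of the 1,000 sampled examples come from the "math" split, so its gradient contribution dominates in the same proportion.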

Key advanced features beyond standard SFT:

  • Assistant-only loss: Loss is computed only on assistant response tokens, not on user prompts or system messages, improving training signal quality
  • Thinking modes: Datasets are annotated with think or no_think suffixes, and a custom chat template handles <|thinking|> tokens for chain-of-thought reasoning
  • First-Fit-Decreasing packing: An advanced packing strategy (packing_strategy: ffd) that minimizes wasted padding by fitting sequences efficiently into fixed-length bins
  • Very long context: a 65,536-token maximum sequence length to handle extended reasoning chains
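The assistant-only loss can be illustrated by masking label positions: a common convention (used by PyTorch-style cross-entropy) sets ignored labels to -100. A minimal sketch, assuming a per-token role annotation; the token IDs and roles below are hypothetical, and real trainers derive the mask from the chat template rather than an explicit role list:

```python
IGNORE_INDEX = -100  # conventional ignore index for cross-entropy loss

def mask_non_assistant(token_ids, roles):
    """Return training labels in which only assistant-role tokens
    contribute to the loss; all other positions are masked out."""
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

# Hypothetical mini-example: system + user prompt, then assistant reply.
tokens = [101, 7592, 2088, 102, 3449, 2003, 999]
roles  = ["system", "user", "user", "user", "assistant", "assistant", "assistant"]
labels = mask_non_assistant(tokens, roles)
# labels -> [-100, -100, -100, -100, 3449, 2003, 999]
```

Only the last three positions (the assistant reply) produce gradient signal; the prompt tokens are still visible as context but never penalized.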

Usage

Use multi-task SFT when:

  • Training a general-purpose assistant model with diverse capabilities
  • Many task-specific datasets need to be combined with different weights
  • Chain-of-thought reasoning modes need to be supported (think/no_think)
  • The model needs to handle very long contexts (65k+ tokens)
  • Assistant-only loss masking is desired for cleaner training signal

Theoretical Basis

Multi-task SFT extends the standard SFT loss with selective masking:

\mathcal{L}_{\text{MT-SFT}} = -\sum_{t \in \text{assistant tokens}} \log P_\theta(x_t \mid x_{<t})
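A worked numeric check of this masked objective, assuming hypothetical per-token model probabilities; only the assistant positions contribute to the mean negative log-likelihood:

```python
import math

def masked_nll(token_probs, is_assistant):
    """Mean negative log-likelihood over assistant tokens only (sketch)."""
    losses = [-math.log(p) for p, a in zip(token_probs, is_assistant) if a]
    return sum(losses) / len(losses)

# Hypothetical probabilities for a 5-token sequence; the first two
# tokens belong to the user prompt and are excluded from the loss.
probs = [0.1, 0.2, 0.5, 0.5, 0.25]
mask  = [False, False, True, True, True]
loss = masked_nll(probs, mask)
# (ln 2 + ln 2 + 2 ln 2) / 3 = 4 ln 2 / 3, so loss ~= 0.924
```

Note that the low-probability prompt tokens (0.1, 0.2) do not inflate the loss, which is the point of the masking.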

# Abstract multi-task SFT flow (NOT real implementation)
# Key innovation: assistant_only_loss + thinking modes

# 1. Load 25 dataset splits with varying weights
dataset = mixture([            # (split_name, weight) pairs
    ("conversations_think", 0.30),
    ("magpie_ultra_think", 0.15),
    ("math_no_think", 0.50),
    # ... 22 more splits
])

# 2. Apply chat template with thinking mode support
# Template checks chat_template_kwargs for enable_thinking flag
# If thinking: wraps assistant reasoning in <|thinking|>...</|thinking|> tags

# 3. Train with assistant-only loss
# Only compute loss on tokens generated by the assistant role
# User prompts, system messages, and special tokens are masked

# 4. Use FFD packing for efficient GPU utilization
# First-Fit-Decreasing bin-packing minimizes wasted padding
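Step 4 above, First-Fit-Decreasing packing, can be sketched as classic bin-packing over sequence lengths. This is an illustrative sketch of the algorithm itself, not the trainer's actual packing code:

```python
def ffd_pack(seq_lengths, max_len):
    """Pack sequence lengths into bins of capacity max_len using
    First-Fit-Decreasing: sort longest-first, place each sequence in
    the first bin with room, and open a new bin otherwise."""
    bins = []   # each bin is a list of sequence lengths
    free = []   # remaining capacity of the corresponding bin
    for length in sorted(seq_lengths, reverse=True):
        for i, cap in enumerate(free):
            if length <= cap:
                bins[i].append(length)
                free[i] -= length
                break
        else:  # no existing bin had room
            bins.append([length])
            free.append(max_len - length)
    return bins

# Hypothetical sequence lengths packed into bins of 10 tokens.
packed = ffd_pack([7, 5, 4, 3, 2, 2], max_len=10)
# packed -> [[7, 3], [5, 4], [2, 2]]
```

With a 10-token capacity, six sequences totalling 23 tokens fit into three bins, wasting only 7 padding tokens; naive one-sequence-per-bin padding would waste 37.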

The weight distribution across 25 splits controls the model's capability balance. Higher weights for reasoning datasets produce stronger reasoning; higher weights for conversation datasets produce more natural dialogue.
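The thinking-mode toggle from step 2 of the flow above can be sketched as a small template function. The <|thinking|> tags follow this article's convention; the function itself is a hypothetical stand-in for the actual chat template:

```python
def render_assistant(reasoning, answer, enable_thinking):
    """Render an assistant turn, wrapping chain-of-thought in thinking
    tags when the enable_thinking flag (cf. chat_template_kwargs) is set."""
    if enable_thinking and reasoning:
        return f"<|thinking|>{reasoning}</|thinking|>{answer}"
    return answer

# Hypothetical turn: same answer, with and without visible reasoning.
with_cot = render_assistant("add the units digits", "4", enable_thinking=True)
no_cot   = render_assistant("add the units digits", "4", enable_thinking=False)
```

Annotating each split as think or no_think lets one model learn both behaviors, with the flag selecting between them at inference time.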

Related Pages

Implemented By
