Principle: Reasoning Mid-Training (Hugging Face Alignment Handbook)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Training |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A continued pretraining stage that enhances a base language model's reasoning capabilities by training on curated reasoning-heavy datasets before instruction fine-tuning.
Description
Reasoning Mid-Training is an intermediate training stage between base pretraining and supervised fine-tuning. It exposes the model to high-quality reasoning data (chain-of-thought solutions, mathematical proofs, code reasoning) to strengthen its ability to perform multi-step reasoning before it learns to follow conversational instructions.
In the alignment-handbook's SmolLM3 pipeline, mid-training uses a mixture of two reasoning datasets (Llama_Nemotron and OpenThoughts3) with sequence packing at 32k context length. This stage bridges the gap between a general-purpose base model and a reasoning-capable instruction follower.
The key insight is that reasoning capabilities benefit from dedicated training on reasoning data before the model is shaped by instruction-following objectives. Mid-training builds the reasoning capability, while SFT teaches the model to apply that capability in conversational format.
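The two-dataset mixture mentioned above can be illustrated with a toy interleaver. The dataset names come from the source; the `mix_datasets` helper and the 1:1 mixing weights are hypothetical stand-ins for whatever mixture logic the real pipeline uses.

```python
def mix_datasets(datasets, weights):
    """Deterministically interleave examples from several datasets.

    `datasets` maps a name to an iterable of examples; `weights` gives
    how many examples to draw from each dataset per round. This is a
    toy sketch, not the handbook's actual data-loading code.
    """
    iterators = {name: iter(ds) for name, ds in datasets.items()}
    mixed = []
    active = set(iterators)
    while active:
        for name in list(active):
            for _ in range(weights[name]):
                try:
                    mixed.append(next(iterators[name]))
                except StopIteration:
                    active.discard(name)
                    break
    return mixed

# Toy stand-ins for the two reasoning datasets named in the pipeline.
llama_nemotron = [{"source": "Llama_Nemotron", "id": i} for i in range(3)]
openthoughts3 = [{"source": "OpenThoughts3", "id": i} for i in range(3)]

mixture = mix_datasets(
    {"Llama_Nemotron": llama_nemotron, "OpenThoughts3": openthoughts3},
    weights={"Llama_Nemotron": 1, "OpenThoughts3": 1},
)
```

With equal weights the two sources alternate example by example; a real pipeline would typically sample proportionally to dataset size or a tuned mixture ratio instead.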
Usage
Use reasoning mid-training when:
- Building a model with strong reasoning capabilities (math, code, logical reasoning)
- The base model needs additional domain exposure before instruction tuning
- A multi-stage training pipeline is being used (mid-training → SFT → DPO)
- Long-context reasoning data is available (32k+ tokens)
Theoretical Basis
Mid-training uses the same SFT objective (next-token prediction) but on reasoning-specific data:
```python
# Abstract mid-training flow (NOT real implementation)
# Stage position: Base Model → [Mid-Training] → SFT → DPO
# Key differences from standard SFT:
# 1. Data: Reasoning-heavy datasets (not conversational instructions)
# 2. Context: Longer sequences (32k vs 4k for standard SFT)
# 3. Packing: Enabled to maximize GPU utilization on varied-length reasoning chains
# 4. Multiple epochs: 5 epochs (vs 1-2 for standard SFT)
model = load(base_model)  # e.g., SmolLM3-3B-Base
dataset = mixture([Llama_Nemotron, OpenThoughts3])
train(model, dataset, max_length=32768, packing=True, epochs=5)
# Output: mid-trained model with enhanced reasoning
```
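Since mid-training keeps the standard SFT objective, the training targets are simply the input sequence shifted by one position. A minimal sketch of that label construction (the token IDs are made up):

```python
def next_token_targets(token_ids):
    """Build (input, target) pairs for next-token prediction:
    the model sees tokens [0..n-2] and must predict tokens [1..n-1]."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets

# Toy token sequence standing in for a tokenized reasoning chain.
sequence = [101, 7, 42, 13, 102]
inputs, targets = next_token_targets(sequence)
# inputs  -> [101, 7, 42, 13]
# targets -> [7, 42, 13, 102]
```

The cross-entropy loss is then computed between the model's prediction at each position and the corresponding target token, exactly as in standard SFT; only the data differs.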
Key training features for mid-training:
- Sequence packing: Packs multiple reasoning examples into 32k-token sequences
- Liger kernel: Fused operations for memory-efficient long-sequence training
- trust_remote_code: Required for custom SmolLM3 architecture
- Multi-epoch: 5 epochs to deeply learn reasoning patterns
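Sequence packing, the first feature above, can be illustrated with a greedy packer that concatenates whole examples until the context budget is reached. The 32,768-token default matches the source; the function itself is a simplified sketch (real packers also insert document separators and adjust attention masks so packed examples do not attend to each other).

```python
def pack_sequences(examples, max_length=32768):
    """Greedily concatenate tokenized examples into fixed-budget blocks.

    Each block holds whole examples whose combined length fits within
    max_length; an example longer than the budget gets its own block
    (a real packer would truncate or split it).
    """
    blocks, current, current_len = [], [], 0
    for tokens in examples:
        if current and current_len + len(tokens) > max_length:
            blocks.append(current)
            current, current_len = [], 0
        current.extend(tokens)
        current_len += len(tokens)
    if current:
        blocks.append(current)
    return blocks

# Toy examples with a small budget to make the packing visible.
examples = [[1] * 6, [2] * 5, [3] * 4, [4] * 10]
blocks = pack_sequences(examples, max_length=10)
# Block lengths: 6, 9 (5 + 4), 10
```

Packing matters most when example lengths vary widely, as reasoning chains do: without it, short examples waste most of a 32k-token window on padding.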