Principle:Huggingface Alignment handbook Reasoning Mid Training

From Leeroopedia


Knowledge Sources
Domains NLP, Deep_Learning, Training
Last Updated 2026-02-07 00:00 GMT

Overview

A continued pretraining stage that enhances a base language model's reasoning capabilities by training on curated reasoning-heavy datasets before instruction fine-tuning.

Description

Reasoning Mid-Training is an intermediate training stage between base pretraining and supervised fine-tuning. It exposes the model to high-quality reasoning data (chain-of-thought solutions, mathematical proofs, code reasoning) to strengthen the model's ability to perform multi-step reasoning before it learns to follow conversational instructions.

In the alignment-handbook's SmolLM3 pipeline, mid-training uses a mixture of two reasoning datasets (Llama_Nemotron and OpenThoughts3) with sequence packing at 32k context length. This stage bridges the gap between a general-purpose base model and a reasoning-capable instruction follower.

The key insight is that reasoning capabilities benefit from dedicated training on reasoning data before the model is shaped by instruction-following objectives. Mid-training builds the reasoning capability, while SFT teaches the model to apply that capability in conversational format.
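The two-dataset mixture described above can be sketched as weighted sampling across corpora. This is a minimal illustration, not the handbook's loader: the in-memory lists stand in for streamed Hugging Face datasets, and the 50/50 weights are placeholder assumptions rather than the recipe's actual proportions.

```python
import random

def mix_datasets(datasets, weights, n_samples, seed=0):
    """Sample examples from several corpora in proportion to the given weights.

    `datasets` maps name -> list of examples; `weights` maps name -> relative
    sampling weight. A real pipeline would stream from Hugging Face datasets;
    plain lists stand in for them here.
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[n] for n in names]
    mixture = []
    for _ in range(n_samples):
        name = rng.choices(names, weights=probs, k=1)[0]  # pick a corpus by weight
        mixture.append(rng.choice(datasets[name]))        # then a sample from it
    return mixture

# Illustrative stand-ins for the two reasoning corpora
corpora = {
    "Llama_Nemotron": [{"text": f"nemotron-{i}"} for i in range(100)],
    "OpenThoughts3": [{"text": f"openthoughts-{i}"} for i in range(100)],
}
mixed = mix_datasets(corpora, {"Llama_Nemotron": 0.5, "OpenThoughts3": 0.5}, n_samples=10)
```

In practice the mixing is handled by the training framework's dataset-mixer configuration; the point here is only that mid-training draws from a reasoning-heavy blend rather than a single source.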

Usage

Use reasoning mid-training when:

  • Building a model with strong reasoning capabilities (math, code, logical reasoning)
  • The base model needs additional domain exposure before instruction tuning
  • A multi-stage training pipeline is being used (mid-training → SFT → DPO)
  • Long-context reasoning data is available (32k+ tokens)
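When these conditions hold, the stage is typically described by a recipe config. The sketch below mirrors common alignment-handbook recipe fields as a plain Python dict; the checkpoint path and mixture weights are illustrative assumptions, not the handbook's exact values.

```python
# Hypothetical mid-training recipe, mirroring alignment-handbook config fields.
# Checkpoint path and mixture weights are placeholders, not the real recipe values.
mid_training_config = {
    "model_name_or_path": "HuggingFaceTB/SmolLM3-3B-Base",  # assumed base checkpoint
    "trust_remote_code": True,   # needed for the custom SmolLM3 architecture
    "dataset_mixer": {           # reasoning-heavy mixture (placeholder weights)
        "Llama_Nemotron": 0.5,
        "OpenThoughts3": 0.5,
    },
    "max_seq_length": 32768,     # long-context reasoning chains
    "packing": True,             # pack varied-length examples into full windows
    "num_train_epochs": 5,       # multiple passes over the mixture
    "use_liger_kernel": True,    # fused ops for memory-efficient long sequences
}
```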

Theoretical Basis

Mid-training uses the same SFT objective (next-token prediction) but on reasoning-specific data:

# Abstract mid-training flow (NOT real implementation)
# Stage position: Base Model → [Mid-Training] → SFT → DPO

# Key differences from standard SFT:
# 1. Data: Reasoning-heavy datasets (not conversational instructions)
# 2. Context: Longer sequences (32k vs 4k for standard SFT)
# 3. Packing: Enabled to maximize GPU utilization on varied-length reasoning chains
# 4. Multiple epochs: 5 epochs (vs 1-2 for standard SFT)

model = load(base_model)  # e.g., SmolLM3-3B-Base
dataset = mixture([Llama_Nemotron, OpenThoughts3])
train(model, dataset, max_length=32768, packing=True, epochs=5)
# Output: mid-trained model with enhanced reasoning

Key training features for mid-training:

  • Sequence packing: Packs multiple reasoning examples into 32k-token sequences
  • Liger kernel: Fused operations for memory-efficient long-sequence training
  • trust_remote_code: Required for custom SmolLM3 architecture
  • Multi-epoch: 5 epochs to deeply learn reasoning patterns
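Sequence packing itself can be sketched as greedy concatenation of tokenized examples into fixed-length windows. This minimal version (token-id lists stand in for real tokenizer output) ignores details a production packer handles, such as cross-example attention masking and document-boundary tokens.

```python
def pack_sequences(examples, max_length):
    """Greedily concatenate tokenized examples into windows of at most `max_length` tokens.

    `examples` is a list of token-id lists. Each window holds as many whole
    examples as fit; examples longer than `max_length` are truncated.
    """
    windows, current = [], []
    for tokens in examples:
        tokens = tokens[:max_length]                  # truncate oversized examples
        if len(current) + len(tokens) > max_length:
            windows.append(current)                   # window full: start a new one
            current = []
        current.extend(tokens)
    if current:
        windows.append(current)                       # flush the last partial window
    return windows

# Toy reasoning chains of varied length, packed into 16-token windows
chains = [[1] * 5, [2] * 7, [3] * 3, [4] * 10, [5] * 2]
packed = pack_sequences(chains, max_length=16)
```

With varied-length reasoning chains, packing keeps each 32k window nearly full instead of padding short examples, which is what makes long-context mid-training GPU-efficient.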
