Principle:Huggingface Alignment handbook Reasoning Mid Training

From Leeroopedia


Knowledge Sources
Domains NLP, Deep_Learning, Training
Last Updated 2026-02-07 00:00 GMT

Overview

A continued pretraining stage that enhances a base language model's reasoning capabilities by training on curated reasoning-heavy datasets before instruction fine-tuning.

Description

Reasoning Mid-Training is an intermediate training stage between base pretraining and supervised fine-tuning. It exposes the model to high-quality reasoning data (chain-of-thought solutions, mathematical proofs, code reasoning) to strengthen the model's ability to perform multi-step reasoning before it learns to follow conversational instructions.

In the alignment-handbook's SmolLM3 pipeline, mid-training uses a mixture of two reasoning datasets (Llama_Nemotron and OpenThoughts3) with sequence packing at 32k context length. This stage bridges the gap between a general-purpose base model and a reasoning-capable instruction follower.

The key insight is that reasoning capabilities benefit from dedicated training on reasoning data before the model is shaped by instruction-following objectives. Mid-training builds the reasoning capability, while SFT teaches the model to apply that capability in conversational format.
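The two-dataset mixture described above can be sketched as weighted sampling across corpora. This is a minimal illustration, not the handbook's loader: the in-memory lists stand in for streamed Hugging Face datasets, and the 50/50 weights are placeholder assumptions rather than the recipe's actual proportions.

```python
import random

def mix_datasets(datasets, weights, n_samples, seed=0):
    """Sample examples from several corpora in proportion to the given weights.

    `datasets` maps name -> list of examples; `weights` maps name -> relative
    sampling weight. A real pipeline would stream from Hugging Face datasets;
    plain lists stand in for them here.
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[n] for n in names]
    mixture = []
    for _ in range(n_samples):
        name = rng.choices(names, weights=probs, k=1)[0]  # pick a corpus by weight
        mixture.append(rng.choice(datasets[name]))        # then a sample from it
    return mixture

# Illustrative stand-ins for the two reasoning corpora
corpora = {
    "Llama_Nemotron": [{"text": f"nemotron-{i}"} for i in range(100)],
    "OpenThoughts3": [{"text": f"openthoughts-{i}"} for i in range(100)],
}
mixed = mix_datasets(corpora, {"Llama_Nemotron": 0.5, "OpenThoughts3": 0.5}, n_samples=10)
```

In practice the mixing is handled by the training framework's dataset-mixer configuration; the point here is only that mid-training draws from a reasoning-heavy blend rather than a single source.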

Usage

Use reasoning mid-training when:

  • Building a model with strong reasoning capabilities (math, code, logical reasoning)
  • The base model needs additional domain exposure before instruction tuning
  • A multi-stage training pipeline is being used (mid-training → SFT → DPO)
  • Long-context reasoning data is available (32k+ tokens)
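When these conditions hold, the stage is typically described by a recipe config. The sketch below mirrors common alignment-handbook recipe fields as a plain Python dict; the checkpoint path and mixture weights are illustrative assumptions, not the handbook's exact values.

```python
# Hypothetical mid-training recipe, mirroring alignment-handbook config fields.
# Checkpoint path and mixture weights are placeholders, not the real recipe values.
mid_training_config = {
    "model_name_or_path": "HuggingFaceTB/SmolLM3-3B-Base",  # assumed base checkpoint
    "trust_remote_code": True,   # needed for the custom SmolLM3 architecture
    "dataset_mixer": {           # reasoning-heavy mixture (placeholder weights)
        "Llama_Nemotron": 0.5,
        "OpenThoughts3": 0.5,
    },
    "max_seq_length": 32768,     # long-context reasoning chains
    "packing": True,             # pack varied-length examples into full windows
    "num_train_epochs": 5,       # multiple passes over the mixture
    "use_liger_kernel": True,    # fused ops for memory-efficient long sequences
}
```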

Theoretical Basis

Mid-training uses the same SFT objective (next-token prediction) but on reasoning-specific data:

# Abstract mid-training flow (NOT real implementation)
# Stage position: Base Model → [Mid-Training] → SFT → DPO

# Key differences from standard SFT:
# 1. Data: Reasoning-heavy datasets (not conversational instructions)
# 2. Context: Longer sequences (32k vs 4k for standard SFT)
# 3. Packing: Enabled to maximize GPU utilization on varied-length reasoning chains
# 4. Multiple epochs: 5 epochs (vs 1-2 for standard SFT)

model = load(base_model)  # e.g., SmolLM3-3B-Base
dataset = mixture([Llama_Nemotron, OpenThoughts3])
train(model, dataset, max_length=32768, packing=True, epochs=5)
# Output: mid-trained model with enhanced reasoning

Key training features for mid-training:

  • Sequence packing: Packs multiple reasoning examples into 32k-token sequences
  • Liger kernel: Fused operations for memory-efficient long-sequence training
  • trust_remote_code: Required for custom SmolLM3 architecture
  • Multi-epoch: 5 epochs to deeply learn reasoning patterns
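Sequence packing itself can be sketched as greedy concatenation of tokenized examples into fixed-length windows. This minimal version (token-id lists stand in for real tokenizer output) ignores details a production packer handles, such as cross-example attention masking and document-boundary tokens.

```python
def pack_sequences(examples, max_length):
    """Greedily concatenate tokenized examples into windows of at most `max_length` tokens.

    `examples` is a list of token-id lists. Each window holds as many whole
    examples as fit; examples longer than `max_length` are truncated.
    """
    windows, current = [], []
    for tokens in examples:
        tokens = tokens[:max_length]                  # truncate oversized examples
        if len(current) + len(tokens) > max_length:
            windows.append(current)                   # window full: start a new one
            current = []
        current.extend(tokens)
    if current:
        windows.append(current)                       # flush the last partial window
    return windows

# Toy reasoning chains of varied length, packed into 16-token windows
chains = [[1] * 5, [2] * 7, [3] * 3, [4] * 10, [5] * 2]
packed = pack_sequences(chains, max_length=16)
```

With varied-length reasoning chains, packing keeps each 32k window nearly full instead of padding short examples, which is what makes long-context mid-training GPU-efficient.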
