Principle: deepspeedai/DeepSpeed - DeepSpeed Configuration
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Configuration_Management, Memory_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A configuration-driven approach to specifying distributed training parameters through a JSON schema that controls ZeRO optimization, mixed precision, batch sizing, and scheduling.
Description
DeepSpeed Configuration uses a JSON configuration file (or dictionary) to control all aspects of distributed training without requiring code changes. This includes:
- ZeRO stage selection (0-3) controlling optimizer state, gradient, and parameter partitioning
- Optimizer configuration (Adam, AdamW, LAMB, Muon, etc.) with per-parameter settings
- Mixed precision settings (fp16, bf16, AMP) with loss scaling policies
- Gradient accumulation steps for effective batch size scaling
- Batch sizing with automatic micro-batch calculation from world size
- Learning rate scheduling (WarmupLR, WarmupDecayLR, OneCycle, etc.)
- Offloading policies for CPU and NVMe offload of optimizer states and parameters
The configuration-driven design separates training infrastructure concerns from model code, enabling rapid experimentation by changing only the JSON file.
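A minimal sketch of such a file, with illustrative values covering several of the dimensions listed above (field names follow DeepSpeed's documented JSON schema; only a subset of options is shown):
{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 4,
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 3e-5, "betas": [0.9, 0.999], "weight_decay": 0.01 }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": { "warmup_min_lr": 0, "warmup_max_lr": 3e-5, "warmup_num_steps": 1000 }
  },
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  }
}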
Usage
Create a JSON configuration file specifying the desired training parameters. Pass the file path (or a Python dictionary) to deepspeed.initialize() via the config parameter. The configuration is parsed and validated by the DeepSpeedConfig class, which resolves world size, batch size, and all optimization settings.
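A minimal sketch of this call, assuming the JSON above is saved as ds_config.json and model is an already-constructed torch.nn.Module (both names are illustrative):
import deepspeed

# Pass the config file path; a Python dict is accepted here as well
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)
The returned engine wraps the model; training then proceeds through engine.backward() and engine.step(), with ZeRO, precision, and batching behavior governed by the config.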
Theoretical Basis
A declarative configuration pattern separates what optimization strategies to use from how they are implemented. The JSON config acts as a contract between the user's training script and the DeepSpeed runtime, enabling reproducibility and easy experimentation.
Key configuration dimensions and their relationships:
- ZeRO stages control the memory-communication trade-off:
- Stage 0: No partitioning (standard data parallelism)
- Stage 1: Partition optimizer states across ranks
- Stage 2: Additionally partition gradients (reduce-scatter instead of allreduce)
- Stage 3: Additionally partition parameters (AllGather on demand)
- Batch size invariant: train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * data_parallel_world_size (worked through in the sketch after this list)
- Mixed precision: fp16 or bf16 halves memory for activations and gradients relative to fp32; fp16 additionally requires loss scaling to maintain numerical stability
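As a concrete illustration of the batch size invariant (values chosen arbitrarily):
# train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * data_parallel_world_size
train_batch_size = 32
gradient_accumulation_steps = 4
data_parallel_world_size = 4
micro_batch_per_gpu = train_batch_size // (gradient_accumulation_steps * data_parallel_world_size)
assert micro_batch_per_gpu * gradient_accumulation_steps * data_parallel_world_size == train_batch_size
print(micro_batch_per_gpu)  # 2 samples per GPU per forward/backward pass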
Pseudo-code:
# Abstract configuration-driven training setup
import deepspeed
import torch

model = torch.nn.Linear(512, 512)  # stand-in for any torch.nn.Module

config = {
    "zero_optimization": {"stage": 2},           # partition optimizer states and gradients
    "fp16": {"enabled": True, "loss_scale": 0},  # loss_scale 0 selects dynamic loss scaling
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
}

# Config is parsed and validated once (DeepSpeedConfig) and drives all runtime behavior
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=config
)