Principle: deepspeedai/DeepSpeed - DeepSpeed Configuration
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Configuration_Management, Memory_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A configuration-driven approach to specifying distributed training parameters through a JSON schema that controls ZeRO optimization, mixed precision, batch sizing, and scheduling.
Description
DeepSpeed Configuration uses a JSON configuration file (or dictionary) to control all aspects of distributed training without requiring code changes. This includes:
- ZeRO stage selection (0-3) controlling optimizer state, gradient, and parameter partitioning
- Optimizer configuration (Adam, AdamW, LAMB, Muon, etc.) with per-parameter settings
- Mixed precision settings (fp16, bf16, AMP) with loss scaling policies
- Gradient accumulation steps for effective batch size scaling
- Batch sizing with automatic micro-batch calculation from world size
- Learning rate scheduling (WarmupLR, WarmupDecayLR, OneCycle, etc.)
- Offloading policies for CPU and NVMe offload of optimizer states and parameters
The configuration-driven design separates training infrastructure concerns from model code, enabling rapid experimentation by changing only the JSON file.
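A minimal sketch of such a file, with illustrative values covering several of the dimensions listed above (field names follow DeepSpeed's documented JSON schema; only a subset of options is shown):
{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 4,
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 3e-5, "betas": [0.9, 0.999], "weight_decay": 0.01 }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": { "warmup_min_lr": 0, "warmup_max_lr": 3e-5, "warmup_num_steps": 1000 }
  },
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  }
}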
Usage
Create a JSON configuration file specifying the desired training parameters. Pass the file path (or a Python dictionary) to deepspeed.initialize() via the config parameter. The configuration is parsed and validated by the DeepSpeedConfig class, which resolves world size, batch size, and all optimization settings.
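A minimal sketch of this call, assuming the JSON above is saved as ds_config.json and model is an already-constructed torch.nn.Module (both names are illustrative):
import deepspeed

# Pass the config file path; a Python dict is accepted here as well
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)
The returned engine wraps the model; training then proceeds through engine.backward() and engine.step(), with ZeRO, precision, and batching behavior governed by the config.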
Theoretical Basis
A declarative configuration pattern separates what optimization strategies to use from how they are implemented. The JSON config acts as a contract between the user's training script and the DeepSpeed runtime, enabling reproducibility and easy experimentation.
Key configuration dimensions and their relationships:
- ZeRO stages control the memory-communication trade-off:
- Stage 0: No partitioning (standard data parallelism)
- Stage 1: Partition optimizer states across ranks
- Stage 2: Additionally partition gradients (reduce-scatter instead of allreduce)
- Stage 3: Additionally partition parameters (AllGather on demand)
- Batch size invariant: train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * data_parallel_world_size (worked through in the sketch after this list)
- Mixed precision: fp16 or bf16 halves memory for activations and gradients relative to fp32; fp16 additionally requires loss scaling to maintain numerical stability
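As a concrete illustration of the batch size invariant (values chosen arbitrarily):
# train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * data_parallel_world_size
train_batch_size = 32
gradient_accumulation_steps = 4
data_parallel_world_size = 4
micro_batch_per_gpu = train_batch_size // (gradient_accumulation_steps * data_parallel_world_size)
assert micro_batch_per_gpu * gradient_accumulation_steps * data_parallel_world_size == train_batch_size
print(micro_batch_per_gpu)  # 2 samples per GPU per forward/backward pass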
Pseudo-code:
# Abstract configuration-driven training setup
import deepspeed
import torch

model = torch.nn.Linear(512, 512)  # stand-in for any torch.nn.Module

config = {
    "zero_optimization": {"stage": 2},           # partition optimizer states and gradients
    "fp16": {"enabled": True, "loss_scale": 0},  # loss_scale 0 selects dynamic loss scaling
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
}

# Config is parsed and validated once (DeepSpeedConfig) and drives all runtime behavior
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=config
)