
Principle:Deepspeedai DeepSpeed DeepSpeed Configuration

From Leeroopedia


Knowledge Sources
Domains Distributed_Training, Configuration_Management, Memory_Optimization
Last Updated 2026-02-09 00:00 GMT

Overview

A configuration-driven approach to specifying distributed training parameters through a JSON schema that controls ZeRO optimization, mixed precision, batch sizing, and scheduling.

Description

DeepSpeed Configuration uses a JSON configuration file (or dictionary) to control all aspects of distributed training without requiring code changes. This includes:

  • ZeRO stage selection (0-3) controlling optimizer state, gradient, and parameter partitioning
  • Optimizer configuration (Adam, AdamW, LAMB, Muon, etc.) with per-parameter settings
  • Mixed precision settings (fp16, bf16, AMP) with loss scaling policies
  • Gradient accumulation steps for effective batch size scaling
  • Batch sizing with automatic micro-batch calculation from world size
  • Learning rate scheduling (WarmupLR, WarmupDecayLR, OneCycle, etc.)
  • Offloading policies for CPU and NVMe offload of optimizer states and parameters

The configuration-driven design separates training infrastructure concerns from model code, enabling rapid experimentation by changing only the JSON file.
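For illustration, a JSON file exercising several of the facets above might look like the following sketch (the keys follow DeepSpeed's documented config schema; the specific values are arbitrary examples, not recommended defaults):

```json
{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 4,
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 1e-4, "weight_decay": 0.01 }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": { "warmup_num_steps": 1000 }
  },
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}
```

Swapping, say, `"stage": 2` for `"stage": 3` or `"bf16"` for `"fp16"` changes the training strategy with no edits to model code.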

Usage

Create a JSON configuration file specifying the desired training parameters. Pass the file path (or a Python dictionary) to deepspeed.initialize() via the config parameter. The configuration is parsed and validated by the DeepSpeedConfig class, which resolves world size, batch size, and all optimization settings.
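Because `config` accepts either a file path or an equivalent Python dictionary, the two forms are interchangeable. A minimal standard-library sketch of that equivalence (no DeepSpeed runtime required; the settings shown are illustrative):

```python
# Sketch: the same settings can reach deepspeed.initialize() either as a
# JSON file path or as a Python dict. A JSON round-trip shows the two
# forms carry identical content.
import json
import tempfile

ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}

# Form 1: write the dict to a file and pass its path via config=...
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(ds_config, f)
    path = f.name

# Form 2: pass the dict directly via config=...
with open(path) as f:
    loaded = json.load(f)

assert loaded == ds_config  # identical settings either way
```

In practice the file form is convenient for launcher scripts and experiment tracking, while the dict form suits programmatic sweeps.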

Theoretical Basis

The JSON config follows the declarative configuration pattern: it specifies what optimization strategies to use, not how they are implemented. The config acts as a contract between the user's training script and the DeepSpeed runtime, enabling reproducibility and easy experimentation.

Key configuration dimensions and their relationships:

  • ZeRO stages control the memory-communication trade-off:
    • Stage 0: No partitioning (standard data parallelism)
    • Stage 1: Partition optimizer states across ranks
    • Stage 2: Additionally partition gradients (reduce-scatter instead of allreduce)
    • Stage 3: Additionally partition parameters (AllGather on demand)
  • Batch size invariant: train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * data_parallel_world_size
  • Mixed precision: fp16 or bf16 reduces memory by 2x for activations and gradients, with loss scaling to maintain numerical stability
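The batch size invariant above is a simple arithmetic consistency check; a small sketch of it (function and parameter names here are illustrative, not DeepSpeed's internal ones):

```python
# The invariant DeepSpeed validates at initialization:
# train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps
#                    * data_parallel_world_size
def check_batch_invariant(train_batch, micro_batch_per_gpu,
                          grad_accum, dp_world_size):
    return train_batch == micro_batch_per_gpu * grad_accum * dp_world_size

# Consistent: 32 total = 1 per GPU * 4 accumulation steps * 8 DP ranks
assert check_batch_invariant(32, 1, 4, 8)
# Inconsistent settings (2 per GPU would imply 64 total) are rejected
assert not check_batch_invariant(32, 2, 4, 8)
```

Specifying any two of the three free quantities lets the runtime derive the third, which is why the config may omit, for example, the per-GPU micro-batch size.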

Pseudo-code:

# Abstract configuration-driven training setup; `model` is assumed to be
# a torch.nn.Module defined elsewhere.
import deepspeed

config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True, "loss_scale": 0},  # 0 selects dynamic loss scaling
    "zero_optimization": {"stage": 2},
}
# Config is parsed and validated once (by DeepSpeedConfig) and drives
# all runtime behavior of the returned engine
engine, optimizer, _, _ = deepspeed.initialize(model=model, config=config)
