Implementation:Microsoft DeepSpeedExamples DeepSpeed Initialize CIFAR

From Leeroopedia
Revision as of 15:40, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_DeepSpeedExamples_DeepSpeed_Initialize_CIFAR.md)


Metadata

Field | Value
Page Type | Implementation
Repository | Microsoft/DeepSpeedExamples
Title | DeepSpeed_Initialize_CIFAR
Type | Wrapper Doc
Source File | training/cifar/cifar10_deepspeed.py
Lines | 117-163 (get_ds_config), 280-357 (main initialization sequence)
Implements | Principle:Microsoft_DeepSpeedExamples_DeepSpeed_Engine_Init

Overview

Concrete usage of deepspeed.initialize() for the CIFAR-10 tutorial with optional MoE support.

Description

The DeepSpeed initialization in the CIFAR-10 example involves two coordinated components:

  1. get_ds_config(args) (Lines 117-163) -- A factory function that builds the DeepSpeed JSON configuration dictionary from parsed CLI arguments. It maps user-facing arguments (--dtype, --stage) into the structured configuration that DeepSpeed expects.
  2. The initialization sequence in main(args) (Lines 280-357) -- The orchestration code that sets up distributed training, creates the model, builds the config, and calls deepspeed.initialize() to produce the engine. This sequence also handles data preparation with rank-aware barriers to prevent download races.

The initialization call returns four objects: the model_engine (DeepSpeedEngine wrapping the model), the optimizer (created by DeepSpeed based on config), the trainloader (distributed DataLoader created from the training dataset), and a learning rate scheduler (unused in this example, captured as __).

Code Reference

get_ds_config (Lines 117-163)

File: training/cifar/cifar10_deepspeed.py

def get_ds_config(args):
    """Get the DeepSpeed configuration dictionary."""
    ds_config = {
        "train_batch_size": 16,
        "steps_per_print": 2000,
        "optimizer": {
            "type": "Adam",
            "params": {
                "lr": 0.001,
                "betas": [0.8, 0.999],
                "eps": 1e-8,
                "weight_decay": 3e-7,
            },
        },
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": 0,
                "warmup_max_lr": 0.001,
                "warmup_num_steps": 1000,
            },
        },
        "gradient_clipping": 1.0,
        "prescale_gradients": False,
        "bf16": {"enabled": args.dtype == "bf16"},
        "fp16": {
            "enabled": args.dtype == "fp16",
            "fp16_master_weights_and_grads": False,
            "loss_scale": 0,
            "loss_scale_window": 500,
            "hysteresis": 2,
            "min_loss_scale": 1,
            "initial_scale_power": 15,
        },
        "wall_clock_breakdown": False,
        "zero_optimization": {
            "stage": args.stage,
            "allgather_partitions": True,
            "reduce_scatter": True,
            "allgather_bucket_size": 50000000,
            "reduce_bucket_size": 50000000,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "cpu_offload": False,
        },
    }
    return ds_config
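
The config fixes train_batch_size at 16, which DeepSpeed treats as the global batch size: it must equal the per-GPU micro-batch size times the gradient accumulation steps times the number of GPUs. The following is a minimal sketch of that arithmetic; the helper name micro_batch_per_gpu is hypothetical and is not part of the example script or of DeepSpeed.

```python
def micro_batch_per_gpu(train_batch_size, world_size, gradient_accumulation_steps=1):
    """Derive the per-GPU micro-batch size implied by a global batch size.

    Hypothetical helper for illustration only. It mirrors the documented
    relation:
        train_batch_size = micro_batch * gradient_accumulation_steps * world_size
    """
    per_step = world_size * gradient_accumulation_steps
    if train_batch_size % per_step != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"world_size * gradient_accumulation_steps = {per_step}"
        )
    return train_batch_size // per_step


# With the tutorial's train_batch_size of 16:
print(micro_batch_per_gpu(16, world_size=4))  # 4 samples per GPU per step
print(micro_batch_per_gpu(16, world_size=1))  # 16 samples on a single GPU
```

A GPU count that does not divide 16 (for example 3) makes this sketch raise ValueError, which is a cheap check to run before launching.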

Initialization Sequence in main() (Lines 280-357)

File: training/cifar/cifar10_deepspeed.py

def main(args):
    # Initialize DeepSpeed distributed backend.
    deepspeed.init_distributed()
    _local_rank = int(os.environ.get("LOCAL_RANK"))
    get_accelerator().set_device(_local_rank)

    # Step 1. Data Preparation with rank-aware barriers.
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )

    if torch.distributed.get_rank() != 0:
        # Might be downloading cifar data, let rank 0 download first.
        torch.distributed.barrier()

    trainset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform
    )
    testset = torchvision.datasets.CIFAR10(
        root="./data", train=False, download=True, transform=transform
    )

    if torch.distributed.get_rank() == 0:
        # Cifar data is downloaded, indicate other ranks can proceed.
        torch.distributed.barrier()

    # Step 2. Define the network with DeepSpeed.
    net = Net(args)

    # Get list of parameters that require gradients.
    parameters = filter(lambda p: p.requires_grad, net.parameters())

    # If using MoE, create separate param groups for each expert.
    if args.moe_param_group:
        parameters = create_moe_param_groups(net)

    # Initialize DeepSpeed engine.
    ds_config = get_ds_config(args)
    model_engine, optimizer, trainloader, __ = deepspeed.initialize(
        args=args,
        model=net,
        model_parameters=parameters,
        training_data=trainset,
        config=ds_config,
    )

    # Get the local device name (str) and local rank (int).
    local_device = get_accelerator().device_name(model_engine.local_rank)
    local_rank = model_engine.local_rank

    # For float32, target_dtype will be None so no datatype conversion needed.
    target_dtype = None
    if model_engine.bfloat16_enabled():
        target_dtype = torch.bfloat16
    elif model_engine.fp16_enabled():
        target_dtype = torch.half

Signature

def get_ds_config(args: argparse.Namespace) -> dict:
    """Build DeepSpeed configuration dictionary from CLI arguments.

    Args:
        args: Parsed arguments containing dtype and stage settings.

    Returns:
        dict: DeepSpeed configuration with optimizer, scheduler, precision,
              and ZeRO settings.
    """

I/O Contract

get_ds_config

Direction | Name | Type | Description
Input | args | argparse.Namespace | Must contain args.dtype (str) and args.stage (int)
Output | ds_config | dict | DeepSpeed JSON-compatible configuration dictionary
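
To illustrate the contract, here is a trimmed stand-in that keeps only the keys of get_ds_config that depend on args. The function name precision_and_zero_config is hypothetical; the key names and comparisons are taken from the listing above. It can be exercised with a hand-built argparse.Namespace, so no CLI parsing or DeepSpeed install is needed.

```python
import argparse
import json


def precision_and_zero_config(args):
    """Hypothetical, trimmed stand-in for get_ds_config (args-dependent keys only)."""
    return {
        "bf16": {"enabled": args.dtype == "bf16"},
        "fp16": {"enabled": args.dtype == "fp16"},
        "zero_optimization": {"stage": args.stage},
    }


# Build the inputs by hand instead of parsing the command line.
args = argparse.Namespace(dtype="fp16", stage=1)
cfg = precision_and_zero_config(args)

# The result is a plain dict of JSON-compatible values, so it round-trips.
serialized = json.dumps(cfg)
print(serialized)
```

Any dtype value other than "bf16" or "fp16" leaves both precision blocks disabled, which corresponds to the float32 path described for target_dtype in main().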

deepspeed.initialize Call

Direction | Name | Type | Description
Input | args | argparse.Namespace | CLI arguments including DeepSpeed flags
Input | model | nn.Module | Raw PyTorch model (Net instance)
Input | model_parameters | iterator or list[dict] | Trainable parameters or MoE param groups
Input | training_data | torch.utils.data.Dataset | CIFAR-10 training dataset
Input | config | dict | DeepSpeed configuration from get_ds_config()
Output | model_engine | DeepSpeedEngine | Wrapped model with distributed training capabilities
Output | optimizer | Optimizer | DeepSpeed-managed optimizer (Adam)
Output | trainloader | DataLoader | Distributed data loader with DistributedSampler
Output | lr_scheduler | LRScheduler or None | Learning rate scheduler (WarmupLR)

Configuration Parameters

Optimizer (Adam)

Parameter | Value | Notes
type | Adam | DeepSpeed's fused Adam implementation
lr | 0.001 | Learning rate
betas | [0.8, 0.999] | Adam beta parameters (note: beta1=0.8 instead of the typical 0.9)
eps | 1e-8 | Numerical stability epsilon
weight_decay | 3e-7 | L2 regularization

Scheduler (WarmupLR)

Parameter | Value | Notes
type | WarmupLR | Ramps the LR from min to max, then holds it at max (DeepSpeed's default warmup_type is log; linear is optional)
warmup_min_lr | 0 | Starting learning rate
warmup_max_lr | 0.001 | Target learning rate (matches optimizer LR)
warmup_num_steps | 1000 | Steps to ramp from min to max
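
As a rough sketch of how a warmup schedule with these three values behaves, the function below computes the learning rate at a given step. The function name warmup_lr and both ramp formulas are assumptions made for illustration; they describe the general shape of a log or linear warmup and are not copied from DeepSpeed's scheduler code.

```python
import math


def warmup_lr(step, min_lr=0.0, max_lr=0.001, num_steps=1000, warmup_type="log"):
    """Illustrative warmup schedule (assumed formulas, not DeepSpeed's code).

    Ramps from min_lr to max_lr over num_steps, then stays at max_lr.
    """
    if step >= num_steps:
        return max_lr
    if warmup_type == "linear":
        gamma = step / num_steps
    elif warmup_type == "log":
        gamma = math.log(step + 1) / math.log(num_steps)
    else:
        raise ValueError(f"unknown warmup_type: {warmup_type!r}")
    return min_lr + (max_lr - min_lr) * gamma


# Compare the two ramps at a few steps using the tutorial's values.
for step in (0, 10, 100, 500, 1000):
    print(step, warmup_lr(step, warmup_type="log"), warmup_lr(step, warmup_type="linear"))
```

Under these assumed formulas the log ramp climbs much faster in the first few steps than the linear one, while both start at warmup_min_lr and reach warmup_max_lr by step 1000.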

FP16 Settings

Parameter | Value | Notes
enabled | args.dtype == "fp16" | Controlled by CLI
loss_scale | 0 | Dynamic loss scaling (0 = auto)
loss_scale_window | 500 | Window of steps over which the dynamic scale is adjusted
hysteresis | 2 | Delay shift: overflows tolerated before the scale is lowered
min_loss_scale | 1 | Floor for loss scale
initial_scale_power | 15 | Initial scale = 2^15 = 32768
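
The interaction of these settings can be sketched with a small state machine. The class below is a simplified model written for this page: the name LossScaleSketch and the exact update rules are assumptions, and DeepSpeed's real dynamic loss scaler may differ in details such as when the hysteresis counter is restored.

```python
class LossScaleSketch:
    """Simplified dynamic loss scaling (illustrative, not DeepSpeed's code)."""

    def __init__(self, initial_scale_power=15, loss_scale_window=500,
                 hysteresis=2, min_loss_scale=1):
        self.scale = 2.0 ** initial_scale_power
        self.window = loss_scale_window
        self.hysteresis = hysteresis
        self.min_scale = min_loss_scale
        self._remaining = hysteresis  # overflows tolerated before lowering
        self._clean_steps = 0         # consecutive steps without overflow

    def update(self, overflow):
        if overflow:
            self._clean_steps = 0
            if self._remaining <= 1:
                # Tolerance used up: halve the scale, but not below the floor.
                self.scale = max(self.scale / 2, self.min_scale)
            else:
                self._remaining -= 1
        else:
            self._clean_steps += 1
            if self._clean_steps % self.window == 0:
                # A full clean window passed: raise the scale again.
                self.scale *= 2
                self._remaining = self.hysteresis


scaler = LossScaleSketch()
print(scaler.scale)   # starts at 2 ** 15
scaler.update(True)   # first overflow is absorbed by the hysteresis
scaler.update(True)   # second overflow lowers the scale
print(scaler.scale)
```

In this model the scale only grows after loss_scale_window consecutive clean steps, so a run with frequent overflows settles at a lower scale.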

ZeRO Optimization

Parameter | Value | Notes
stage | args.stage | ZeRO stage (0-3) from CLI
allgather_partitions | True | AllGather partitioned parameters
reduce_scatter | True | Use ReduceScatter for gradient reduction
allgather_bucket_size | 50000000 | Communication bucket size (50M elements)
reduce_bucket_size | 50000000 | Reduction bucket size (50M elements)
overlap_comm | True | Overlap communication with computation
contiguous_gradients | True | Pack gradients contiguously in memory
cpu_offload | False | Do not offload to CPU (legacy key; newer DeepSpeed releases use offload_optimizer instead)
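
Because the bucket sizes are expressed in elements rather than bytes, their memory footprint depends on the element width. The sketch below does that conversion. The helper name bucket_bytes is hypothetical, and the assumption that a bucket holds elements of the training dtype is a simplification.

```python
# Bytes per element for the dtypes this example can select.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def bucket_bytes(num_elements, dtype):
    """Approximate size in bytes of a bucket holding num_elements of dtype."""
    return num_elements * BYTES_PER_ELEMENT[dtype]


for dtype in ("fp32", "fp16", "bf16"):
    size_mb = bucket_bytes(50_000_000, dtype) / 1e6
    print(f"{dtype}: {size_mb:.0f} MB per 50M-element bucket")
```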

Usage Example

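# The standard initialization pattern.
# Assumes deepspeed.init_distributed() has already run and trainset is loaded,
# as in main() above.
args = add_argument()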

# Build config from args
ds_config = get_ds_config(args)

# Create model
net = Net(args)
parameters = filter(lambda p: p.requires_grad, net.parameters())

# If MoE with ZeRO, need separate param groups
if args.moe_param_group:
    parameters = create_moe_param_groups(net)

# Initialize DeepSpeed -- replaces manual optimizer, scheduler, DDP, DataLoader
model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args,
    model=net,
    model_parameters=parameters,
    training_data=trainset,
    config=ds_config,
)

# Query engine for device and dtype info
local_device = get_accelerator().device_name(model_engine.local_rank)
target_dtype = torch.bfloat16 if model_engine.bfloat16_enabled() else \
               torch.half if model_engine.fp16_enabled() else None

Data Download Barrier Pattern

The initialization includes a rank-aware barrier pattern to prevent race conditions during dataset download:

Rank 0                    Rank 1..N
  |                          |
  |                     [barrier -- wait]
  |                          |
  [download CIFAR-10]        |
  |                          |
  [barrier -- signal]   [barrier -- proceed]
  |                          |
  [continue]            [load cached data]

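Every rank calls torchvision.datasets.CIFAR10(..., download=True), but the barriers order the calls: rank 0 runs first and performs the download, while the other ranks wait and then find the files already on disk, so they load the cached dataset instead of downloading it again.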
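
The ordering can be reproduced without a GPU or a distributed launcher. In the sketch below, threads stand in for ranks, threading.Barrier stands in for torch.distributed.barrier(), and a dict stands in for the ./data directory. All of these substitutions, and the function name run_ranks, are assumptions made for illustration.

```python
import threading


def run_ranks(world_size=4):
    """Simulate the download barrier pattern with threads as ranks."""
    barrier = threading.Barrier(world_size)
    lock = threading.Lock()
    cache = {}    # stands in for the on-disk dataset directory
    events = []   # ordered log of (rank, action)

    def worker(rank):
        if rank != 0:
            # Non-zero ranks wait here until rank 0 reaches its barrier.
            barrier.wait()

        # Every rank runs the same "load or download" step.
        with lock:
            if "cifar10" not in cache:
                cache["cifar10"] = "dataset"
                events.append((rank, "download"))
            else:
                events.append((rank, "load_cached"))

        if rank == 0:
            # Rank 0 has finished downloading; release the other ranks.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


events = run_ranks(4)
print(events)
```

The first logged event is always the download by rank 0, because no other rank can pass the barrier until rank 0 arrives at it after downloading.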
