Implementation:Microsoft DeepSpeedExamples DeepSpeed Initialize CIFAR

Metadata

Field	Value
Page Type	Implementation
Repository	Microsoft/DeepSpeedExamples
Title	DeepSpeed_Initialize_CIFAR
Type	Wrapper Doc
Source File	`training/cifar/cifar10_deepspeed.py`
Lines	117-163 (`get_ds_config`), 280-357 (`main` initialization sequence)
Implements	Principle:Microsoft_DeepSpeedExamples_DeepSpeed_Engine_Init

Overview

Concrete usage of deepspeed.initialize() for the CIFAR-10 tutorial with optional MoE support.

Description

The DeepSpeed initialization in the CIFAR-10 example involves two coordinated components:

get_ds_config(args) (Lines 117-163) -- A factory function that builds the DeepSpeed JSON configuration dictionary from parsed CLI arguments. It maps user-facing arguments (--dtype, --stage) into the structured configuration that DeepSpeed expects.

The initialization sequence in main(args) (Lines 280-357) -- The orchestration code that sets up distributed training, creates the model, builds the config, and calls deepspeed.initialize() to produce the engine. This sequence also handles data preparation with rank-aware barriers to prevent download races.

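The initialization call returns four objects: the model_engine (a DeepSpeedEngine wrapping the model), the optimizer (created by DeepSpeed from the config), the trainloader (a distributed DataLoader built from the training dataset), and a learning rate scheduler. The example discards the scheduler handle as __ because it never calls it directly; the WarmupLR schedule from the config still applies, since the engine steps its own scheduler inside model_engine.step().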

Code Reference

get_ds_config (Lines 117-163)

File: training/cifar/cifar10_deepspeed.py

def get_ds_config(args):
    """Get the DeepSpeed configuration dictionary."""
    ds_config = {
        "train_batch_size": 16,
        "steps_per_print": 2000,
        "optimizer": {
            "type": "Adam",
            "params": {
                "lr": 0.001,
                "betas": [0.8, 0.999],
                "eps": 1e-8,
                "weight_decay": 3e-7,
            },
        },
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": 0,
                "warmup_max_lr": 0.001,
                "warmup_num_steps": 1000,
            },
        },
        "gradient_clipping": 1.0,
        "prescale_gradients": False,
        "bf16": {"enabled": args.dtype == "bf16"},
        "fp16": {
            "enabled": args.dtype == "fp16",
            "fp16_master_weights_and_grads": False,
            "loss_scale": 0,
            "loss_scale_window": 500,
            "hysteresis": 2,
            "min_loss_scale": 1,
            "initial_scale_power": 15,
        },
        "wall_clock_breakdown": False,
        "zero_optimization": {
            "stage": args.stage,
            "allgather_partitions": True,
            "reduce_scatter": True,
            "allgather_bucket_size": 50000000,
            "reduce_bucket_size": 50000000,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "cpu_offload": False,
        },
    }
    return ds_config

Initialization Sequence in main() (Lines 280-357)

File: training/cifar/cifar10_deepspeed.py

def main(args):
    # Initialize DeepSpeed distributed backend.
    deepspeed.init_distributed()
    _local_rank = int(os.environ.get("LOCAL_RANK"))
    get_accelerator().set_device(_local_rank)

    # Step 1. Data Preparation with rank-aware barriers.
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )

    if torch.distributed.get_rank() != 0:
        # Might be downloading cifar data, let rank 0 download first.
        torch.distributed.barrier()

    trainset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform
    )
    testset = torchvision.datasets.CIFAR10(
        root="./data", train=False, download=True, transform=transform
    )

    if torch.distributed.get_rank() == 0:
        # Cifar data is downloaded, indicate other ranks can proceed.
        torch.distributed.barrier()

    # Step 2. Define the network with DeepSpeed.
    net = Net(args)

    # Get list of parameters that require gradients.
    parameters = filter(lambda p: p.requires_grad, net.parameters())

    # If using MoE, create separate param groups for each expert.
    if args.moe_param_group:
        parameters = create_moe_param_groups(net)

    # Initialize DeepSpeed engine.
    ds_config = get_ds_config(args)
    model_engine, optimizer, trainloader, __ = deepspeed.initialize(
        args=args,
        model=net,
        model_parameters=parameters,
        training_data=trainset,
        config=ds_config,
    )

    # Get the local device name (str) and local rank (int).
    local_device = get_accelerator().device_name(model_engine.local_rank)
    local_rank = model_engine.local_rank

    # For float32, target_dtype will be None so no datatype conversion needed.
    target_dtype = None
    if model_engine.bfloat16_enabled():
        target_dtype = torch.bfloat16
    elif model_engine.fp16_enabled():
        target_dtype = torch.half

Signature

def get_ds_config(args: argparse.Namespace) -> dict:
    """Build DeepSpeed configuration dictionary from CLI arguments.

    Args:
        args: Parsed arguments containing dtype and stage settings.

    Returns:
        dict: DeepSpeed configuration with optimizer, scheduler, precision,
              and ZeRO settings.
    """

I/O Contract

get_ds_config

Direction	Name	Type	Description
Input	args	`argparse.Namespace`	Must contain `args.dtype` (str) and `args.stage` (int)
Output	ds_config	`dict`	DeepSpeed JSON-compatible configuration dictionary
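
The mapping from CLI arguments to config sections can be illustrated in isolation. The sketch below is a trimmed-down, hypothetical stand-in for get_ds_config (the name precision_and_zero_sections is not in the source file) that keeps only the keys driven by args.dtype and args.stage. It assumes "fp32" is the dtype value used for full precision.

```python
import argparse


def precision_and_zero_sections(args):
    """Hypothetical, reduced version of get_ds_config: only the args-driven keys."""
    return {
        # At most one of the two precision sections is enabled.
        "bf16": {"enabled": args.dtype == "bf16"},
        "fp16": {"enabled": args.dtype == "fp16"},
        # The ZeRO stage is passed through unchanged.
        "zero_optimization": {"stage": args.stage},
    }


bf16_cfg = precision_and_zero_sections(argparse.Namespace(dtype="bf16", stage=1))
fp32_cfg = precision_and_zero_sections(argparse.Namespace(dtype="fp32", stage=0))
print(bf16_cfg)
print(fp32_cfg)
```

With any dtype other than "bf16" or "fp16", both precision sections are disabled and the engine trains in full precision, which is why target_dtype stays None in main().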

deepspeed.initialize Call

Direction	Name	Type	Description
Input	args	`argparse.Namespace`	CLI arguments including DeepSpeed flags
Input	model	`nn.Module`	Raw PyTorch model (Net instance)
Input	model_parameters	`iterator` or `list[dict]`	Trainable parameters or MoE param groups
Input	training_data	`torch.utils.data.Dataset`	CIFAR-10 training dataset
Input	config	`dict`	DeepSpeed configuration from `get_ds_config()`
Output	model_engine	`DeepSpeedEngine`	Wrapped model with distributed training capabilities
Output	optimizer	`Optimizer`	DeepSpeed-managed optimizer (Adam)
Output	trainloader	`DataLoader`	Distributed data loader with `DistributedSampler`
Output	lr_scheduler	`LRScheduler` or `None`	Learning rate scheduler (WarmupLR)

Configuration Parameters

Optimizer (Adam)

Parameter	Value	Notes
type	Adam	DeepSpeed's fused Adam implementation
lr	0.001	Learning rate
betas	[0.8, 0.999]	Adam beta parameters (note: beta1=0.8 instead of typical 0.9)
eps	1e-8	Numerical stability epsilon
weight_decay	3e-7	L2 regularization

Scheduler (WarmupLR)

Parameter	Value	Notes
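type	WarmupLR	Ramps from min to max LR, then holds at max (no warmup_type is set, so DeepSpeed's default ramp applies; recent releases document that default as logarithmic rather than linear)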
warmup_min_lr	0	Starting learning rate
warmup_max_lr	0.001	Target learning rate (matches optimizer LR)
warmup_num_steps	1000	Steps to ramp from min to max
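
To make the warmup values concrete, here is a minimal sketch of a warmup schedule with the parameters above. It is not DeepSpeed's code: the function name warmup_lr is hypothetical, and the logarithmic formula is an assumption modeled on the "log" warmup type that recent DeepSpeed releases document as the default. A linear variant is included for comparison.

```python
import math


def warmup_lr(step, min_lr=0.0, max_lr=0.001, num_steps=1000, warmup_type="log"):
    """Illustrative warmup curve; the exact formula is an assumption, not DeepSpeed's source."""
    if step >= num_steps:
        return max_lr  # After warmup the rate stays at the maximum.
    if warmup_type == "log":
        gamma = math.log(step + 1) / math.log(num_steps)
    else:
        gamma = step / num_steps  # Linear ramp.
    return min_lr + (max_lr - min_lr) * gamma


for s in (0, 10, 100, 500, 1000):
    print(s, warmup_lr(s), warmup_lr(s, warmup_type="linear"))
```

Under this sketch the logarithmic ramp rises much faster in the first steps than the linear one, and both reach warmup_max_lr by step 1000.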

FP16 Settings

Parameter	Value	Notes
enabled	`args.dtype == "fp16"`	Controlled by CLI
loss_scale	0	Dynamic loss scaling (0 = auto)
loss_scale_window	500	Overflow-free steps required before the scale is raised
hysteresis	2	Delay shift: overflows tolerated before the scale is lowered
min_loss_scale	1	Floor for loss scale
initial_scale_power	15	Initial scale = 2^15 = 32768
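
How these settings interact can be shown with a toy dynamic loss scaler. This is a simplified sketch under stated assumptions, not DeepSpeed's implementation: the class ToyDynamicLossScaler and its halve/double policy are hypothetical, chosen only to show the roles of the four parameters in the table.

```python
class ToyDynamicLossScaler:
    """Simplified illustration of dynamic loss scaling (not DeepSpeed's class)."""

    def __init__(self, initial_scale_power=15, loss_scale_window=500,
                 hysteresis=2, min_loss_scale=1):
        self.scale = 2.0 ** initial_scale_power
        self.window = loss_scale_window
        self.hysteresis = hysteresis
        self.remaining = hysteresis  # Overflows still tolerated before lowering.
        self.min_scale = min_loss_scale
        self.clean_steps = 0

    def update(self, overflow):
        if overflow:
            self.clean_steps = 0
            if self.remaining > 1:
                self.remaining -= 1  # Tolerate this overflow.
            else:
                self.scale = max(self.scale / 2, self.min_scale)
        else:
            self.clean_steps += 1
            if self.clean_steps % self.window == 0:
                self.scale *= 2  # A full clean window: raise the scale.
                self.remaining = self.hysteresis


scaler = ToyDynamicLossScaler()
scaler.update(overflow=True)   # First overflow is absorbed by hysteresis.
after_first = scaler.scale
scaler.update(overflow=True)   # Second overflow halves the scale.
after_second = scaler.scale
for _ in range(500):
    scaler.update(overflow=False)
after_window = scaler.scale
print(after_first, after_second, after_window)
```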

ZeRO Optimization

Parameter	Value	Notes
stage	`args.stage`	ZeRO stage (0-3) from CLI
allgather_partitions	True	AllGather partitioned parameters
reduce_scatter	True	Use ReduceScatter for gradient reduction
allgather_bucket_size	50000000	Communication bucket size (50M elements)
reduce_bucket_size	50000000	Reduction bucket size (50M elements)
overlap_comm	True	Overlap communication with computation
contiguous_gradients	True	Pack gradients contiguously in memory
cpu_offload	False	Do not offload to CPU (legacy key; newer DeepSpeed releases document offload_optimizer in its place)

Usage Example

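# The standard initialization pattern, condensed from main().
# Assumes deepspeed.init_distributed() has already been called and that
# trainset is the CIFAR-10 training dataset built in the data preparation step.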
args = add_argument()

# Build config from args
ds_config = get_ds_config(args)

# Create model
net = Net(args)
parameters = filter(lambda p: p.requires_grad, net.parameters())

# If MoE with ZeRO, need separate param groups
if args.moe_param_group:
    parameters = create_moe_param_groups(net)

# Initialize DeepSpeed -- replaces manual optimizer, scheduler, DDP, DataLoader
model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args,
    model=net,
    model_parameters=parameters,
    training_data=trainset,
    config=ds_config,
)

# Query engine for device and dtype info
local_device = get_accelerator().device_name(model_engine.local_rank)
target_dtype = torch.bfloat16 if model_engine.bfloat16_enabled() else \
               torch.half if model_engine.fp16_enabled() else None

Data Download Barrier Pattern

The initialization includes a rank-aware barrier pattern to prevent race conditions during dataset download:

Rank 0                    Rank 1..N
  |                          |
  |                     [barrier -- wait]
  |                          |
  [download CIFAR-10]        |
  |                          |
  [barrier -- signal]   [barrier -- proceed]
  |                          |
  [continue]            [load cached data]

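This ensures that only rank 0 downloads the data while the other ranks wait; once rank 0 reaches its barrier, every rank proceeds and loads the dataset from the ./data cache. Because the check uses torch.distributed.get_rank(), which returns the global rank, the guarantee holds when all ranks see the same ./data directory (a single node or a shared filesystem).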
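
The ordering guarantee can be checked without a GPU or a distributed launcher. The sketch below simulates the pattern with threads standing in for ranks and threading.Barrier standing in for torch.distributed.barrier(); the function name run_download_barrier_demo and the event log are hypothetical, introduced only for illustration.

```python
import threading


def run_download_barrier_demo(world_size=4):
    """Simulate the rank-0-first download pattern with threads as ranks."""
    barrier = threading.Barrier(world_size)
    events = []
    lock = threading.Lock()

    def worker(rank):
        if rank != 0:
            barrier.wait()  # Non-zero ranks wait until rank 0 is done.
        with lock:
            events.append(("download" if rank == 0 else "load_cached", rank))
        if rank == 0:
            barrier.wait()  # Rank 0 releases the waiting ranks.

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


events = run_download_barrier_demo(4)
print(events)
```

Each rank hits the barrier exactly once, so the barrier releases only after rank 0 has finished its work. The first recorded event is therefore always rank 0's download, while the order of the remaining ranks is not fixed.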

Related Pages

Principle:Microsoft_DeepSpeedExamples_DeepSpeed_Engine_Init -- The principle this implementation realizes
Implementation:Microsoft_DeepSpeedExamples_Add_Argument_CIFAR -- Produces the args consumed by this initialization
Implementation:Microsoft_DeepSpeedExamples_Net_DeepSpeed -- The model wrapped by the engine
Implementation:Microsoft_DeepSpeedExamples_Test_Function_CIFAR -- Uses the model_engine produced here
Environment:Microsoft_DeepSpeedExamples_CIFAR10_Training_Environment
