Implementation:Microsoft DeepSpeedExamples DeepSpeed Initialize CIFAR
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation |
| Repository | Microsoft/DeepSpeedExamples |
| Title | DeepSpeed_Initialize_CIFAR |
| Type | Wrapper Doc |
| Source File | training/cifar/cifar10_deepspeed.py |
| Lines | 117-163 (get_ds_config), 280-357 (main initialization sequence) |
| Implements | Principle:Microsoft_DeepSpeedExamples_DeepSpeed_Engine_Init |
Overview
Concrete usage of deepspeed.initialize() for the CIFAR-10 tutorial with optional MoE support.
Description
The DeepSpeed initialization in the CIFAR-10 example involves two coordinated components:
- get_ds_config(args) (Lines 117-163) -- A factory function that builds the DeepSpeed JSON configuration dictionary from parsed CLI arguments. It maps user-facing arguments (--dtype, --stage) into the structured configuration that DeepSpeed expects.
- The initialization sequence in main(args) (Lines 280-357) -- The orchestration code that sets up distributed training, creates the model, builds the config, and calls deepspeed.initialize() to produce the engine. This sequence also handles data preparation with rank-aware barriers to prevent download races.
The initialization call returns four objects: the model_engine (DeepSpeedEngine wrapping the model), the optimizer (created by DeepSpeed based on config), the trainloader (distributed DataLoader created from the training dataset), and a learning rate scheduler (unused in this example, captured as __).
Code Reference
get_ds_config (Lines 117-163)
File: training/cifar/cifar10_deepspeed.py
def get_ds_config(args):
    """Get the DeepSpeed configuration dictionary."""
    ds_config = {
        "train_batch_size": 16,
        "steps_per_print": 2000,
        "optimizer": {
            "type": "Adam",
            "params": {
                "lr": 0.001,
                "betas": [0.8, 0.999],
                "eps": 1e-8,
                "weight_decay": 3e-7,
            },
        },
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": 0,
                "warmup_max_lr": 0.001,
                "warmup_num_steps": 1000,
            },
        },
        "gradient_clipping": 1.0,
        "prescale_gradients": False,
        "bf16": {"enabled": args.dtype == "bf16"},
        "fp16": {
            "enabled": args.dtype == "fp16",
            "fp16_master_weights_and_grads": False,
            "loss_scale": 0,
            "loss_scale_window": 500,
            "hysteresis": 2,
            "min_loss_scale": 1,
            "initial_scale_power": 15,
        },
        "wall_clock_breakdown": False,
        "zero_optimization": {
            "stage": args.stage,
            "allgather_partitions": True,
            "reduce_scatter": True,
            "allgather_bucket_size": 50000000,
            "reduce_bucket_size": 50000000,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "cpu_offload": False,
        },
    }
    return ds_config
Initialization Sequence in main() (Lines 280-357)
File: training/cifar/cifar10_deepspeed.py
def main(args):
    # Initialize DeepSpeed distributed backend.
    deepspeed.init_distributed()
    _local_rank = int(os.environ.get("LOCAL_RANK"))
    get_accelerator().set_device(_local_rank)
    # Step 1. Data Preparation with rank-aware barriers.
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    )
    if torch.distributed.get_rank() != 0:
        # Might be downloading cifar data, let rank 0 download first.
        torch.distributed.barrier()
    trainset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform
    )
    testset = torchvision.datasets.CIFAR10(
        root="./data", train=False, download=True, transform=transform
    )
    if torch.distributed.get_rank() == 0:
        # Cifar data is downloaded, indicate other ranks can proceed.
        torch.distributed.barrier()
    # Step 2. Define the network with DeepSpeed.
    net = Net(args)
    # Get list of parameters that require gradients.
    parameters = filter(lambda p: p.requires_grad, net.parameters())
    # If using MoE, create separate param groups for each expert.
    if args.moe_param_group:
        parameters = create_moe_param_groups(net)
    # Initialize DeepSpeed engine.
    ds_config = get_ds_config(args)
    model_engine, optimizer, trainloader, __ = deepspeed.initialize(
        args=args,
        model=net,
        model_parameters=parameters,
        training_data=trainset,
        config=ds_config,
    )
    # Get the local device name (str) and local rank (int).
    local_device = get_accelerator().device_name(model_engine.local_rank)
    local_rank = model_engine.local_rank
    # For float32, target_dtype will be None so no datatype conversion needed.
    target_dtype = None
    if model_engine.bfloat16_enabled():
        target_dtype = torch.bfloat16
    elif model_engine.fp16_enabled():
        target_dtype = torch.half
Signature
def get_ds_config(args: argparse.Namespace) -> dict:
    """Build DeepSpeed configuration dictionary from CLI arguments.

    Args:
        args: Parsed arguments containing dtype and stage settings.

    Returns:
        dict: DeepSpeed configuration with optimizer, scheduler, precision,
            and ZeRO settings.
    """
I/O Contract
get_ds_config
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | args | argparse.Namespace | Must contain args.dtype (str) and args.stage (int) |
| Output | ds_config | dict | DeepSpeed JSON-compatible configuration dictionary |
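The contract can be exercised without launching a training job. The sketch below is a minimal illustration, not the tutorial's code: `build_precision_config` is a hypothetical, trimmed stand-in for get_ds_config that keeps only the keys derived from args, and the Namespace is built by hand instead of being parsed from the command line.

```python
import argparse
import json


def build_precision_config(args: argparse.Namespace) -> dict:
    """Hypothetical, trimmed stand-in for get_ds_config (args-dependent keys only)."""
    return {
        "train_batch_size": 16,
        "bf16": {"enabled": args.dtype == "bf16"},
        "fp16": {"enabled": args.dtype == "fp16"},
        "zero_optimization": {"stage": args.stage},
    }


# Simulate parsed CLI arguments without touching sys.argv.
args = argparse.Namespace(dtype="bf16", stage=1)
config = build_precision_config(args)

# The dictionary should survive a JSON round trip, because it plays the
# same role as a DeepSpeed JSON config file.
round_tripped = json.loads(json.dumps(config))
print(json.dumps(config, indent=2))
```

Because the two precision flags are derived from the same args.dtype string, at most one of bf16 and fp16 is enabled; any other value leaves both disabled.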
deepspeed.initialize Call
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | args | argparse.Namespace | CLI arguments including DeepSpeed flags |
| Input | model | nn.Module | Raw PyTorch model (Net instance) |
| Input | model_parameters | iterator or list[dict] | Trainable parameters or MoE param groups |
| Input | training_data | torch.utils.data.Dataset | CIFAR-10 training dataset |
| Input | config | dict | DeepSpeed configuration from get_ds_config() |
| Output | model_engine | DeepSpeedEngine | Wrapped model with distributed training capabilities |
| Output | optimizer | Optimizer | DeepSpeed-managed optimizer (Adam) |
| Output | trainloader | DataLoader | Distributed data loader with DistributedSampler |
| Output | lr_scheduler | LRScheduler or None | Learning rate scheduler (WarmupLR) |
Configuration Parameters
Optimizer (Adam)
| Parameter | Value | Notes |
|---|---|---|
| type | Adam | DeepSpeed's fused Adam implementation |
| lr | 0.001 | Learning rate |
| betas | [0.8, 0.999] | Adam beta parameters (note: beta1=0.8 instead of typical 0.9) |
| eps | 1e-8 | Numerical stability epsilon |
| weight_decay | 3e-7 | L2 regularization |
Scheduler (WarmupLR)
| Parameter | Value | Notes |
|---|---|---|
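| type | WarmupLR | Warms up from min to max LR, then holds at max LR. warmup_type is not set here, so DeepSpeed's default (log) applies rather than a linear ramp |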
| warmup_min_lr | 0 | Starting learning rate |
| warmup_max_lr | 0.001 | Target learning rate (matches optimizer LR) |
| warmup_num_steps | 1000 | Steps to ramp from min to max |
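To make the schedule concrete, the sketch below computes the learning rate at a few steps for the values in this table. It is an approximation written for illustration: `warmup_lr` is a hypothetical helper, and the ramp formulas are assumptions based on DeepSpeed's documented WarmupLR behavior (a warmup_type of "log" or "linear", then a constant rate at warmup_max_lr). Check them against the DeepSpeed version in use.

```python
import math


def warmup_lr(step: int, min_lr: float = 0.0, max_lr: float = 0.001,
              num_steps: int = 1000, warmup_type: str = "log") -> float:
    """Approximate WarmupLR: ramp from min_lr to max_lr, then hold at max_lr."""
    if step >= num_steps:
        return max_lr
    if warmup_type == "log":
        # Assumed log ramp: fast early growth that flattens near num_steps.
        gamma = math.log(step + 1) / math.log(num_steps)
    elif warmup_type == "linear":
        gamma = step / num_steps
    else:
        raise ValueError(f"unknown warmup_type: {warmup_type}")
    return min_lr + (max_lr - min_lr) * gamma


# Compare the two ramp shapes at a few steps.
for step in (0, 9, 99, 999, 5000):
    print(step, warmup_lr(step), warmup_lr(step, warmup_type="linear"))
```

Under these assumptions the log ramp reaches most of the target rate early in the warmup, while the linear ramp grows at a constant pace; both settle at 0.001 after 1000 steps.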
FP16 Settings
| Parameter | Value | Notes |
|---|---|---|
| enabled | args.dtype == "fp16" | Controlled by CLI |
| loss_scale | 0 | Dynamic loss scaling (0 = auto) |
| loss_scale_window | 500 | Window for scaling decisions |
| hysteresis | 2 | Delay shift for dynamic loss scaling: overflows tolerated before the scale is lowered |
| min_loss_scale | 1 | Floor for loss scale |
| initial_scale_power | 15 | Initial scale = 2^15 = 32768 |
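The interaction between these settings is easier to see in a toy model. The sketch below is a simplified dynamic loss scaler written for illustration only; `ToyLossScaler` is hypothetical and is not DeepSpeed's implementation. It assumes a scale factor of 2, that hysteresis delays the decrease after overflows, and that the scale doubles after loss_scale_window overflow-free steps.

```python
class ToyLossScaler:
    """Simplified dynamic loss scaler for illustration; not DeepSpeed's code."""

    def __init__(self, initial_scale_power: int = 15, loss_scale_window: int = 500,
                 hysteresis: int = 2, min_loss_scale: float = 1.0):
        self.scale = 2.0 ** initial_scale_power
        self.window = loss_scale_window
        self.hysteresis = hysteresis
        self.min_scale = min_loss_scale
        self.countdown = hysteresis  # scale is halved once this has reached 1
        self.good_steps = 0          # consecutive overflow-free steps

    def update(self, overflow: bool) -> float:
        """Record one training step and return the scale for the next step."""
        if overflow:
            self.good_steps = 0
            if self.countdown <= 1:
                # Hysteresis exhausted: halve the scale, but never below the floor.
                self.scale = max(self.scale / 2, self.min_scale)
            else:
                self.countdown -= 1
        else:
            self.good_steps += 1
            if self.good_steps % self.window == 0:
                # A full clean window: double the scale and reset the hysteresis.
                self.scale *= 2
                self.countdown = self.hysteresis
        return self.scale


scaler = ToyLossScaler()
after_first = scaler.update(overflow=True)    # tolerated, scale unchanged
after_second = scaler.update(overflow=True)   # scale halved
for _ in range(500):
    after_window = scaler.update(overflow=False)
print(after_first, after_second, after_window)
```

With hysteresis set to 2, a single isolated overflow leaves the scale untouched, which avoids lowering the scale on a one-off spike.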
ZeRO Optimization
| Parameter | Value | Notes |
|---|---|---|
| stage | args.stage | ZeRO stage (0-3) from CLI |
| allgather_partitions | True | AllGather partitioned parameters |
| reduce_scatter | True | Use ReduceScatter for gradient reduction |
| allgather_bucket_size | 50000000 | Communication bucket size (50M elements) |
| reduce_bucket_size | 50000000 | Reduction bucket size (50M elements) |
| overlap_comm | True | Overlap communication with computation |
| contiguous_gradients | True | Pack gradients contiguously in memory |
| cpu_offload | False | Do not offload to CPU |
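Bucket sizes are given in elements, so the memory one bucket occupies depends on the element width. The arithmetic below is a rough estimate under the assumption that a bucket holds elements of the selected training dtype; `bucket_megabytes` is a hypothetical helper, and actual buffer sizes may differ by DeepSpeed version and ZeRO stage.

```python
# Bytes per element for the dtypes selectable through --dtype.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def bucket_megabytes(num_elements: int, dtype: str) -> float:
    """Estimate the size of one communication bucket in megabytes (10^6 bytes)."""
    return num_elements * BYTES_PER_ELEMENT[dtype] / 1e6


for dtype in ("fp32", "fp16", "bf16"):
    print(dtype, bucket_megabytes(50_000_000, dtype), "MB")
```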
Usage Example
# The standard initialization pattern:
args = add_argument()
# Build config from args
ds_config = get_ds_config(args)
# Create model
net = Net(args)
parameters = filter(lambda p: p.requires_grad, net.parameters())
# If MoE with ZeRO, need separate param groups
if args.moe_param_group:
    parameters = create_moe_param_groups(net)
# Initialize DeepSpeed -- replaces manual optimizer, scheduler, DDP, DataLoader
# (trainset is the CIFAR-10 training dataset built during data preparation in main())
model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args,
    model=net,
    model_parameters=parameters,
    training_data=trainset,
    config=ds_config,
)
# Query engine for device and dtype info
local_device = get_accelerator().device_name(model_engine.local_rank)
target_dtype = torch.bfloat16 if model_engine.bfloat16_enabled() else \
    torch.half if model_engine.fp16_enabled() else None
Data Download Barrier Pattern
The initialization includes a rank-aware barrier pattern to prevent race conditions during dataset download:
Rank 0                    Rank 1..N
  |                          |
  |                     [barrier -- wait]
  |                          |
[download CIFAR-10]          |
  |                          |
[barrier -- signal]     [barrier -- proceed]
  |                          |
[continue]              [load cached data]
This ensures only rank 0 downloads the data while other ranks wait, then all ranks proceed with the locally cached dataset.
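The ordering can be reproduced in a single process. The sketch below simulates the pattern with threads standing in for ranks and threading.Barrier standing in for torch.distributed.barrier(); the names `load_dataset`, `worker`, `cache`, and `downloads` are hypothetical and exist only for this illustration.

```python
import threading

WORLD_SIZE = 4
barrier = threading.Barrier(WORLD_SIZE)  # stands in for torch.distributed.barrier()
cache = {}      # stands in for the ./data directory
downloads = []  # ranks that actually performed a download
seen = {}       # what each rank ended up loading


def load_dataset(rank: int) -> str:
    """Download on a cache miss, otherwise reuse the cached copy."""
    if "cifar10" not in cache:
        cache["cifar10"] = "dataset-bytes"
        downloads.append(rank)
    return cache["cifar10"]


def worker(rank: int) -> None:
    if rank != 0:
        barrier.wait()  # non-zero ranks block until rank 0 reaches its barrier
    data = load_dataset(rank)
    if rank == 0:
        barrier.wait()  # rank 0 arrives only after downloading, releasing the rest
    seen[rank] = data


threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("downloaded by ranks:", downloads)
```

Each rank passes through the barrier exactly once, just at a different point in its own code path, which is why the two conditional barrier calls pair up rather than deadlock.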
Related Pages
- Principle:Microsoft_DeepSpeedExamples_DeepSpeed_Engine_Init -- The principle this implementation realizes
- Implementation:Microsoft_DeepSpeedExamples_Add_Argument_CIFAR -- Produces the args consumed by this initialization
- Implementation:Microsoft_DeepSpeedExamples_Net_DeepSpeed -- The model wrapped by the engine
- Implementation:Microsoft_DeepSpeedExamples_Test_Function_CIFAR -- Uses the model_engine produced here
- Environment:Microsoft_DeepSpeedExamples_CIFAR10_Training_Environment