Implementation:Deepspeedai DeepSpeed ZeRO Init
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Memory_Optimization, Model_Parallelism |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool, provided by the DeepSpeed library, for partitioning model parameters at construction time for ZeRO Stage 3 training.
Description
The deepspeed.zero.Init context manager intercepts torch.nn.Module construction to automatically partition parameters across data-parallel ranks. It inherits from InsertPostInitMethodToModuleSubClasses, which monkey-patches the __init__ method of all nn.Module subclasses while the context is active. Each parameter created inside the context is therefore partitioned (sharded) across the data-parallel group as soon as the constructor of the submodule that owns it returns, rather than after the whole model has been built.
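The patching mechanism described above can be illustrated without DeepSpeed or PyTorch. The following is a minimal sketch, not DeepSpeed's implementation: `Module`, `Linear`, and `ShardOnInit` are hypothetical toy classes, parameters are plain Python lists, and the "rank" is just an integer passed in by hand.

```python
import functools
import math


class Module:
    """Toy stand-in for torch.nn.Module (hypothetical, illustration only)."""

    def __init__(self):
        self.params = {}      # name -> list of floats
        self.sharded = set()  # names already reduced to a local shard


class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.params["weight"] = [float(i) for i in range(n_in * n_out)]


def all_subclasses(cls):
    """Collect every direct and indirect subclass of cls."""
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found


class ShardOnInit:
    """Toy context manager: shard a module's params once its __init__ returns."""

    def __init__(self, rank, world_size):
        self.rank, self.world_size = rank, world_size
        self._originals = {}

    def _shard(self, module):
        for name, values in list(module.params.items()):
            if name in module.sharded:
                continue  # already sharded by a wrapped parent-class __init__
            size = math.ceil(len(values) / self.world_size)
            module.params[name] = values[self.rank * size:(self.rank + 1) * size]
            module.sharded.add(name)

    def _wrap(self, original):
        @functools.wraps(original)
        def wrapper(module, *args, **kwargs):
            original(module, *args, **kwargs)  # run the real constructor first
            self._shard(module)                # then apply the post-init hook
        return wrapper

    def __enter__(self):
        for cls in all_subclasses(Module):
            if "__init__" in cls.__dict__:  # patch only classes defining __init__
                self._originals[cls] = cls.__dict__["__init__"]
                cls.__init__ = self._wrap(self._originals[cls])
        return self

    def __exit__(self, *exc_info):
        for cls, original in self._originals.items():
            cls.__init__ = original  # restore the unpatched constructors
        return False


with ShardOnInit(rank=1, world_size=4):
    inside = Linear(4, 4)   # patched: keeps only rank 1's quarter of the weight
outside = Linear(4, 4)      # constructors restored: keeps all 16 elements
```

The sketch only patches classes that already exist when the context is entered and ignores devices, dtypes, and communication, all of which the real context manager has to handle.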
Key features:
- Automatic sharding: Parameters are partitioned as soon as they are created, so the full model never materializes on a single GPU
- CPU/NVMe offloading: Supports offloading parameter shards to CPU or NVMe via the remote_device parameter
- Quantized weights: Supports quantized weight storage via zero_quantized_weights for further memory savings
- Parameter persistence: Small parameters below param_persistence_threshold can be kept on all ranks to avoid communication overhead
- Configurable dtype: Parameters can be converted to a specified dtype (e.g., fp16, bf16) during partitioning
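To make the sharding and persistence features concrete, here is a small sketch of the bookkeeping involved. It assumes equal-sized contiguous shards with any padding placed on the last ranks, and it treats "below the threshold" as a strict comparison; both are illustrative assumptions, and `shard_range` and `resident_elements` are hypothetical helper names rather than DeepSpeed functions.

```python
import math


def shard_range(numel, rank, world_size):
    """Return (start, stop, padding) for the contiguous shard owned by rank."""
    shard = math.ceil(numel / world_size)  # every rank gets the same shard size
    start = min(rank * shard, numel)
    stop = min(start + shard, numel)
    return start, stop, shard - (stop - start)


def resident_elements(param_sizes, rank, world_size, persistence_threshold):
    """Count real (unpadded) elements one rank keeps for a list of parameters."""
    total = 0
    for numel in param_sizes:
        if numel < persistence_threshold:
            total += numel  # small parameter: kept whole on every rank
        else:
            start, stop, _ = shard_range(numel, rank, world_size)
            total += stop - start  # large parameter: only the local shard
    return total


# A 10-element parameter over 4 ranks: shards of 3, 3, 3, and 1 (+2 padding).
layout = [shard_range(10, r, 4) for r in range(4)]

# One small bias (10 elements) and one large weight (1000 elements) on rank 0.
on_rank0 = resident_elements([10, 1000], rank=0, world_size=4,
                             persistence_threshold=100)
```

With these inputs the bias stays whole on every rank while the weight contributes only a quarter of its elements, which is the trade-off the persistence threshold controls: a little extra memory in exchange for skipping a gather on tiny tensors.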
Usage
Wrap model instantiation with the Init context manager when using ZeRO Stage 3. The context can also accept a pre-constructed module to partition it after construction.
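Because Init is only meaningful for Stage 3, it is common to derive the enabled flag from the same configuration that is later handed to DeepSpeed. The sketch below builds such a configuration as a Python dictionary; the values are illustrative, the key names reflect the DeepSpeed JSON config schema as commonly documented and should be checked against the version in use, and `zero3_enabled` is a hypothetical helper, not part of DeepSpeed.

```python
import json

# Illustrative values only; tune them for your model and hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_param_persistence_threshold": 100_000,
    },
}


def zero3_enabled(config):
    """Hypothetical helper: True only when the config selects ZeRO stage 3."""
    return config.get("zero_optimization", {}).get("stage") == 3


# JSON text that could be saved as ds_config.json and passed by path instead.
config_json = json.dumps(ds_config, indent=2)
```

The dictionary can be passed as config_dict_or_path, and `zero3_enabled(ds_config)` can supply the enabled argument so the same script runs unchanged at lower ZeRO stages.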
Code Reference
Source Location
- Repository: DeepSpeed
- File: deepspeed/runtime/zero/partition_parameters.py
- Lines: 884-909 (class definition and __init__)
Signature
class Init(InsertPostInitMethodToModuleSubClasses):
    param_id = 0
    param_persistence_threshold = get_config_default(
        DeepSpeedZeroConfig, "param_persistence_threshold"
    )
    model_persistence_threshold = get_config_default(
        DeepSpeedZeroConfig, "model_persistence_threshold"
    )
    num_persisted_parameters = 0
    num_persisted_elements = 0
    apply_param_persistence = False

    def __init__(self,
                 module=None,
                 data_parallel_group=None,
                 mem_efficient_linear=True,
                 remote_device=None,
                 pin_memory=False,
                 config_dict_or_path=None,
                 config=None,
                 enabled=True,
                 dtype=None,
                 mpu=None,
                 zero_param_parallel_group=None,
                 zero_quantized_weights=False,
                 zero_quantized_nontrainable_weights=False,
                 sequence_data_parallel_group=None,
                 param_swapper=None,
                 tensor_overrides=DEFAULT_TENSOR_OVERRIDES):
Import
from deepspeed.runtime.zero.partition_parameters import Init
# Or via the public API:
import deepspeed
deepspeed.zero.Init
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| module | torch.nn.Module | No | If provided, partition the model as if it was constructed in the context |
| data_parallel_group | process group | No | The group of processes to partition among; defaults to all processes |
| mem_efficient_linear | bool | No | Replace torch.nn.functional.linear with a memory-efficient implementation (default: True) |
| remote_device | str | No | Initial device for model weights: cpu, nvme, or None for GPU |
| pin_memory | bool | No | Use pinned (page-locked) CPU memory for faster host-to-device transfers; relevant when remote_device is cpu (default: False) |
| config_dict_or_path | Union[str, dict] | No | DeepSpeed configuration file path or dictionary |
| config | Union[str, dict] | No | Deprecated alias for config_dict_or_path; prefer config_dict_or_path |
| enabled | bool | No | Enable or disable partitioning (default: True) |
| dtype | torch.dtype | No | Data type for parameters (e.g., torch.float16, torch.bfloat16) |
| mpu | object | No | Model parallelism unit |
| zero_quantized_weights | bool | No | Enable quantized weight storage (default: False) |
| zero_quantized_nontrainable_weights | bool | No | Enable quantized storage for non-trainable weights (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| (context effect) | torch.nn.Module | Model with parameters automatically partitioned across data-parallel ranks; each rank holds 1/N of each parameter |
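The "1/N of each parameter" effect can be turned into a quick back-of-the-envelope estimate. The sketch below is simple arithmetic under stated assumptions: it counts parameter storage only (no gradients, optimizer states, or activations), ignores persisted small parameters, and uses a hypothetical helper name, `param_bytes_per_rank`.

```python
import math

# Bytes per element for a few common parameter dtypes.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def param_bytes_per_rank(num_params, dtype, world_size):
    """Approximate parameter bytes one rank holds when sharded evenly."""
    return math.ceil(num_params / world_size) * BYTES_PER_ELEMENT[dtype]


# Example: a 7-billion-parameter model stored in fp16.
full = param_bytes_per_rank(7_000_000_000, "fp16", world_size=1)
sharded = param_bytes_per_rank(7_000_000_000, "fp16", world_size=8)

print(f"unsharded: {full / 2**30:.2f} GiB")
print(f"per rank on 8 GPUs: {sharded / 2**30:.2f} GiB")
```

The same function shows why the dtype argument matters: choosing a 2-byte dtype halves the per-rank figure relative to fp32 before any sharding is applied.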
Usage Examples
import deepspeed
import torch

# MyLargeModel, MyVeryLargeModel, and MyModel are placeholders for
# user-defined torch.nn.Module subclasses.

# Basic usage: partition during model construction
with deepspeed.zero.Init():
    model = MyLargeModel(hidden_size=8192, num_layers=96)

# With CPU offloading for extremely large models
with deepspeed.zero.Init(remote_device="cpu", pin_memory=True):
    model = MyVeryLargeModel()

# With explicit config
with deepspeed.zero.Init(config_dict_or_path="ds_config.json"):
    model = MyLargeModel()

# Partition a pre-existing module
model = MyLargeModel()
with deepspeed.zero.Init(module=model):
    pass  # model is now partitioned

# Conditionally enable based on ZeRO stage
zero_stage = 3
with deepspeed.zero.Init(enabled=(zero_stage == 3)):
    model = MyModel()  # Only partitioned if stage 3