Implementation:Deepspeedai DeepSpeed ZeRO Init

Knowledge Sources	DeepSpeed; ZeRO: Memory Optimizations Toward Training Trillion Parameter Models; DeepSpeed ZeRO
Domains	Distributed_Training, Memory_Optimization, Model_Parallelism
Last Updated	2026-02-09 00:00 GMT

Overview

deepspeed.zero.Init is the DeepSpeed library's concrete tool for partitioning model parameters at construction time for ZeRO Stage 3 (ZeRO-3) training.

Description

The deepspeed.zero.Init context manager intercepts torch.nn.Module construction to automatically partition parameters across data-parallel ranks. It inherits from InsertPostInitMethodToModuleSubClasses, which monkey-patches the __init__ method of all nn.Module subclasses so that every new parameter created inside the context is immediately partitioned (sharded) across the data-parallel group.
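The patching idea described above can be illustrated with a toy, dependency-free sketch. It is not DeepSpeed's implementation: Module and Linear are stand-ins for torch.nn.Module and a layer class, and PostInitHook is a hypothetical name chosen for this example. The sketch only shows how a context manager might wrap the __init__ of every subclass so that a hook runs right after each module is constructed, and how the originals are restored on exit.

```python
import functools


class Module:
    """Stand-in for torch.nn.Module (hypothetical, for illustration only)."""

    def __init__(self):
        self.params = {}


class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.params["weight"] = [0.0] * (n_in * n_out)


def _all_subclasses(cls):
    """Yield every direct and indirect subclass of cls."""
    for sub in cls.__subclasses__():
        yield sub
        yield from _all_subclasses(sub)


class PostInitHook:
    """Toy context manager: run `hook(module)` after each subclass __init__."""

    def __init__(self, hook):
        self.hook = hook
        self._saved = {}

    def _wrap(self, original):
        hook = self.hook

        @functools.wraps(original)
        def wrapper(module, *args, **kwargs):
            original(module, *args, **kwargs)
            hook(module)  # a real system would partition parameters here

        return wrapper

    def __enter__(self):
        for cls in _all_subclasses(Module):
            # Skip classes already patched or that do not define __init__.
            if cls in self._saved or "__init__" not in cls.__dict__:
                continue
            self._saved[cls] = cls.__dict__["__init__"]
            cls.__init__ = self._wrap(self._saved[cls])
        return self

    def __exit__(self, *exc):
        # Restore the original constructors.
        for cls, original in self._saved.items():
            cls.__init__ = original
        self._saved.clear()
        return False


seen = []
with PostInitHook(lambda m: seen.append(type(m).__name__)):
    Linear(2, 3)   # hook fires
Linear(2, 3)       # outside the context: hook no longer fires
print(seen)        # prints ['Linear']
```

A production implementation also has to handle nested constructors and classes defined while the context is active; the toy above ignores both.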

Key features:

Automatic sharding: Parameters are partitioned as soon as they are created, so the full model never materializes on a single GPU
CPU/NVMe offloading: Supports offloading parameter shards to CPU or NVMe via the remote_device parameter
Quantized weights: Supports quantized weight storage via zero_quantized_weights for further memory savings
Parameter persistence: Small parameters below param_persistence_threshold can be kept on all ranks to avoid communication overhead
Configurable dtype: Parameters can be converted to a specified dtype (e.g., fp16, bf16) during partitioning
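As a rough illustration of the parameter-persistence feature, the sketch below models one plausible way a per-parameter threshold and a model-wide budget could interact. The field names mirror the class attributes shown in the signature (param_persistence_threshold, model_persistence_threshold, num_persisted_parameters, num_persisted_elements), but the PersistenceBudget class, the exact comparisons, and the sample sizes are assumptions made for this example, not DeepSpeed's code.

```python
from dataclasses import dataclass


@dataclass
class PersistenceBudget:
    """Hypothetical model of the persistence counters (illustration only)."""

    param_threshold: int   # max elements for a single persisted parameter
    model_threshold: int   # max persisted elements across the whole model
    num_persisted_parameters: int = 0
    num_persisted_elements: int = 0

    def should_persist(self, numel: int) -> bool:
        # Assumed rule: a parameter persists only if it is small enough
        # AND the model-wide budget still has room for it.
        fits_param = numel <= self.param_threshold
        fits_model = self.num_persisted_elements + numel <= self.model_threshold
        if fits_param and fits_model:
            self.num_persisted_parameters += 1
            self.num_persisted_elements += numel
            return True
        return False


budget = PersistenceBudget(param_threshold=1_000, model_threshold=1_500)
sizes = {"bias_a": 512, "weight": 1_000_000, "bias_b": 512, "norm": 512}
persisted = {name: budget.should_persist(n) for name, n in sizes.items()}
print(persisted)
```

Under these assumed rules, the large weight is always sharded, and the third small tensor is rejected because the model-wide budget is already used up.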

Usage

Wrap model instantiation with the Init context manager when using ZeRO Stage 3. The context can also accept a pre-constructed module to partition it after construction.

Code Reference

Source Location

Repository: DeepSpeed
File: deepspeed/runtime/zero/partition_parameters.py
Lines: 884-909 (class definition and __init__)

Signature

class Init(InsertPostInitMethodToModuleSubClasses):
    param_id = 0
    param_persistence_threshold = get_config_default(
        DeepSpeedZeroConfig, "param_persistence_threshold"
    )
    model_persistence_threshold = get_config_default(
        DeepSpeedZeroConfig, "model_persistence_threshold"
    )
    num_persisted_parameters = 0
    num_persisted_elements = 0
    apply_param_persistence = False

    def __init__(self,
                 module=None,
                 data_parallel_group=None,
                 mem_efficient_linear=True,
                 remote_device=None,
                 pin_memory=False,
                 config_dict_or_path=None,
                 config=None,
                 enabled=True,
                 dtype=None,
                 mpu=None,
                 zero_param_parallel_group=None,
                 zero_quantized_weights=False,
                 zero_quantized_nontrainable_weights=False,
                 sequence_data_parallel_group=None,
                 param_swapper=None,
                 tensor_overrides=DEFAULT_TENSOR_OVERRIDES):

Import

from deepspeed.runtime.zero.partition_parameters import Init

# Or via the public API: import the package and
# reference the class as deepspeed.zero.Init
import deepspeed

I/O Contract

Inputs

Name	Type	Required	Description
module	torch.nn.Module	No	If provided, partition the model as if it was constructed in the context
data_parallel_group	process group	No	The group of processes to partition among; defaults to all processes
mem_efficient_linear	bool	No	Replace torch.nn.functional.linear with a memory-efficient implementation (default: True)
remote_device	str	No	Initial device for model weights: cpu, nvme, or None for GPU
pin_memory	bool	No	Pin CPU memory for faster transfers (default: False)
config_dict_or_path	Union[str, dict]	No	DeepSpeed configuration file path or dictionary
config	DeepSpeedConfig	No	Pre-parsed DeepSpeed configuration object
enabled	bool	No	Enable or disable partitioning (default: True)
dtype	torch.dtype	No	Data type for parameters (e.g., torch.float16, torch.bfloat16)
mpu	object	No	Model parallelism unit
zero_quantized_weights	bool	No	Enable quantized weight storage (default: False)
zero_quantized_nontrainable_weights	bool	No	Enable quantized storage for non-trainable weights (default: False)

Outputs

Name	Type	Description
(context effect)	torch.nn.Module	Model whose parameters are partitioned across the data-parallel ranks; with N ranks, each rank holds roughly 1/N of each parameter
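The "1/N of each parameter" arithmetic can be sketched in plain Python. The layout here (a flattened parameter split into equal-size shards, with the last shard zero-padded) is an assumption made for illustration; shard_bounds and partition are hypothetical helpers, not DeepSpeed functions.

```python
import math


def shard_bounds(numel, world_size, rank):
    """Return (start, end, padding) for one rank's slice of a flat parameter."""
    shard = math.ceil(numel / world_size)
    start = min(rank * shard, numel)
    end = min(start + shard, numel)
    padding = shard - (end - start)
    return start, end, padding


def partition(flat, world_size):
    """Split a flat list into world_size equal-length, zero-padded shards."""
    shards = []
    for rank in range(world_size):
        start, end, padding = shard_bounds(len(flat), world_size, rank)
        shards.append(flat[start:end] + [0.0] * padding)
    return shards


flat = [float(i) for i in range(10)]   # a 10-element "parameter"
shards = partition(flat, world_size=4)
for rank, shard in enumerate(shards):
    print(rank, shard)
```

With 10 elements and 4 ranks, every shard has 3 slots and the last rank carries 2 padding zeros, which is why the per-rank share is only approximately 1/N.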

Usage Examples

import deepspeed
import torch

# Basic usage: partition during model construction
with deepspeed.zero.Init():
    model = MyLargeModel(hidden_size=8192, num_layers=96)

# With CPU offloading for extremely large models
with deepspeed.zero.Init(remote_device="cpu", pin_memory=True):
    model = MyVeryLargeModel()

# With explicit config
with deepspeed.zero.Init(config_dict_or_path="ds_config.json"):
    model = MyLargeModel()

# Partition a pre-existing module
model = MyLargeModel()
with deepspeed.zero.Init(module=model):
    pass  # model is now partitioned

# Conditionally enable based on ZeRO stage
zero_stage = 3
with deepspeed.zero.Init(enabled=(zero_stage == 3)):
    model = MyModel()  # Only partitioned if stage 3
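The conditional-enable pattern can also be driven from the configuration itself. The helper below is a hypothetical convenience (zero_init_kwargs is not part of DeepSpeed); it reads the ZeRO stage and parameter-offload settings from a configuration dictionary and builds keyword arguments for deepspeed.zero.Init. The configuration values are made up for the example, and whether remote_device should be passed explicitly alongside the config is an assumption to verify against the DeepSpeed version in use.

```python
def zero_init_kwargs(ds_config):
    """Hypothetical helper: derive zero.Init keyword arguments from a config dict."""
    zero_cfg = ds_config.get("zero_optimization", {})
    offload = zero_cfg.get("offload_param", {})
    kwargs = {
        "config_dict_or_path": ds_config,
        # Partition at construction time only for ZeRO Stage 3.
        "enabled": zero_cfg.get("stage", 0) == 3,
    }
    device = offload.get("device")
    if device in ("cpu", "nvme"):
        kwargs["remote_device"] = device
        kwargs["pin_memory"] = bool(offload.get("pin_memory", False))
    return kwargs


# Example configuration (values chosen for illustration).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

kwargs = zero_init_kwargs(ds_config)
print(kwargs["enabled"], kwargs["remote_device"], kwargs["pin_memory"])

# Intended use (requires DeepSpeed and a distributed environment):
# with deepspeed.zero.Init(**kwargs):
#     model = MyLargeModel()
```

Keeping the stage check in one place avoids the model-construction code and the training configuration drifting apart.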

Related Pages

Implements Principle

Principle:Deepspeedai_DeepSpeed_ZeRO_Parameter_Partitioning
