
Implementation:Deepspeedai DeepSpeed ZeRO Init

From Leeroopedia


Knowledge Sources
Domains Distributed_Training, Memory_Optimization, Model_Parallelism
Last Updated 2026-02-09 00:00 GMT

Overview

deepspeed.zero.Init is the DeepSpeed context manager that partitions model parameters while the model is being constructed, for use with ZeRO Stage 3 training.

Description

The deepspeed.zero.Init context manager intercepts torch.nn.Module construction to automatically partition parameters across data-parallel ranks. It inherits from InsertPostInitMethodToModuleSubClasses, which monkey-patches the __init__ method of all nn.Module subclasses so that every new parameter created inside the context is immediately partitioned (sharded) across the data-parallel group.
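The patching mechanism described above can be illustrated with a minimal pure-Python sketch. It is not DeepSpeed's code: Module, Linear, PostInitPatcher, and mark_partitioned are hypothetical stand-ins, and the real class handles nested modules, tensor creation, and communication that this sketch ignores.

```python
import functools


class Module:
    """Hypothetical stand-in for torch.nn.Module."""

    def __init__(self):
        self.params = {}


class Linear(Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.params["weight"] = [0.0] * (n_in * n_out)


def all_subclasses(cls):
    """Recursively collect every subclass of cls."""
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(all_subclasses(sub))
    return found


class PostInitPatcher:
    """Context manager that calls post_init after each patched __init__."""

    def __init__(self, base, post_init):
        self.base = base
        self.post_init = post_init
        self._originals = {}

    def __enter__(self):
        for cls in all_subclasses(self.base):
            if "__init__" not in cls.__dict__:
                continue  # class inherits __init__; nothing to wrap here
            original = cls.__dict__["__init__"]
            self._originals[cls] = original
            cls.__init__ = self._wrap(original)
        return self

    def __exit__(self, *exc):
        # Restore the original constructors so code outside is unaffected.
        for cls, original in self._originals.items():
            cls.__init__ = original
        self._originals.clear()
        return False

    def _wrap(self, original):
        post_init = self.post_init

        @functools.wraps(original)
        def wrapper(module, *args, **kwargs):
            original(module, *args, **kwargs)
            post_init(module)  # in DeepSpeed, this is where partitioning happens

        return wrapper


def mark_partitioned(module):
    """Placeholder for the real partitioning step."""
    module.partitioned = True


with PostInitPatcher(Module, mark_partitioned):
    inside = Linear(2, 3)   # constructed while patched
outside = Linear(2, 3)      # constructed after the patch is undone
```

Only the module built inside the context receives the post-init hook; the constructor is restored on exit, which is why models created outside deepspeed.zero.Init are left unpartitioned.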

Key features:

  • Automatic sharding: Parameters are partitioned as soon as they are created, so the full model never materializes on a single GPU
  • CPU/NVMe offloading: Supports offloading parameter shards to CPU or NVMe via the remote_device parameter
  • Quantized weights: Supports quantized weight storage via zero_quantized_weights for further memory savings
  • Parameter persistence: Small parameters below param_persistence_threshold can be kept on all ranks to avoid communication overhead
  • Configurable dtype: Parameters can be converted to a specified dtype (e.g., fp16, bf16) during partitioning
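As a rough illustration of the parameter persistence feature, the sketch below picks which parameters would stay unpartitioned. The selection rule is an assumption made for illustration (a per-parameter element limit plus a cap on the total persisted elements, mirroring the param_persistence_threshold and model_persistence_threshold names); select_persistent and the example sizes are hypothetical and not taken from DeepSpeed.

```python
def select_persistent(param_sizes, param_threshold, model_threshold):
    """Return (names, total_elements) of parameters assumed to persist.

    Assumed rule: a parameter persists when it has at most
    param_threshold elements and adding it keeps the running total of
    persisted elements within model_threshold.
    """
    persisted = []
    total = 0
    for name, numel in param_sizes.items():
        if total + numel > model_threshold:
            continue  # would exceed the model-wide budget
        if numel <= param_threshold:
            persisted.append(name)
            total += numel
    return persisted, total


# Hypothetical parameter sizes (element counts).
sizes = {
    "embed.weight": 50_000_000,
    "ln1.weight": 8192,
    "ln1.bias": 8192,
    "fc.bias": 32768,
}
names, total = select_persistent(sizes, param_threshold=100_000, model_threshold=40_000)
```

With these made-up numbers, the two layer-norm tensors persist, while the large embedding and the bias that would overflow the budget are left to be partitioned.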

Usage

Wrap model instantiation with the Init context manager when using ZeRO Stage 3. To partition a model that has already been constructed, pass it through the module argument instead.

Code Reference

Source Location

  • Repository: DeepSpeed
  • File: deepspeed/runtime/zero/partition_parameters.py
  • Lines: 884-909 (class definition and __init__)

Signature

class Init(InsertPostInitMethodToModuleSubClasses):
    param_id = 0
    param_persistence_threshold = get_config_default(
        DeepSpeedZeroConfig, "param_persistence_threshold"
    )
    model_persistence_threshold = get_config_default(
        DeepSpeedZeroConfig, "model_persistence_threshold"
    )
    num_persisted_parameters = 0
    num_persisted_elements = 0
    apply_param_persistence = False

    def __init__(self,
                 module=None,
                 data_parallel_group=None,
                 mem_efficient_linear=True,
                 remote_device=None,
                 pin_memory=False,
                 config_dict_or_path=None,
                 config=None,
                 enabled=True,
                 dtype=None,
                 mpu=None,
                 zero_param_parallel_group=None,
                 zero_quantized_weights=False,
                 zero_quantized_nontrainable_weights=False,
                 sequence_data_parallel_group=None,
                 param_swapper=None,
                 tensor_overrides=DEFAULT_TENSOR_OVERRIDES):

Import

from deepspeed.runtime.zero.partition_parameters import Init

# Or via the public API:
import deepspeed
deepspeed.zero.Init

I/O Contract

Inputs

Name | Type | Required | Description
module | torch.nn.Module | No | If provided, partition the model as if it was constructed in the context
data_parallel_group | process group | No | The group of processes to partition among; defaults to all processes
mem_efficient_linear | bool | No | Replace torch.nn.functional.linear with a memory-efficient implementation (default: True)
remote_device | str | No | Initial device for model weights: cpu, nvme, or None for GPU
pin_memory | bool | No | Pin CPU memory for faster transfers (default: False)
config_dict_or_path | Union[str, dict] | No | DeepSpeed configuration file path or dictionary
config | DeepSpeedConfig | No | Pre-parsed DeepSpeed configuration object
enabled | bool | No | Enable or disable partitioning (default: True)
dtype | torch.dtype | No | Data type for parameters (e.g., torch.float16, torch.bfloat16)
mpu | object | No | Model parallelism unit
zero_quantized_weights | bool | No | Enable quantized weight storage (default: False)
zero_quantized_nontrainable_weights | bool | No | Enable quantized storage for non-trainable weights (default: False)

Outputs

Name | Type | Description
(context effect) | torch.nn.Module | Model with parameters automatically partitioned across data-parallel ranks; each rank holds 1/N of each parameter, where N is the number of ranks in the data-parallel group
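The 1/N share per rank can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not DeepSpeed's partitioning code: shard_plan and BYTES_PER_ELEMENT are hypothetical names, and it assumes the parameter is flattened and padded so that it divides evenly across ranks. Actual alignment and padding may differ.

```python
# Bytes per element for the dtypes mentioned on this page.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def shard_plan(numel, world_size, dtype="fp16"):
    """Estimate the per-rank shard of one flattened parameter."""
    shard_numel = -(-numel // world_size)        # ceiling division
    padding = shard_numel * world_size - numel   # elements added to even out shards
    return {
        "shard_numel": shard_numel,
        "padding": padding,
        "bytes_per_rank": shard_numel * BYTES_PER_ELEMENT[dtype],
    }


# A hypothetical 8192 x 8192 weight plus 3 extra elements, split over 8 ranks.
plan = shard_plan(numel=8192 * 8192 + 3, world_size=8, dtype="bf16")
```

Under these assumptions each rank stores about 16 MiB for this tensor instead of the roughly 128 MiB the full bf16 parameter would need.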

Usage Examples

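import deepspeed
import torch

# MyLargeModel, MyVeryLargeModel, and MyModel are placeholders for
# user-defined torch.nn.Module subclasses.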

# Basic usage: partition during model construction
with deepspeed.zero.Init():
    model = MyLargeModel(hidden_size=8192, num_layers=96)

# With CPU offloading for extremely large models
with deepspeed.zero.Init(remote_device="cpu", pin_memory=True):
    model = MyVeryLargeModel()

# With explicit config
with deepspeed.zero.Init(config_dict_or_path="ds_config.json"):
    model = MyLargeModel()

# Partition a pre-existing module
model = MyLargeModel()
with deepspeed.zero.Init(module=model):
    pass  # model is now partitioned

# Conditionally enable based on ZeRO stage
zero_stage = 3
with deepspeed.zero.Init(enabled=(zero_stage == 3)):
    model = MyModel()  # Only partitioned if stage 3
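Rather than hard-coding the stage, the enabled flag can be derived from the same configuration passed to DeepSpeed. The following is a minimal sketch: zero_stage_from_config is a hypothetical helper (not part of DeepSpeed), and the dictionary is an illustrative configuration that uses the zero_optimization, stage, and offload_param keys.

```python
import json


def zero_stage_from_config(config_dict_or_path):
    """Read the ZeRO stage from a config dict or a JSON file path."""
    if isinstance(config_dict_or_path, dict):
        config = config_dict_or_path
    else:
        with open(config_dict_or_path) as f:
            config = json.load(f)
    # Treat a missing section or key as stage 0 (ZeRO disabled).
    return config.get("zero_optimization", {}).get("stage", 0)


# Illustrative configuration; the values are assumptions, not recommendations.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

use_zero_init = zero_stage_from_config(ds_config) == 3

# Intended use (requires DeepSpeed and a distributed launch):
# with deepspeed.zero.Init(config_dict_or_path=ds_config, enabled=use_zero_init):
#     model = MyModel()
```

Keeping the stage in one place avoids a mismatch between the config given to the engine and the condition used to enable Init.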

Related Pages

Implements Principle
