

Implementation:Microsoft DeepSpeedExamples DeepSpeed Initialize SuperOffload

From Leeroopedia
Revision as of 15:40, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_DeepSpeedExamples_DeepSpeed_Initialize_SuperOffload.md)


Metadata

Field Value
Page Type Implementation
Title DeepSpeed_Initialize_SuperOffload
Repository Microsoft/DeepSpeedExamples
Type Wrapper Doc (wraps deepspeed.initialize)
Code Reference File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 253-259
Import import deepspeed; from deepspeed.ops.adam import DeepSpeedCPUAdam
Related Principle Principle:Microsoft_DeepSpeedExamples_ZeRO3_CPU_Offload_Training

Overview

Concrete usage of deepspeed.initialize() with ZeRO-3 CPU offloading and DeepSpeedCPUAdam for SuperOffload fine-tuning. This implementation wraps the standard DeepSpeed initialization call and the creation of the CPU-optimized Adam optimizer.

Function: create_optimizer

Signature

def create_optimizer(model: AutoModelForCausalLM) -> Any:

Code Reference: File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 161-168

Description

Creates a DeepSpeedCPUAdam optimizer instance for all model parameters. This optimizer is specifically designed for CPU-offloaded training and provides highly optimized Adam updates using SIMD instructions on the CPU.

Implementation

def create_optimizer(model: AutoModelForCausalLM) -> Any:
    from deepspeed.ops.adam import DeepSpeedCPUAdam
    optimizer = DeepSpeedCPUAdam(
        model.parameters(),
        lr=DEFAULT_OPTIMIZER_LR,
        betas=DEFAULT_OPTIMIZER_BETAS
    )
    return optimizer

I/O Contract

Parameter Type Description
model AutoModelForCausalLM The loaded HuggingFace model whose parameters will be optimized

Returns: DeepSpeedCPUAdam optimizer instance.

Constants used:

  • DEFAULT_OPTIMIZER_LR = 0.001 (overridden by DeepSpeed config at runtime)
  • DEFAULT_OPTIMIZER_BETAS = (0.9, 0.999)
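
To make the role of lr and betas concrete, the following is a minimal pure-Python sketch of one bias-corrected Adam update on a single scalar parameter. It is not DeepSpeed's vectorized CPU kernel: the helper name adam_step and the eps value of 1e-8 are assumptions chosen for illustration, and weight decay is left out.

```python
import math

DEFAULT_OPTIMIZER_LR = 0.001
DEFAULT_OPTIMIZER_BETAS = (0.9, 0.999)


def adam_step(param, grad, m, v, step, lr=DEFAULT_OPTIMIZER_LR,
              betas=DEFAULT_OPTIMIZER_BETAS, eps=1e-8):
    """Hypothetical helper: return (param, m, v) after one Adam update.

    `step` is the 1-based update count used for bias correction.
    """
    beta1, beta2 = betas
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for the zero-initialized averages.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First update of a parameter at 1.0 with gradient 0.5.
new_param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
```

On the first step the bias-corrected update has a magnitude close to lr whatever the gradient scale, so new_param lands near 0.999 here.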

DeepSpeed Initialization Call

Code Reference

File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 253-259

Implementation

# Initialize DeepSpeed
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    training_data=tokenized_dataset,
    collate_fn=default_data_collator
)

I/O Contract

Inputs:

Parameter Type Description
args argparse.Namespace Parsed command-line arguments (must include --deepspeed_config pointing to the JSON config file)
model AutoModelForCausalLM The loaded and configured model (with gradient checkpointing enabled)
optimizer DeepSpeedCPUAdam The CPU-optimized Adam optimizer
training_data Dataset The tokenized HuggingFace Dataset (used by DeepSpeed for distributed sampling)
collate_fn Callable Data collation function (default_data_collator from transformers)

Outputs:

Return Value Type Description
model_engine DeepSpeedEngine The wrapped model with distributed training capabilities
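optimizer Optimizer The optimizer as managed by the engine; with ZeRO enabled this is typically DeepSpeed's ZeRO wrapper around the DeepSpeedCPUAdam instance rather than the bare optimizer
train_dataloader DataLoader The distributed-aware DataLoader that DeepSpeed builds from training_data and collate_fn
_ LRScheduler Learning rate scheduler; None when no scheduler is passed or configured (unused in this implementation, discarded)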

DeepSpeed JSON Config Structure

The deepspeed.initialize() call reads the configuration from the JSON file specified by --deepspeed_config. For SuperOffload, the config has this structure:

{
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true,
            "ratio": 0.90,
            "super_offload": true,
            "cpuadam_cores_perc": 0.90
        }
    },
    "wall_clock_breakdown": true
}
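
DeepSpeed expects train_batch_size to equal the per-GPU micro batch size times gradient_accumulation_steps times the number of data-parallel ranks. The sketch below checks that relationship for a trimmed copy of the config above; the helper name micro_batch_per_gpu and the example world sizes are assumptions for illustration, not part of the script.

```python
import json

# Trimmed copy of the config above; only the keys used here are kept.
CONFIG_TEXT = """
{
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {"stage": 3, "reduce_bucket_size": 4e8}
}
"""


def micro_batch_per_gpu(config, world_size):
    """Hypothetical helper: derive the per-GPU micro batch size.

    Relies on train_batch_size ==
    micro_batch_per_gpu * gradient_accumulation_steps * world_size.
    """
    total = config["train_batch_size"]
    accum = config.get("gradient_accumulation_steps", 1)
    denom = accum * world_size
    if total % denom != 0:
        raise ValueError(
            f"train_batch_size={total} is not divisible by "
            f"gradient_accumulation_steps * world_size = {denom}"
        )
    return total // denom


config = json.loads(CONFIG_TEXT)
single_gpu = micro_batch_per_gpu(config, world_size=1)
four_gpus = micro_batch_per_gpu(config, world_size=4)
```

With these values a single GPU processes 4 samples per step, while four GPUs process 1 sample each. Python's json module reads the literal 4e8 as the float 400000000.0.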

What deepspeed.initialize() Does

The deepspeed.initialize() call performs the following operations:

  1. Distributed initialization -- Sets up the distributed process group (NCCL backend) if not already initialized.
  2. Model wrapping -- Wraps the model in a DeepSpeedEngine that intercepts forward, backward, and step calls.
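  3. Parameter partitioning -- Partitions all model parameters across GPUs according to ZeRO Stage 3 rules. Each GPU retains roughly 1/N of each parameter tensor, where N is the number of data-parallel ranks, and gathers the remaining shards on demand during forward and backward.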
  4. Optimizer state offloading -- Moves optimizer states (momentum, variance for Adam) to CPU RAM with pinned memory for efficient transfers.
  5. DataLoader creation -- Creates a distributed-aware DataLoader backed by a DistributedSampler so that each rank processes a different slice of the data.
  6. Communication setup -- Configures reduce buckets and sub-groups for efficient gradient reduction.
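
Steps 3 and 5 can be pictured with a small pure-Python sketch. It is a conceptual model only, not DeepSpeed's internal layout: the function names are hypothetical, parameters are plain lists, and the sampler is shown without shuffling.

```python
import math


def shard_flat_parameter(values, world_size):
    """Split a flat parameter into world_size equal shards, zero-padding the tail."""
    shard_len = math.ceil(len(values) / world_size)
    padded = list(values) + [0.0] * (shard_len * world_size - len(values))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def sampler_indices(dataset_len, world_size, rank):
    """Dataset indices one rank would see from an unshuffled distributed sampler."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    # Pad by repeating indices from the start so every rank gets per_rank samples.
    indices += indices[: total - dataset_len]
    # Each rank takes every world_size-th index, offset by its rank.
    return indices[rank:total:world_size]


# A 6-element parameter spread over 4 ranks: each rank holds 2 values.
shards = shard_flat_parameter([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], world_size=4)

# A 10-sample dataset spread over 4 ranks: each rank sees 3 indices.
rank_views = [sampler_indices(10, world_size=4, rank=r) for r in range(4)]
```

Every rank ends up with the same amount of work: the last parameter shard is pure padding, and the two trailing sampler slots reuse indices 0 and 1.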

Full Initialization Sequence in main()

The complete initialization sequence as it appears in the main() function (Lines 239-263):

# Step 1: Load tokenizer and model
tokenizer = load_tokenizer(args.model_name, logger)
model = load_model(args.model_name, args.attn_implementation, logger)

# Step 2: Optional MoE leaf module configuration
if args.leaf_module:
    from deepspeed.utils import set_z3_leaf_modules
    logger.debug(f"Setting leaf_module to: {args.leaf_module}")
    set_z3_leaf_modules(model, [args.leaf_module])

# Step 3: Configure model for training
setup_model_training(model, args.activation_checkpointing, logger)

# Step 4: Create CPU-optimized optimizer
optimizer = create_optimizer(model)

# Step 5: Load and preprocess dataset
tokenized_dataset, train_dataloader = load_and_preprocess_dataset(
    args.dataset_name, args.dataset_percentage, tokenizer, args.max_length, logger
)

# Step 6: Initialize DeepSpeed engine
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    training_data=tokenized_dataset,
    collate_fn=default_data_collator
)

# Step 7: Re-initialize logger with distributed rank
logger = setup_logger(rank=dist.get_rank(), log_level=args.log_level)

Usage Example

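import argparse
import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam
from transformers import AutoModelForCausalLM, default_data_collator

# Parse --deepspeed_config (registered by add_config_arguments) and the
# --local_rank flag that the deepspeed launcher passes to each process
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()

# Assume model and tokenized_dataset are already loaded
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# args.deepspeed_config must point to the JSON config
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    training_data=tokenized_dataset,
    collate_fn=default_data_collator
)

# model_engine is now ready for training
model_engine.train()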
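
Once deepspeed.initialize() has returned, a training step usually goes through the engine rather than through the raw model and optimizer. The following is a minimal sketch of such a loop, not the script's actual training loop: the function name train_one_epoch is hypothetical, and it assumes each batch is a dict of tensors that includes labels so that the model output carries a loss.

```python
def train_one_epoch(model_engine, train_dataloader):
    """Hypothetical helper: run one pass over the data, return per-step losses."""
    model_engine.train()
    losses = []
    for batch in train_dataloader:
        # Move each tensor to the device owned by this rank's engine.
        batch = {name: tensor.to(model_engine.device)
                 for name, tensor in batch.items()}
        outputs = model_engine(**batch)  # forward pass through the wrapped model
        loss = outputs.loss              # assumes the batch contains labels
        model_engine.backward(loss)      # the engine runs the backward pass
        model_engine.step()              # the engine applies the optimizer update
        losses.append(loss.item())
    return losses
```

Calling model_engine.backward() and model_engine.step() instead of loss.backward() and optimizer.step() lets the engine handle gradient accumulation, partitioned gradients, and the offloaded optimizer update.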
