Principle:Microsoft DeepSpeedExamples ZeRO3 CPU Offload Training

From Leeroopedia
Revision as of 17:17, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Microsoft_DeepSpeedExamples_ZeRO3_CPU_Offload_Training.md)


Metadata

Field | Value
Page Type | Principle
Title | ZeRO3_CPU_Offload_Training
Repository | Microsoft/DeepSpeedExamples
Sources | Paper: ZeRO https://arxiv.org/abs/1910.02054 ; Paper: ZeRO-Offload https://arxiv.org/abs/2101.06840
Domains | Distributed_Training, Memory_Optimization
Status | Active
Related Implementation | Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Initialize_SuperOffload

Overview

A distributed training technique that partitions all model states (parameters, gradients, and optimizer states) across GPUs and offloads optimizer states, and optionally parameters, to CPU memory, enabling fine-tuning of models whose states exceed total GPU memory.

Description

ZeRO (Zero Redundancy Optimizer) Stage 3 with CPU offloading is the core distributed training strategy used by SuperOffload. It addresses the fundamental memory bottleneck of fine-tuning large language models by combining two complementary techniques:

  • ZeRO Stage 3 partitioning -- Partitions parameters, gradients, and optimizer states across all available GPUs. Each GPU holds only 1/N of each model state (where N is the number of GPUs), eliminating redundant copies.
  • CPU offloading -- Moves optimizer states (and optionally parameters) to CPU RAM, using the CPU as overflow memory. This is particularly effective on Superchip architectures (GH200, GB200) where CPU-GPU bandwidth is exceptionally high.

The deepspeed.initialize() call wraps the model into a DeepSpeedEngine that manages all distributed operations transparently. From the user's perspective, the model behaves like a standard PyTorch model, but all parameter gathering, gradient reduction, and optimizer stepping happen across devices automatically.

DeepSpeedCPUAdam

In CPU-offloaded training, gradients are transferred to the CPU, the parameter update is computed there, and the updated parameters are transferred back to the GPU, so the CPU optimizer step sits on the critical path. The standard PyTorch Adam optimizer is not efficient enough on CPU for this role. DeepSpeedCPUAdam is a highly optimized CPU Adam implementation that:

  • Uses SIMD (AVX2/AVX-512) instructions for vectorized math on CPU
  • Supports asynchronous parameter updates overlapped with GPU computation
  • Allocates a configurable percentage of CPU cores (cpuadam_cores_perc) for optimizer work
  • Reduces CPU optimizer time by up to 5-6x compared to standard PyTorch Adam on CPU
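
To make the per-parameter work concrete, the sketch below applies the textbook Adam update to a single scalar weight in plain Python. It is not DeepSpeed's implementation (which vectorizes this arithmetic over whole tensors); the function name `adam_step` and the example values are illustrative assumptions. Each weight carries three FP32 values on the CPU side: the master weight and the two moment estimates.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Apply one textbook Adam update to a single scalar parameter.

    Returns the updated (param, m, v). `step` is 1-based.
    """
    beta1, beta2 = betas
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment (variance)
    m_hat = m / (1.0 - beta1 ** step)             # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# One update of a single FP32 master weight (illustrative values).
new_param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly `lr`.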

Theoretical Basis

Memory Analysis

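For a model with P parameters trained on N GPUs, the per-device memory breakdown is shown below. The ~16P total assumes mixed-precision training with 2 bytes per parameter and per gradient, plus 12 bytes of FP32 Adam state (master weight, momentum, variance) per parameter: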

Component | Standard Data Parallel | ZeRO Stage 3 | ZeRO-3 + CPU Offload
Parameters | P * bytes_per_param | P / N * bytes_per_param | Gathered on demand (transient)
Gradients | P * bytes_per_param | P / N * bytes_per_param | P / N * bytes_per_param (transient)
Optimizer States | P * 12 bytes (Adam FP32) | P / N * 12 bytes | Offloaded to CPU RAM
GPU Memory | ~16P bytes | ~16P/N bytes | Activations + temp buffers only

With ZeRO-3 + CPU offload, GPU memory per device is dominated by activation memory plus temporary communication buffers. The FP32 optimizer states reside in CPU RAM; the 16-bit parameter partitions reside there as well only when parameter offload is enabled in addition to optimizer offload.
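
The table's arithmetic can be checked with a small estimator. This is a rough sketch under stated assumptions: it counts model states only (no activations, buffers, or fragmentation), it assumes that only optimizer states are offloaded, and it conservatively keeps the 16-bit parameter and gradient partitions on the GPU. The function name and the 7-billion-parameter example are hypothetical.

```python
def per_gpu_model_state_bytes(num_params, num_gpus, bytes_per_param=2, optimizer_bytes=12):
    """Approximate per-GPU bytes of model states under three strategies."""
    per_param_total = 2 * bytes_per_param + optimizer_bytes  # params + grads + Adam state
    data_parallel = num_params * per_param_total              # every GPU holds a full copy
    zero3 = data_parallel / num_gpus                          # everything sharded 1/N
    # Assumption: optimizer states live in CPU RAM; 16-bit params and grads stay sharded on GPU.
    zero3_offload_gpu = num_params * 2 * bytes_per_param / num_gpus
    zero3_offload_cpu = num_params * optimizer_bytes / num_gpus
    return {
        "data_parallel_gpu": data_parallel,
        "zero3_gpu": zero3,
        "zero3_offload_gpu": zero3_offload_gpu,
        "zero3_offload_cpu": zero3_offload_cpu,
    }


# Hypothetical example: a 7-billion-parameter model on 4 GPUs.
estimate = per_gpu_model_state_bytes(num_params=7_000_000_000, num_gpus=4)
for name, value in estimate.items():
    print(f"{name}: {value / 1e9:.1f} GB")
```

The GPU figure for the offload case is an upper bound on model-state memory; in practice gradients are transient, so the steady-state footprint can be lower.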

Communication Pattern

ZeRO Stage 3 introduces the following communication during training:

  • All-gather before forward/backward -- Each GPU gathers the full parameter tensor it needs for the current layer from all other GPUs. Parameters are gathered layer-by-layer (or sub-group by sub-group) and released after use.
  • Reduce-scatter after backward -- Gradients are reduced across GPUs and each GPU retains only its 1/N partition of the reduced gradient.
  • CPU-GPU transfers -- Gradient partitions are copied to CPU, DeepSpeedCPUAdam updates the optimizer states held in CPU RAM, and the updated parameters are written back to the GPU partition.
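
The two collectives above can be illustrated with a single-process simulation over plain Python lists. This is a teaching sketch, not how DeepSpeed implements them: real training uses tensors and a communication backend, and the zero-padding of the last shard is a simplification chosen here. All function names are hypothetical.

```python
def partition(flat, num_ranks):
    """Split a flat list into num_ranks equal shards, zero-padding the tail."""
    shard_len = -(-len(flat) // num_ranks)  # ceiling division
    padded = flat + [0.0] * (shard_len * num_ranks - len(flat))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(num_ranks)]


def all_gather(shards):
    """Every rank receives the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def reduce_scatter(per_rank_grads, num_ranks):
    """Sum gradients elementwise across ranks; rank r keeps only shard r."""
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    return partition(summed, num_ranks)


params = [1.0, 2.0, 3.0, 4.0, 5.0]
shards = partition(params, num_ranks=2)        # each rank stores 1/N of the weights
gathered = all_gather(shards)                  # full (padded) weights, held transiently
local_grads = [[0.1] * 6, [0.2] * 6]           # each rank's full-size local gradient
grad_shards = reduce_scatter(local_grads, 2)   # each rank keeps 1/N of the summed gradient
```

After `reduce_scatter`, each rank holds exactly the gradient slice that matches its parameter shard, which is what the optimizer step needs.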

Assuming 16-bit parameters and gradients (2 bytes per element), the communication volume per GPU per training step is approximately:

Forward:  ~2P bytes (all-gather parameters, layer by layer)
Backward: ~2P bytes (all-gather parameters) + ~2P bytes (reduce-scatter gradients)
Total:    ~6P bytes per step
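
The figures above can be reproduced with a few lines of arithmetic. The sketch below assumes 2 bytes per element and also reports the ~4P bytes of a standard data-parallel all-reduce for comparison, following the ZeRO paper's analysis; the function name and the 7-billion-parameter example are hypothetical.

```python
def zero3_comm_bytes_per_step(num_params, bytes_per_element=2):
    """Approximate per-GPU communication volume for one ZeRO-3 training step."""
    all_gather = num_params * bytes_per_element      # one full pass over the parameters
    reduce_scatter = num_params * bytes_per_element  # one full pass over the gradients
    forward = all_gather                             # gather parameters for forward
    backward = all_gather + reduce_scatter           # re-gather parameters, then reduce gradients
    return {
        "forward": forward,
        "backward": backward,
        "total": forward + backward,
        # Standard data parallelism all-reduces gradients: about 2x the gradient size.
        "data_parallel_allreduce": 2 * num_params * bytes_per_element,
    }


# Hypothetical example: a 7-billion-parameter model.
volume = zero3_comm_bytes_per_step(7_000_000_000)
overhead = volume["total"] / volume["data_parallel_allreduce"]
```

Under these assumptions ZeRO-3 moves about 1.5x the data of plain data parallelism per step.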

SuperOffload Optimization

SuperOffload extends ZeRO-3 CPU offloading with:

  • Overlapped CPU-GPU execution -- CPU optimizer steps execute concurrently with GPU forward/backward computation.
  • Configurable offload ratio -- The ratio parameter (0.0-1.0) controls what fraction of optimizer work is performed on CPU vs. GPU.
  • CPU core allocation -- The cpuadam_cores_perc parameter controls what percentage of CPU cores are dedicated to the CPUAdam optimizer.

DeepSpeed Configuration

The ZeRO-3 + CPU offload configuration is specified in a JSON config:

{
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true,
            "ratio": 0.90,
            "super_offload": true,
            "cpuadam_cores_perc": 0.90
        }
    },
    "wall_clock_breakdown": true
}

Configuration Parameters

Parameter | Description | Impact
stage: 3 | Full parameter, gradient, and optimizer state partitioning | Maximum memory savings
overlap_comm | Overlap communication with computation | Set to false for SuperOffload (different overlap strategy)
reduce_bucket_size | Size of gradient reduce-scatter buckets (in elements) | Larger = fewer communications but more memory
sub_group_size | Tile size (in elements) for processing parameters during the optimizer step | Controls how many parameters are updated at a time
offload_optimizer.device | Where to store optimizer states | cpu for CPU offloading
offload_optimizer.pin_memory | Use pinned (page-locked) memory for CPU-GPU transfers | true for faster transfers
offload_optimizer.ratio | Fraction of optimizer work on CPU | 0.80-0.90 typical
offload_optimizer.super_offload | Enable SuperOffload engine | true to activate
offload_optimizer.cpuadam_cores_perc | Fraction of CPU cores allocated for CPUAdam | 0.90 typical
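
The same configuration can be built as a Python dictionary, which makes it easy to check the batch-size relationship DeepSpeed expects: train_batch_size equals the per-GPU micro-batch size times gradient_accumulation_steps times the number of GPUs. The helper below is a hypothetical sketch; its name and arguments are not part of the repository, and the SuperOffload keys are copied from the JSON above rather than verified independently.

```python
import json


def build_zero3_offload_config(train_batch_size, world_size, grad_accum_steps=1,
                               ratio=0.90, cores_perc=0.90):
    """Return a ZeRO-3 + CPU offload config dict with a consistent batch size."""
    micro_batch, remainder = divmod(train_batch_size, grad_accum_steps * world_size)
    if remainder or micro_batch == 0:
        raise ValueError(
            "train_batch_size must be a positive multiple of "
            "gradient_accumulation_steps * world_size"
        )
    return {
        "train_batch_size": train_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": False,
            "reduce_bucket_size": 4e8,
            "sub_group_size": 4e8,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
                "ratio": ratio,
                "super_offload": True,
                "cpuadam_cores_perc": cores_perc,
            },
        },
        "wall_clock_breakdown": True,
    }


# With train_batch_size=4 on 4 GPUs, each GPU processes a micro-batch of 1.
config = build_zero3_offload_config(train_batch_size=4, world_size=4)
config_json = json.dumps(config, indent=4)
```

The resulting string can be written to a file and referenced as the DeepSpeed config, or the dictionary can be passed directly where the API accepts one.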

Initialization Pattern

The initialization follows this sequence:

  1. Create a DeepSpeedCPUAdam optimizer with the model parameters.
  2. Call deepspeed.initialize() with the model, optimizer, training data, and collate function.
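  3. The returned model_engine wraps the original model with distributed training capabilities.

import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam
# Assumption: default_data_collator is the Hugging Face transformers collator.
from transformers import default_data_collator

# Assumes `args`, `model`, and `tokenized_dataset` are defined earlier, and that
# `args.deepspeed_config` points to the JSON config shown above.

# Create CPU-optimized Adam optimizer
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# Initialize DeepSpeed engine; the fourth return value is the LR scheduler (unused here)
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    training_data=tokenized_dataset,
    collate_fn=default_data_collator,
)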

Usage Pattern

  1. Configure the DeepSpeed JSON with ZeRO Stage 3 and CPU offloading.
  2. Create a DeepSpeedCPUAdam optimizer.
  3. Call deepspeed.initialize() to wrap the model.
  4. Use the returned model_engine for forward/backward/step operations.
  5. The engine transparently handles all distributed communication and CPU offloading.
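
Step 4 can be sketched as a short loop. The engine methods used here (calling the engine for the forward pass, then backward() and step()) are the standard DeepSpeedEngine interface, but the rest is an assumption for illustration: the model is taken to return a scalar loss directly, and `RecordingEngine` is a hypothetical stand-in that lets the loop run without GPUs. With a Hugging Face model, the forward call would typically be `model_engine(**batch).loss` after moving the batch to the engine's device.

```python
def train_one_epoch(model_engine, dataloader):
    """Run one pass over the dataloader and return the mean loss."""
    total_loss, num_steps = 0.0, 0
    for batch in dataloader:
        loss = model_engine(batch)    # forward; parameters are gathered on demand
        model_engine.backward(loss)   # backward; gradients are reduce-scattered
        model_engine.step()           # optimizer step (runs on CPU when offloaded)
        total_loss += float(loss)
        num_steps += 1
    return total_loss / max(num_steps, 1)


class RecordingEngine:
    """Hypothetical stand-in for a DeepSpeedEngine, used only to dry-run the loop."""

    def __init__(self):
        self.calls = []

    def __call__(self, batch):
        self.calls.append("forward")
        return sum(batch) / len(batch)  # pretend the batch mean is the loss

    def backward(self, loss):
        self.calls.append("backward")

    def step(self):
        self.calls.append("step")


engine = RecordingEngine()
mean_loss = train_one_epoch(engine, [[1.0, 3.0], [2.0, 4.0]])
```

Note that backward() and step() are called on the engine rather than on the loss tensor and optimizer, which is what lets the engine schedule communication and offloading.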
