Principle:Microsoft DeepSpeedExamples ZeRO3 CPU Offload Training

Metadata

Field	Value
Page Type	Principle
Title	ZeRO3_CPU_Offload_Training
Repository	Microsoft/DeepSpeedExamples
Sources	Paper: ZeRO https://arxiv.org/abs/1910.02054 ; Paper: ZeRO-Offload https://arxiv.org/abs/2101.06840
Domains	Distributed_Training, Memory_Optimization
Status	Active
Related Implementation	Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Initialize_SuperOffload

Overview

A distributed training technique that partitions all model states (parameters, gradients, and optimizer states) across GPUs and offloads optimizer states, and optionally parameters, to CPU memory, enabling fine-tuning of models that exceed total GPU memory.

Description

ZeRO (Zero Redundancy Optimizer) Stage 3 with CPU offloading is the core distributed training strategy used by SuperOffload. It addresses the fundamental memory bottleneck of fine-tuning large language models by combining two complementary techniques:

ZeRO Stage 3 partitioning -- Partitions parameters, gradients, and optimizer states across all available GPUs. Each GPU holds only 1/N of each model state (where N is the number of GPUs), eliminating redundant copies.
CPU offloading -- Moves optimizer states (and optionally parameters) to CPU RAM, using the CPU as overflow memory. This is particularly effective on Superchip architectures (GH200, GB200) where CPU-GPU bandwidth is exceptionally high.

The deepspeed.initialize() call wraps the model into a DeepSpeedEngine that manages all distributed operations transparently. From the user's perspective, the model behaves like a standard PyTorch model, but all parameter gathering, gradient reduction, and optimizer stepping happen across devices automatically.

DeepSpeedCPUAdam

With CPU offloading, the optimizer step runs on the CPU: gradients are transferred to CPU, updates are computed there, and updated parameters are transferred back to GPU, so the CPU-side update sits on the critical path of every training step. The standard PyTorch Adam optimizer is too slow on CPU for this role. DeepSpeedCPUAdam is a highly optimized CPU Adam implementation that:

Uses SIMD (AVX2/AVX-512) instructions for vectorized math on CPU
Supports asynchronous parameter updates overlapped with GPU computation
Allocates a configurable percentage of CPU cores (cpuadam_cores_perc) for optimizer work
Reduces CPU optimizer time by up to 5-6x compared to standard PyTorch Adam on CPU
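
For reference, the per-element update that any Adam implementation has to compute can be sketched in plain Python. This is the textbook Adam rule with bias correction, not DeepSpeed's vectorized kernel; the function name `adam_step` and its default values are illustrative assumptions.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Apply one textbook Adam update to lists of floats (illustrative only).

    Returns the new (param, m, v) lists; `step` is the 1-based step count.
    """
    beta1, beta2 = betas
    bias1 = 1.0 - beta1 ** step  # bias correction for the first moment
    bias2 = 1.0 - beta2 ** step  # bias correction for the second moment
    new_p, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(param, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g      # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g  # second-moment estimate
        update = (m_i / bias1) / (math.sqrt(v_i / bias2) + eps)
        new_p.append(p - lr * update)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_p, new_m, new_v
```

The three FP32 values kept per element here (parameter, m, v) correspond to the 12 bytes per parameter counted as Adam optimizer state in the memory analysis. An optimized CPU implementation performs the same arithmetic, but over many elements per instruction.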

Theoretical Basis

Memory Analysis

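For a model with P parameters trained on N GPUs, the per-device memory breakdown is shown below. The totals in the last row assume mixed-precision training with bytes_per_param = 2: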

Component	Standard Data Parallel	ZeRO Stage 3	ZeRO-3 + CPU Offload
Parameters	P * bytes_per_param	P / N * bytes_per_param	Gathered on demand (transient)
Gradients	P * bytes_per_param	P / N * bytes_per_param	P / N * bytes_per_param (transient)
Optimizer States	P * 12 bytes (Adam FP32)	P / N * 12 bytes	Offloaded to CPU RAM
GPU Memory	~16P bytes	~16P/N bytes	Activations + temp buffers only

With ZeRO-3 + CPU offload, optimizer states reside in CPU RAM. When parameters are offloaded as well, GPU memory per device is approximately activation memory plus temporary communication buffers; with optimizer-only offload, each GPU additionally keeps its 1/N parameter partition.
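
The table can be turned into a rough estimator. The sketch below is a simplification under stated assumptions: the function name and flags are illustrative, activations and temporary buffers are not modeled, and gradients are treated as transient once the optimizer is offloaded, following the table.

```python
def model_state_bytes_per_gpu(num_params, num_gpus=1, zero3=False,
                              offload_optimizer=False, offload_params=False,
                              bytes_per_param=2):
    """Estimate persistent model-state bytes per GPU (activations excluded)."""
    params = num_params * bytes_per_param
    grads = num_params * bytes_per_param
    optim = num_params * 12  # FP32 Adam state: 12 bytes per parameter
    if not zero3:
        # Standard data parallel: every GPU holds a full replica.
        return float(params + grads + optim)
    if offload_optimizer:
        optim = 0  # optimizer states live in CPU RAM
        grads = 0  # assumption: gradients are transient on the GPU
    if offload_params:
        params = 0  # assumption: parameters are gathered on demand only
    return (params + grads + optim) / num_gpus
```

For example, a 7-billion-parameter model on 8 GPUs comes out at about 112 GB per GPU with standard data parallelism, 14 GB with ZeRO Stage 3, and 1.75 GB when the optimizer is also offloaded.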

Communication Pattern

ZeRO Stage 3 introduces the following communication during training:

All-gather before forward/backward -- Each GPU gathers the full parameter tensor it needs for the current layer from all other GPUs. Parameters are gathered layer-by-layer (or sub-group by sub-group) and released after use.
Reduce-scatter after backward -- Gradients are reduced across GPUs and each GPU retains only its 1/N partition of the reduced gradient.
CPU-GPU transfers -- Reduced gradient partitions are copied to CPU, DeepSpeedCPUAdam updates the optimizer states held in CPU RAM, and the updated parameters are written back to the GPU partition.
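
The two collectives above can be illustrated with a single-process toy that models each rank as a Python list. This is only a sketch of the data movement, not DeepSpeed or NCCL code: the helper names are illustrative, and gradients are summed here for simplicity.

```python
def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def reduce_scatter(per_rank_grads):
    """Sum full-size gradients across ranks; rank r keeps only slice r."""
    n = len(per_rank_grads)
    size = len(per_rank_grads[0]) // n  # assumes length is divisible by n
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    return [summed[r * size:(r + 1) * size] for r in range(n)]


# Two simulated ranks, each owning half of a 4-element parameter tensor.
param_shards = [[1.0, 2.0], [3.0, 4.0]]
gathered = all_gather(param_shards)  # each rank now sees all 4 elements

# Each rank computes a full-size gradient on its own micro-batch.
local_grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
grad_shards = reduce_scatter(local_grads)  # rank r keeps its half of the sum
```

After the reduce-scatter, each rank holds exactly the gradient slice that matches the parameter shard it owns, which is all the optimizer step needs.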

The communication volume per training step, assuming 2-byte (BF16/FP16) parameters and gradients, is:

Forward:  ~2P bytes (all-gather parameters, layer by layer)
Backward: ~2P bytes (all-gather parameters) + ~2P bytes (reduce-scatter gradients)
Total:    ~6P bytes per step
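
The arithmetic above can be wrapped in a small helper. The function name is illustrative, and the default of 2 bytes per element is an assumption that matches the ~2P figures (BF16/FP16 parameters and gradients).

```python
def zero3_comm_bytes_per_step(num_params, bytes_per_element=2):
    """Approximate per-step ZeRO Stage 3 communication volume in bytes."""
    forward_gather = num_params * bytes_per_element    # all-gather parameters
    backward_gather = num_params * bytes_per_element   # all-gather parameters again
    grad_reduce = num_params * bytes_per_element       # reduce-scatter gradients
    return {
        "forward": forward_gather,
        "backward": backward_gather + grad_reduce,
        "total": forward_gather + backward_gather + grad_reduce,
    }
```

For a 7-billion-parameter model this gives roughly 42 GB of traffic per step. A standard data-parallel all-reduce moves about 4P bytes under the same assumption, so ZeRO Stage 3 costs about 1.5x the communication.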

SuperOffload Optimization

SuperOffload extends ZeRO-3 CPU offloading with:

Overlapped CPU-GPU execution -- CPU optimizer steps execute concurrently with GPU forward/backward computation.
Configurable offload ratio -- The ratio parameter (0.0-1.0) controls what fraction of optimizer work is performed on CPU vs. GPU.
CPU core allocation -- The cpuadam_cores_perc parameter controls what percentage of CPU cores are dedicated to the CPUAdam optimizer.

DeepSpeed Configuration

The ZeRO-3 + CPU offload configuration is specified in a JSON config:

{
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true,
            "ratio": 0.90,
            "super_offload": true,
            "cpuadam_cores_perc": 0.90
        }
    },
    "wall_clock_breakdown": true
}
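
The same configuration can be built as a Python dictionary and serialized, which makes it easy to check the batch-size arithmetic first. DeepSpeed requires train_batch_size to equal the per-GPU micro-batch size times gradient_accumulation_steps times the number of GPUs. The helper name `micro_batch_per_gpu` below is illustrative, not a DeepSpeed API.

```python
import json


def micro_batch_per_gpu(train_batch_size, gradient_accumulation_steps, world_size):
    """Derive the per-GPU micro-batch size implied by the global batch size."""
    denom = gradient_accumulation_steps * world_size
    if train_batch_size % denom != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"gradient_accumulation_steps * world_size = {denom}"
        )
    return train_batch_size // denom


# Mirrors the JSON configuration shown above.
ds_config = {
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": False,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
            "ratio": 0.90,
            "super_offload": True,
            "cpuadam_cores_perc": 0.90,
        },
    },
    "wall_clock_breakdown": True,
}

config_json = json.dumps(ds_config, indent=4)  # text to save as the config file
```

With the values above, 4 GPUs imply a micro-batch of 1 per GPU, and 2 GPUs imply a micro-batch of 2.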

Configuration Parameters

Parameter	Description	Impact
`stage: 3`	Full parameter, gradient, and optimizer state partitioning	Maximum memory savings
`overlap_comm`	Overlap communication with computation	Set to false for SuperOffload (different overlap strategy)
`reduce_bucket_size`	Size of gradient reduce-scatter buckets (in elements)	Larger = fewer communications but more memory
`sub_group_size`	Number of parameter elements per sub-group	Controls the granularity at which parameters are updated during the optimizer step
`offload_optimizer.device`	Where to store optimizer states	`cpu` for CPU offloading
`offload_optimizer.pin_memory`	Use pinned memory for CPU-GPU transfers	`true` for faster transfers
`offload_optimizer.ratio`	Fraction of optimizer work on CPU	0.80-0.90 typical
`offload_optimizer.super_offload`	Enable SuperOffload engine	`true` to activate
`offload_optimizer.cpuadam_cores_perc`	Fraction of CPU cores allocated to CPUAdam	0.90 typical

Initialization Pattern

The initialization follows this sequence:

Create a DeepSpeedCPUAdam optimizer with the model parameters.
Call deepspeed.initialize() with the model, optimizer, training data, and collate function.
The returned model_engine wraps the original model with distributed training capabilities.

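import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam
from transformers import default_data_collator  # assumed to be the Hugging Face collator

# `args`, `model`, and `tokenized_dataset` are assumed to be defined by the training script.

# Create CPU-optimized Adam optimizer
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# Initialize DeepSpeed engine; the fourth return value (the LR scheduler) is unused here
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    training_data=tokenized_dataset,
    collate_fn=default_data_collator,
)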

Usage Pattern

Configure the DeepSpeed JSON with ZeRO Stage 3 and CPU offloading.
Create a DeepSpeedCPUAdam optimizer.
Call deepspeed.initialize() to wrap the model.
Use the returned model_engine for forward/backward/step operations.
The engine transparently handles all distributed communication and CPU offloading.
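
The forward/backward/step operations in the list above can be sketched as a short loop. It relies only on the engine being callable and exposing `backward()` and `step()`. The function name `train_one_epoch`, the dict-shaped batches, and the `.loss` attribute on the model output (Hugging Face style) are assumptions of this sketch.

```python
def train_one_epoch(model_engine, dataloader, device=None):
    """Run one pass over `dataloader` and return the mean loss.

    Assumes batches are dicts and the model output exposes `.loss`.
    """
    total_loss, num_steps = 0.0, 0
    for batch in dataloader:
        if device is not None:
            # Move every tensor in the batch to the engine's device.
            batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model_engine(**batch)  # forward pass
        loss = outputs.loss
        model_engine.backward(loss)      # backward pass via the engine
        model_engine.step()              # optimizer step via the engine
        total_loss += float(loss)
        num_steps += 1
    return total_loss / max(num_steps, 1)
```

Calling `backward()` and `step()` on the engine, rather than on the loss tensor and the optimizer directly, is what lets the engine run gradient reduction and the offloaded optimizer update.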
