Principle:Microsoft DeepSpeedExamples ZeRO3 CPU Offload Training
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | ZeRO3_CPU_Offload_Training |
| Repository | Microsoft/DeepSpeedExamples |
| Sources | Paper: ZeRO https://arxiv.org/abs/1910.02054 ; Paper: ZeRO-Offload https://arxiv.org/abs/2101.06840 |
| Domains | Distributed_Training, Memory_Optimization |
| Status | Active |
| Related Implementation | Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Initialize_SuperOffload |
Overview
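A distributed training technique that partitions all model states (parameters, gradients, and optimizer states) across GPUs and offloads optimizer states, and optionally parameters, to CPU memory. This enables fine-tuning of models whose states exceed the combined GPU memory.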
Description
ZeRO (Zero Redundancy Optimizer) Stage 3 with CPU offloading is the core distributed training strategy used by SuperOffload. It addresses the fundamental memory bottleneck of fine-tuning large language models by combining two complementary techniques:
- ZeRO Stage 3 partitioning -- Partitions parameters, gradients, and optimizer states across all available GPUs. Each GPU holds only 1/N of each model state (where N is the number of GPUs), eliminating redundant copies.
- CPU offloading -- Moves optimizer states (and optionally parameters) to CPU RAM, using the CPU as overflow memory. This is particularly effective on Superchip architectures (GH200, GB200) where CPU-GPU bandwidth is exceptionally high.
The deepspeed.initialize() call wraps the model into a DeepSpeedEngine that manages all distributed operations transparently. From the user's perspective, the model behaves like a standard PyTorch model, but all parameter gathering, gradient reduction, and optimizer stepping happen across devices automatically.
DeepSpeedCPUAdam
With CPU offloading, the optimizer step runs on the CPU: gradients are transferred to CPU, updates are computed there, and the updated parameters are transferred back to the GPU. The standard PyTorch Adam optimizer is inefficient in this setting because its CPU update path becomes the bottleneck of each training step. DeepSpeedCPUAdam is a highly optimized CPU Adam implementation that:
- Uses SIMD (AVX2/AVX-512) instructions for vectorized math on CPU
- Supports asynchronous parameter updates overlapped with GPU computation
- Allocates a configurable percentage of CPU cores (`cpuadam_cores_perc`) for optimizer work
- Reduces CPU optimizer time by up to 5-6x compared to standard PyTorch Adam on CPU
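To make the per-element work concrete, here is a minimal scalar sketch of the Adam update that a CPU optimizer applies to every parameter. It is plain Python written for illustration, not DeepSpeedCPUAdam's kernel; the function name `adam_step` is hypothetical, and weight decay is left out for brevity.

```python
import math


def adam_step(param, grad, exp_avg, exp_avg_sq, step,
              lr=0.001, betas=(0.9, 0.999), eps=1e-8):
    """Apply one Adam update to a single scalar parameter (illustrative only)."""
    beta1, beta2 = betas
    # Update the running first and second moments of the gradient.
    exp_avg = beta1 * exp_avg + (1 - beta1) * grad
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero-initialized moments.
    bias1 = 1 - beta1 ** step
    bias2 = 1 - beta2 ** step
    denom = math.sqrt(exp_avg_sq / bias2) + eps
    param = param - lr * (exp_avg / bias1) / denom
    return param, exp_avg, exp_avg_sq


# First step from zero-initialized moments: the parameter moves by roughly lr.
new_param, m, v = adam_step(param=1.0, grad=0.5, exp_avg=0.0, exp_avg_sq=0.0, step=1)
print(new_param)
```

Each parameter carries three FP32 values through this update (the parameter itself and the two moments), which is where the 12 bytes per parameter of Adam state comes from. A vectorized implementation applies the same arithmetic to many elements per instruction.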
Theoretical Basis
Memory Analysis
For a model with P parameters and N GPUs, the memory breakdown per device is:
| Component | Standard Data Parallel | ZeRO Stage 3 | ZeRO-3 + CPU Offload |
|---|---|---|---|
| Parameters | P * bytes_per_param | P / N * bytes_per_param | Gathered on demand (transient) |
| Gradients | P * bytes_per_param | P / N * bytes_per_param | P / N * bytes_per_param (transient) |
| Optimizer States | P * 12 bytes (Adam FP32) | P / N * 12 bytes | Offloaded to CPU RAM |
| GPU Memory | ~16P bytes | ~16P/N bytes | Activations + temp buffers only |
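With ZeRO-3 + CPU offload, persistent optimizer states reside in CPU RAM, and parameters do as well when parameter offloading is enabled. GPU memory per device then approaches activation memory plus temporary communication buffers. With optimizer-only offload, each GPU additionally keeps its 1/N partition of the 16-bit parameters.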
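The table's arithmetic can be checked with a small estimator. This is a rough sketch that follows the table's assumptions (2-byte parameters and gradients, 12 bytes of Adam state per parameter) and ignores activations and buffers; the function name `model_state_bytes` and the 7B-parameter, 4-GPU example are illustrative choices, not values from the repository.

```python
def model_state_bytes(num_params, num_gpus, bytes_per_param=2,
                      optimizer_bytes_per_param=12):
    """Estimate model-state memory in bytes, excluding activations and buffers."""
    # Parameters + gradients + Adam optimizer states, per parameter.
    per_param = 2 * bytes_per_param + optimizer_bytes_per_param
    replicated = num_params * per_param
    return {
        # Standard data parallel: every GPU holds a full copy.
        "data_parallel_gpu": replicated,
        # ZeRO Stage 3: every GPU holds a 1/N shard of each state.
        "zero3_gpu": replicated / num_gpus,
        # Optimizer-state shard per rank, which CPU offload moves to host RAM.
        "optimizer_state_per_rank": num_params * optimizer_bytes_per_param / num_gpus,
    }


estimate = model_state_bytes(num_params=7_000_000_000, num_gpus=4)
for name, value in estimate.items():
    print(f"{name}: {value / 1e9:.1f} GB")
```

Under these assumptions the optimizer states are the largest component, which is why moving them to CPU RAM frees most of the GPU memory.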
Communication Pattern
ZeRO Stage 3 introduces the following communication during training:
- All-gather before forward/backward -- Each GPU gathers the full parameter tensor it needs for the current layer from all other GPUs. Parameters are gathered layer-by-layer (or sub-group by sub-group) and released after use.
- Reduce-scatter after backward -- Gradients are reduced across GPUs and each GPU retains only its 1/N partition of the reduced gradient.
- CPU-GPU transfers -- Gradient partitions are copied to CPU, DeepSpeedCPUAdam updates the optimizer states held in CPU RAM, and the updated parameters are written back to the GPU partition.
The communication volume per training step is:
- Forward: ~2P bytes (all-gather parameters, layer by layer)
- Backward: ~2P bytes (all-gather parameters) + ~2P bytes (reduce-scatter gradients)
- Total: ~6P bytes per step

These figures assume 2 bytes per element (FP16/BF16 parameters and gradients).
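The per-step volume can be reproduced with a few lines of arithmetic. This sketch assumes 2 bytes per element and counts only the three collectives listed above; the function name `zero3_comm_bytes` is hypothetical.

```python
def zero3_comm_bytes(num_params, bytes_per_element=2):
    """Approximate per-step ZeRO Stage 3 communication volume in bytes."""
    volume = {
        # Parameters are all-gathered once for the forward pass...
        "forward_all_gather": num_params * bytes_per_element,
        # ...and again for the backward pass, since they are released after use.
        "backward_all_gather": num_params * bytes_per_element,
        # Gradients are reduce-scattered so each rank keeps its 1/N shard.
        "gradient_reduce_scatter": num_params * bytes_per_element,
    }
    volume["total"] = sum(volume.values())
    return volume


volume = zero3_comm_bytes(num_params=1_000_000_000)
print(f"total per step: {volume['total'] / 1e9:.1f} GB")
```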
SuperOffload Optimization
SuperOffload extends ZeRO-3 CPU offloading with:
- Overlapped CPU-GPU execution -- CPU optimizer steps execute concurrently with GPU forward/backward computation.
- Configurable offload ratio -- The `ratio` parameter (0.0-1.0) controls what fraction of optimizer work is performed on CPU vs. GPU.
- CPU core allocation -- The `cpuadam_cores_perc` parameter controls what percentage of CPU cores are dedicated to the CPUAdam optimizer.
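As a rough illustration of what the two knobs mean numerically, the sketch below turns them into element and core counts. It is arithmetic only and does not reflect how DeepSpeed actually schedules the work; the function `offload_plan` and the 72-core host are assumptions made for the example.

```python
def offload_plan(total_elements, ratio, total_cores, cpuadam_cores_perc):
    """Translate the two SuperOffload fractions into counts (illustrative only)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    if not 0.0 < cpuadam_cores_perc <= 1.0:
        raise ValueError("cpuadam_cores_perc must be in (0.0, 1.0]")
    # Share of optimizer elements updated on the CPU; the rest stays on the GPU.
    cpu_elements = round(total_elements * ratio)
    return {
        "cpu_elements": cpu_elements,
        "gpu_elements": total_elements - cpu_elements,
        # Round down, but always keep at least one core for the optimizer.
        "cpuadam_cores": max(1, int(total_cores * cpuadam_cores_perc)),
    }


plan = offload_plan(total_elements=1_000_000, ratio=0.90,
                    total_cores=72, cpuadam_cores_perc=0.90)
print(plan)
```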
DeepSpeed Configuration
The ZeRO-3 + CPU offload configuration is specified in a JSON config:
{
  "train_batch_size": 4,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "reduce_bucket_size": 4e8,
    "sub_group_size": 4e8,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "ratio": 0.90,
      "super_offload": true,
      "cpuadam_cores_perc": 0.90
    }
  },
  "wall_clock_breakdown": true
}
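When the GPU count varies between runs, it can be convenient to generate this config from Python rather than edit the JSON by hand. The sketch below mirrors the fields shown above. The helper `build_config` and its arguments are hypothetical; the only relation it relies on is that DeepSpeed expects `train_batch_size` to equal the per-GPU micro-batch size times the gradient accumulation steps times the number of GPUs.

```python
import json


def build_config(micro_batch_per_gpu, grad_accum_steps, world_size,
                 ratio=0.90, cpuadam_cores_perc=0.90):
    """Build a ZeRO-3 + SuperOffload config dict mirroring the JSON above."""
    return {
        # Global batch = per-GPU micro-batch x accumulation steps x GPU count.
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": False,
            "reduce_bucket_size": 4e8,
            "sub_group_size": 4e8,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
                "ratio": ratio,
                "super_offload": True,
                "cpuadam_cores_perc": cpuadam_cores_perc,
            },
        },
        "wall_clock_breakdown": True,
    }


# Example: 4 GPUs with a micro-batch of 1 reproduces train_batch_size = 4.
config = build_config(micro_batch_per_gpu=1, grad_accum_steps=1, world_size=4)
print(json.dumps(config, indent=2))
```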
Configuration Parameters
| Parameter | Description | Impact |
|---|---|---|
| `stage: 3` | Full parameter, gradient, and optimizer state partitioning | Maximum memory savings |
| `overlap_comm` | Overlap communication with computation | Set to `false` for SuperOffload (different overlap strategy) |
| `reduce_bucket_size` | Size of gradient reduce-scatter buckets | Larger = fewer communications but more memory |
| `sub_group_size` | Sub-group size for parameter partitioning | Controls granularity of parameter gathering |
| `offload_optimizer.device` | Where to store optimizer states | `cpu` for CPU offloading |
| `offload_optimizer.pin_memory` | Use pinned memory for CPU-GPU transfers | `true` for faster transfers |
| `offload_optimizer.ratio` | Fraction of optimizer work on CPU | 0.80-0.90 typical |
| `offload_optimizer.super_offload` | Enable SuperOffload engine | `true` to activate |
| `offload_optimizer.cpuadam_cores_perc` | CPU cores allocated for CPUAdam | 0.90 typical |
Initialization Pattern
The initialization follows this sequence:
- Create a DeepSpeedCPUAdam optimizer with the model parameters.
- Call `deepspeed.initialize()` with the model, optimizer, training data, and collate function.
- The returned `model_engine` wraps the original model with distributed training capabilities.
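import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam
from transformers import default_data_collator

# model, args, and tokenized_dataset are defined earlier in the training script.
# Create CPU-optimized Adam optimizer
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# Initialize DeepSpeed engine; the JSON config is picked up from args
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    training_data=tokenized_dataset,
    collate_fn=default_data_collator,
)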
Usage Pattern
- Configure the DeepSpeed JSON with ZeRO Stage 3 and CPU offloading.
- Create a `DeepSpeedCPUAdam` optimizer.
- Call `deepspeed.initialize()` to wrap the model.
- Use the returned `model_engine` for forward/backward/step operations.
- The engine transparently handles all distributed communication and CPU offloading.
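The forward/backward/step portion of this pattern might look like the sketch below. It assumes `model_engine` and `train_dataloader` are the objects returned by `deepspeed.initialize()`, that batches are dictionaries, and that the wrapped model returns an object with a `.loss` attribute (as Hugging Face causal-LM models do when labels are supplied). The function name `train_one_epoch` is hypothetical.

```python
def train_one_epoch(model_engine, train_dataloader):
    """Run one pass over the data using the DeepSpeed engine interface."""
    losses = []
    for batch in train_dataloader:
        # Move tensors to the engine's device; leave non-tensor values untouched.
        batch = {
            key: value.to(model_engine.device) if hasattr(value, "to") else value
            for key, value in batch.items()
        }
        # Forward pass: the engine gathers partitioned parameters as needed.
        outputs = model_engine(**batch)
        loss = outputs.loss
        # Backward through the engine so gradients are reduced and partitioned.
        model_engine.backward(loss)
        # Optimizer step; with CPU offload this update runs on the host.
        model_engine.step()
        losses.append(float(loss))
    return losses
```

Calling `model_engine.backward(loss)` and `model_engine.step()` instead of `loss.backward()` and `optimizer.step()` is what lets the engine manage gradient partitioning and the offloaded optimizer.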
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Initialize_SuperOffload
- Principle:Microsoft_DeepSpeedExamples_SuperOffload_Environment
- Principle:Microsoft_DeepSpeedExamples_DeepSpeed_Training_Loop
- Principle:Microsoft_DeepSpeedExamples_Large_Model_Loading
- Heuristic:Microsoft_DeepSpeedExamples_SuperOffload_NUMA_Binding