Principle:Microsoft DeepSpeedExamples SuperOffload Environment
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | SuperOffload_Environment |
| Repository | Microsoft/DeepSpeedExamples |
| Sources | Doc: DeepSpeed https://www.deepspeed.ai/tutorials/zero-offloading/ ; Blog: SuperOffload https://github.com/microsoft/DeepSpeedExamples/tree/master/training/DeepSpeed-SuperOffload |
| Domains | Infrastructure, Distributed_Training |
| Status | Active |
| Related Implementation | Implementation:Microsoft_DeepSpeedExamples_Launch_Scripts_SuperOffload |
Overview
A deployment methodology for configuring CPU-offloaded distributed training to fine-tune large models (8B-70B+) on limited GPU hardware.
Description
SuperOffload uses ZeRO Stage 3 with CPU offloading for both parameters and optimizer states. It is an optimized CPU offloading engine designed for full-parameter training on emerging "Superchips" such as NVIDIA GH200 / GB200 and AMD MI300A, which provide very high CPU-to-GPU bandwidth. The core environment setup involves:
- ZeRO Stage 3 configuration -- All model parameters, gradients, and optimizer states are partitioned across available GPUs, with CPU RAM serving as overflow storage.
- NUMA binding via `--bind_cores_to_rank` -- Ensures optimal CPU-GPU affinity so that each GPU is paired with the CPU directly associated with it. This improves bandwidth and throughput.
- Shell scripts for per-model settings -- Each model variant has a dedicated launch script that configures learning rate, batch size, sequence length, activation checkpointing, and the DeepSpeed JSON config inline.
- SuperOffload-specific config flags -- The `super_offload` and `cpuadam_cores_perc` keys in the DeepSpeed JSON enable the optimized offloading engine and control what fraction of CPU cores is allocated to the CPUAdam optimizer.
The following models are supported with their respective GPU requirements:
| Model | GPUs Required | Example Script |
|---|---|---|
| GPT-OSS-20B | 1x GH200 | finetune_gpt-oss-20b_1gpu.sh |
| Qwen3-14B | 1x GH200 | finetune_qwen3-14b_1gpu.sh |
| Phi-4 | 1x GH200 | finetune_phi-4_1gpu.sh |
| Llama 8B | 1x GH200 | finetune_llama-8b_1gpu.sh |
| Seed-OSS-36B | 2x GH200 | finetune_seed-oss-36b_2gpu.sh |
| Qwen3-30B-A3B | 2x GH200 | finetune_qwen3-30b-a3b_2gpu.sh |
| Llama 70B | 4x GH200 | finetune_llama-70b_4gpu.sh |
Dependencies
The environment requires the following packages (from requirements.txt):
- torch>=2.5.1
- deepspeed>=0.17.0
- datasets>=4.0.0
- transformers>=4.56.1
- numpy>=1.21.0
- flash-attn>=2.0.0
- wandb
- packaging
- psutil
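Before launching a long run, it can help to confirm that the installed packages meet these minimums. The following is a minimal sketch using only the Python standard library; the helper names (`parse_requirement`, `version_tuple`, `check_requirements`) are hypothetical, and the version comparison is deliberately naive (it looks only at the first three numeric components), so it is not a substitute for pip's own resolver.

```python
import re
from importlib import metadata

# Requirement strings as listed in this section.
REQUIREMENTS = [
    "torch>=2.5.1", "deepspeed>=0.17.0", "datasets>=4.0.0",
    "transformers>=4.56.1", "numpy>=1.21.0", "flash-attn>=2.0.0",
    "wandb", "packaging", "psutil",
]


def parse_requirement(line):
    """Split 'name>=version' into (name, minimum); minimum is None if unpinned."""
    name, sep, minimum = line.partition(">=")
    return name.strip(), (minimum.strip() if sep else None)


def version_tuple(version):
    """Naive comparison key: first three numeric components, local suffix dropped."""
    public = version.split("+")[0]  # e.g. '2.5.1+cu121' -> '2.5.1'
    return tuple(int(part) for part in re.findall(r"\d+", public)[:3])


def check_requirements(requirements):
    """Return a list of human-readable problems (empty if everything is satisfied)."""
    problems = []
    for line in requirements:
        name, minimum = parse_requirement(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if minimum and version_tuple(installed) < version_tuple(minimum):
            problems.append(f"{name}: found {installed}, need >= {minimum}")
    return problems


if __name__ == "__main__":
    for problem in check_requirements(REQUIREMENTS):
        print(problem)
```

An empty result means every listed package was found at or above its minimum under this simplified comparison.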
Theoretical Basis
ZeRO-3 with CPU offload moves parameters and optimizer states to CPU RAM, so GPU memory is needed mainly for activations and for the parameters currently being computed on. This enables fine-tuning models whose total training state far exceeds the available GPU memory.
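To see why offloading matters, the model-state footprint can be estimated with the standard mixed-precision Adam accounting: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for fp32 optimizer state (master weights, momentum, variance). The sketch below is a back-of-the-envelope calculation only; the function name is hypothetical, the parameter counts are nominal (8B, 70B), and activations and temporary buffers are ignored.

```python
# Bytes per parameter under mixed-precision Adam (standard ZeRO accounting).
BYTES_WEIGHTS_16BIT = 2    # bf16/fp16 working weights
BYTES_GRADS_16BIT = 2      # bf16/fp16 gradients
BYTES_OPTIMIZER_FP32 = 12  # fp32 master weights + momentum + variance

GB = 10**9  # decimal gigabytes


def model_state_gb(num_params):
    """Estimate model-state memory (in GB) for a model with num_params parameters."""
    weights_and_grads = num_params * (BYTES_WEIGHTS_16BIT + BYTES_GRADS_16BIT)
    optimizer = num_params * BYTES_OPTIMIZER_FP32
    return {
        "weights_and_grads_gb": weights_and_grads / GB,
        "optimizer_gb": optimizer / GB,
        "total_gb": (weights_and_grads + optimizer) / GB,
    }


if __name__ == "__main__":
    # Nominal parameter counts, used for illustration only.
    for label, count in [("8B", 8_000_000_000), ("70B", 70_000_000_000)]:
        print(label, model_state_gb(count))
```

Under this accounting, optimizer state is three quarters of the total, which is why placing it in CPU RAM frees most of the model-state memory on the GPU.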
NUMA binding prevents cross-socket memory access penalties. On multi-socket systems (such as dual-socket GH200 configurations), binding each training process to the CPU cores physically attached to its corresponding GPU ensures that memory accesses remain local to the NUMA domain. Without binding, memory traffic may cross the inter-socket link, incurring latency penalties of 2-3x and reducing effective bandwidth.
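The idea behind core binding can be sketched as an even, contiguous split of a node's cores across local ranks. This is a simplified illustration, not the DeepSpeed launcher's actual logic: it assumes cores are numbered contiguously per NUMA domain and that local rank i runs on the GPU attached to domain i. The function names and the core count in the example are hypothetical.

```python
import os


def cores_for_rank(local_rank, ranks_per_node, total_cores):
    """Return the contiguous block of core IDs assigned to one local rank."""
    if not 0 <= local_rank < ranks_per_node:
        raise ValueError("local_rank must be in [0, ranks_per_node)")
    if total_cores % ranks_per_node != 0:
        raise ValueError("total_cores must divide evenly across ranks")
    per_rank = total_cores // ranks_per_node
    start = local_rank * per_rank
    return list(range(start, start + per_rank))


def bind_current_process(cores):
    """Pin the calling process to the given cores (Linux only; no-op elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cores)


if __name__ == "__main__":
    # Illustrative node: 4 ranks sharing 288 cores (72 per rank).
    for rank in range(4):
        block = cores_for_rank(rank, 4, 288)
        print(f"rank {rank}: cores {block[0]}-{block[-1]}")
```

In practice the `--bind_cores_to_rank` launcher flag handles this; the sketch only shows the kind of mapping such a flag has to produce.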
Memory System Resource Partitioning and Monitoring (MPAM) is essential for optimal throughput. In SuperOffload, GPU execution is overlapped with CPU-based Adam execution. MPAM reduces interference between these two processes by partitioning cache and memory bandwidth resources, leading to smoother execution and better performance.
The SuperOffload engine achieves up to ~500 TFLOPS on GH200, approximately 50% higher throughput than standard ZeRO-Offload, by overlapping CPU optimizer computation with GPU forward/backward passes and utilizing optimized CPU-GPU data transfer patterns.
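The benefit of overlapping can be illustrated with a toy timing model: if the CPU optimizer step runs strictly after the GPU work, the step time is the sum of the two; if it is fully hidden behind GPU work, the step time is the larger of the two. This is an idealized sketch with made-up durations, not a measurement and not a description of how either engine actually schedules work (transfers and partial overlap are ignored).

```python
def step_time_serial(gpu_seconds, cpu_optimizer_seconds):
    """Step time when the CPU optimizer runs only after GPU work finishes."""
    return gpu_seconds + cpu_optimizer_seconds


def step_time_overlapped(gpu_seconds, cpu_optimizer_seconds):
    """Step time when the CPU optimizer is fully hidden behind GPU work."""
    return max(gpu_seconds, cpu_optimizer_seconds)


def ideal_speedup(gpu_seconds, cpu_optimizer_seconds):
    """Upper bound on the speedup obtainable from overlap in this toy model."""
    serial = step_time_serial(gpu_seconds, cpu_optimizer_seconds)
    overlapped = step_time_overlapped(gpu_seconds, cpu_optimizer_seconds)
    return serial / overlapped


if __name__ == "__main__":
    # Illustrative durations only: 2.0 s of GPU work, 0.5 s of CPU Adam.
    print(ideal_speedup(2.0, 0.5))
```

The model also shows the limit of the technique: once the CPU optimizer step is longer than the GPU work, it becomes the bottleneck and further GPU speedups no longer shorten the step.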
Usage Pattern
The typical environment setup flow is:
- Install dependencies: `pip install -r requirements.txt`
- Select the appropriate launch script for the target model and GPU count.
- Choose the mode: `superoffload` or `zerooffload` (passed as the first argument).
- Optionally override batch size (passed as the second argument, default 4).
- Execute the script, which generates the DeepSpeed JSON config inline and launches training.
# Example: Fine-tune Llama 8B with SuperOffload on 1 GPU
bash finetune_llama-8b_1gpu.sh superoffload
# Example: Fine-tune Llama 70B with SuperOffload on 4 GPUs, batch size 8
bash finetune_llama-70b_4gpu.sh superoffload 8
# Example: Fall back to ZeRO-Offload
bash finetune_llama-8b_1gpu.sh zerooffload
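When the scripts are driven from a wrapper rather than typed by hand, the same invocations can be assembled programmatically. The sketch below is a hypothetical helper: the dictionary keys are made-up short labels for the scripts listed in the model table, and the default mode is a choice made here, not something the scripts are known to define. It only builds the argument list; running it is left to the caller.

```python
# Hypothetical short labels mapped to the launch scripts from the model table.
SCRIPTS = {
    "gpt-oss-20b": "finetune_gpt-oss-20b_1gpu.sh",
    "qwen3-14b": "finetune_qwen3-14b_1gpu.sh",
    "phi-4": "finetune_phi-4_1gpu.sh",
    "llama-8b": "finetune_llama-8b_1gpu.sh",
    "seed-oss-36b": "finetune_seed-oss-36b_2gpu.sh",
    "qwen3-30b-a3b": "finetune_qwen3-30b-a3b_2gpu.sh",
    "llama-70b": "finetune_llama-70b_4gpu.sh",
}
MODES = ("superoffload", "zerooffload")


def launch_command(model, mode="superoffload", batch_size=None):
    """Build the argv list: bash <script> <mode> [batch_size]."""
    if model not in SCRIPTS:
        raise KeyError(f"unknown model label: {model}")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    command = ["bash", SCRIPTS[model], mode]
    if batch_size is not None:
        if batch_size < 1:
            raise ValueError("batch_size must be positive")
        command.append(str(batch_size))  # second positional argument
    return command


if __name__ == "__main__":
    # To actually launch: subprocess.run(launch_command(...), check=True)
    print(" ".join(launch_command("llama-70b", "superoffload", 8)))
```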
DeepSpeed Configuration Structure
The SuperOffload mode generates the following JSON config:
{
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true,
            "ratio": 0.90,
            "super_offload": true,
            "cpuadam_cores_perc": 0.90
        }
    },
    "wall_clock_breakdown": true
}
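For experimentation outside the shell scripts, an equivalent config can be built as a Python dictionary and serialized to JSON. The sketch below reproduces the values shown above for the `superoffload` mode. How the `zerooffload` mode differs is an assumption: here it simply omits the two SuperOffload-specific keys, and the real scripts may differ in other ways. The function name is hypothetical.

```python
import json


def build_ds_config(mode="superoffload", batch_size=4):
    """Build a DeepSpeed config dict mirroring the JSON shown in this section."""
    offload_optimizer = {"device": "cpu", "pin_memory": True, "ratio": 0.90}
    if mode == "superoffload":
        offload_optimizer["super_offload"] = True
        offload_optimizer["cpuadam_cores_perc"] = 0.90
    elif mode != "zerooffload":
        # ASSUMPTION: zerooffload is modeled as "same config minus the
        # SuperOffload-specific keys"; any other mode is rejected.
        raise ValueError("mode must be 'superoffload' or 'zerooffload'")
    return {
        "train_batch_size": batch_size,
        "gradient_accumulation_steps": 1,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": False,
            "reduce_bucket_size": int(4e8),  # same value as 4e8 in the JSON
            "sub_group_size": int(4e8),
            "offload_optimizer": offload_optimizer,
        },
        "wall_clock_breakdown": True,
    }


if __name__ == "__main__":
    print(json.dumps(build_ds_config("superoffload", 4), indent=4))
```

Building the config in code makes it easy to sweep values such as `ratio` or `cpuadam_cores_perc` without editing the shell scripts.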
Key configuration parameters:
| Parameter | Description | Typical Value |
|---|---|---|
| stage | ZeRO optimization stage | 3 |
| overlap_comm | Whether to overlap communication with computation | false |
| reduce_bucket_size | Size of gradient reduce buckets | 4e8 |
| sub_group_size | Sub-group size for parameter partitioning | 4e8 |
| offload_optimizer.device | Device for optimizer state offloading | cpu |
| offload_optimizer.pin_memory | Use pinned memory for CPU-GPU transfers | true |
| offload_optimizer.ratio | Fraction of optimizer work offloaded to CPU | 0.80-0.90 |
| offload_optimizer.super_offload | Enable SuperOffload engine | true |
| offload_optimizer.cpuadam_cores_perc | Fraction of CPU cores for CPUAdam | 0.90 |
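One relationship worth keeping in mind when changing the batch size or GPU count: DeepSpeed requires `train_batch_size` to equal the per-GPU micro-batch size times `gradient_accumulation_steps` times the number of GPUs. The sketch below derives the implied micro-batch size; the function name is hypothetical.

```python
def micro_batch_per_gpu(train_batch_size, gradient_accumulation_steps, world_size):
    """Per-GPU micro-batch size implied by the global batch size."""
    denominator = gradient_accumulation_steps * world_size
    if train_batch_size % denominator != 0:
        raise ValueError(
            "train_batch_size must be divisible by "
            "gradient_accumulation_steps * world_size"
        )
    return train_batch_size // denominator


if __name__ == "__main__":
    # With the config values above (train_batch_size=4, accumulation=1):
    print(micro_batch_per_gpu(4, 1, 1))  # single-GPU script
    print(micro_batch_per_gpu(4, 1, 4))  # four-GPU script
```

With the default global batch size of 4, a four-GPU run therefore processes one sample per GPU per step, while a single-GPU run processes all four on one device.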