Principle:Microsoft DeepSpeedExamples SuperOffload Environment

Metadata

Field	Value
Page Type	Principle
Title	SuperOffload_Environment
Repository	Microsoft/DeepSpeedExamples
Sources	Doc: DeepSpeed https://www.deepspeed.ai/tutorials/zero-offloading/ ; Blog: SuperOffload https://github.com/microsoft/DeepSpeedExamples/tree/master/training/DeepSpeed-SuperOffload
Domains	Infrastructure, Distributed_Training
Status	Active
Related Implementation	Implementation:Microsoft_DeepSpeedExamples_Launch_Scripts_SuperOffload

Overview

A deployment methodology for configuring CPU-offloaded distributed training to fine-tune large models (8B-70B+) on limited GPU hardware.

Description

SuperOffload uses ZeRO Stage 3 with CPU offloading for both parameters and optimizer states. It is an optimized CPU offloading engine designed for full-parameter training on emerging "Superchips" such as NVIDIA GH200 / GB200 and AMD MI300A, which provide very high CPU-to-GPU bandwidth. The core environment setup involves:

ZeRO Stage 3 configuration -- All model parameters, gradients, and optimizer states are partitioned across available GPUs, with CPU RAM serving as overflow storage.
NUMA binding via --bind_cores_to_rank -- Ensures optimal CPU-GPU affinity so that each GPU is paired with the CPU directly associated with it. This improves bandwidth and throughput.
Shell scripts for per-model settings -- Each model variant has a dedicated launch script that configures learning rate, batch size, sequence length, activation checkpointing, and the DeepSpeed JSON config inline.
SuperOffload-specific config flags -- The super_offload and cpuadam_cores_perc keys in the DeepSpeed JSON enable the optimized offloading engine and set the fraction of CPU cores (for example, 0.90 for 90%) allocated to the CPUAdam optimizer.

The following models are supported with their respective GPU requirements:

Model	GPUs Required	Example Script
GPT-OSS-20B	1x GH200	`finetune_gpt-oss-20b_1gpu.sh`
Qwen3-14B	1x GH200	`finetune_qwen3-14b_1gpu.sh`
Phi-4	1x GH200	`finetune_phi-4_1gpu.sh`
Llama 8B	1x GH200	`finetune_llama-8b_1gpu.sh`
Seed-OSS-36B	2x GH200	`finetune_seed-oss-36b_2gpu.sh`
Qwen3-30B-A3B	2x GH200	`finetune_qwen3-30b-a3b_2gpu.sh`
Llama 70B	4x GH200	`finetune_llama-70b_4gpu.sh`
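
The table above can be read as a simple lookup from model to GPU count and script. A minimal sketch, assuming the script names are exactly as listed; the `select_script` helper is hypothetical and not part of the repository:

```python
# Script names are copied from the table above; the helper itself is illustrative only.
LAUNCH_SCRIPTS = {
    "GPT-OSS-20B": (1, "finetune_gpt-oss-20b_1gpu.sh"),
    "Qwen3-14B": (1, "finetune_qwen3-14b_1gpu.sh"),
    "Phi-4": (1, "finetune_phi-4_1gpu.sh"),
    "Llama 8B": (1, "finetune_llama-8b_1gpu.sh"),
    "Seed-OSS-36B": (2, "finetune_seed-oss-36b_2gpu.sh"),
    "Qwen3-30B-A3B": (2, "finetune_qwen3-30b-a3b_2gpu.sh"),
    "Llama 70B": (4, "finetune_llama-70b_4gpu.sh"),
}


def select_script(model: str, available_gpus: int) -> str:
    """Return the launch script for `model`, or raise if too few GPUs are available."""
    try:
        required, script = LAUNCH_SCRIPTS[model]
    except KeyError:
        raise ValueError(f"no launch script listed for {model!r}") from None
    if available_gpus < required:
        raise ValueError(
            f"{model} needs {required} GPU(s), only {available_gpus} available"
        )
    return script
```

For instance, `select_script("Llama 70B", 4)` returns the 4-GPU script, while asking for the same model with 2 GPUs raises a `ValueError`.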

Dependencies

The environment requires the following packages (from requirements.txt):

torch>=2.5.1
deepspeed>=0.17.0
datasets>=4.0.0
transformers>=4.56.1
numpy>=1.21.0
flash-attn>=2.0.0
wandb
packaging
psutil
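
Before launching, it can help to confirm that the pinned minimums are actually met. The following is a rough sketch using only the standard library; the helper names are hypothetical, and the version comparison is a simplification (it ignores pre-release and local suffixes) rather than a full PEP 440 implementation:

```python
import re
from importlib import metadata

# Minimum versions copied from the requirements list above.
MIN_VERSIONS = {
    "torch": "2.5.1",
    "deepspeed": "0.17.0",
    "datasets": "4.0.0",
    "transformers": "4.56.1",
    "numpy": "1.21.0",
    "flash-attn": "2.0.0",
}


def version_tuple(version: str) -> tuple:
    """Turn '2.5.1+cu124' into (2, 5, 1); only the leading numeric part is kept."""
    release = re.match(r"\d+(\.\d+)*", version)
    if release is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = tuple(int(part) for part in release.group(0).split("."))
    # Pad to three components so that '4.0' compares equal to '4.0.0'.
    return parts + (0,) * (3 - len(parts))


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_problems(min_versions=MIN_VERSIONS, lookup=installed_version) -> list:
    """List every package that is missing or older than its minimum version."""
    problems = []
    for package, minimum in min_versions.items():
        found = lookup(package)
        if found is None:
            problems.append(f"{package}: not installed (need >= {minimum})")
        elif version_tuple(found) < version_tuple(minimum):
            problems.append(f"{package}: {found} < {minimum}")
    return problems
```

Calling `find_problems()` in the training environment returns an empty list when every pinned package is present and recent enough. The unpinned packages (wandb, packaging, psutil) are not covered by this sketch.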

Theoretical Basis

ZeRO-3 with CPU offload keeps optimizer states (and, where configured, parameters) in CPU RAM, so GPU memory is needed mainly for activations and for the parameters and gradients currently in use. This enables fine-tuning models whose total model state far exceeds the available GPU memory.
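
The memory argument can be made concrete with the usual mixed-precision Adam accounting: 2 bytes per parameter for the bf16 weights, 2 for the bf16 gradients, and 12 for the fp32 optimizer state (master weight, momentum, variance). The sketch below applies that accounting; it ignores activations, buffers, and fragmentation, so the numbers are a lower bound rather than a measurement:

```python
BYTES_PARAMS_BF16 = 2      # bf16 working copy of each parameter
BYTES_GRADS_BF16 = 2       # bf16 gradient for each parameter
BYTES_OPTIMIZER_FP32 = 12  # fp32 master weight + Adam momentum + Adam variance


def model_state_gb(num_params: float) -> dict:
    """Estimate model-state memory in GB (1e9 bytes) for mixed-precision Adam."""
    gb = 1e9
    sizes = {
        "params": num_params * BYTES_PARAMS_BF16 / gb,
        "grads": num_params * BYTES_GRADS_BF16 / gb,
        "optimizer": num_params * BYTES_OPTIMIZER_FP32 / gb,
    }
    sizes["total"] = sum(sizes.values())
    return sizes


if __name__ == "__main__":
    for name, count in (("8B", 8e9), ("70B", 70e9)):
        print(name, model_state_gb(count))
```

Under this accounting the optimizer state is three quarters of the model state (840 GB of 1120 GB for a 70B-parameter model), which is why moving it to CPU RAM has such a large effect on what fits on the GPU.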

NUMA binding prevents cross-socket memory access penalties. On multi-socket systems (such as dual-socket GH200 configurations), binding each training process to the CPU cores physically attached to its corresponding GPU ensures that memory accesses remain local to the NUMA domain. Without binding, memory traffic may cross the inter-socket link, incurring latency penalties of 2-3x and reducing effective bandwidth.
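
To illustrate what binding means in practice, the sketch below splits a node's cores evenly between local ranks and pins the current process to its share. This is not DeepSpeed's implementation of --bind_cores_to_rank; it assumes that cores are numbered contiguously per socket and that ranks map to sockets in order, and both helper names are hypothetical:

```python
import os


def cores_for_rank(total_cores: int, ranks_per_node: int, local_rank: int) -> range:
    """Return the contiguous block of core IDs assigned to `local_rank`."""
    if ranks_per_node < 1 or not 0 <= local_rank < ranks_per_node:
        raise ValueError("local_rank must be in [0, ranks_per_node)")
    per_rank = total_cores // ranks_per_node
    if per_rank < 1:
        raise ValueError("fewer cores than ranks")
    start = local_rank * per_rank
    return range(start, start + per_rank)


def bind_current_process(cores: range) -> bool:
    """Pin this process to `cores`; returns False where the OS lacks the call."""
    if not hasattr(os, "sched_setaffinity"):  # only available on Linux
        return False
    os.sched_setaffinity(0, set(cores))
    return True
```

On a node with four 72-core CPUs (288 cores in total) and four ranks, `cores_for_rank(288, 4, 1)` yields cores 72 through 143, so rank 1 stays on the second CPU under the numbering assumption above.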

Memory System Resource Partitioning and Monitoring (MPAM) is essential for optimal throughput. In SuperOffload, GPU execution is overlapped with CPU-based Adam execution. MPAM reduces interference between these two processes by partitioning cache and memory bandwidth resources, leading to smoother execution and better performance.

The SuperOffload engine achieves up to ~500 TFLOPS on GH200, approximately 50% higher throughput than standard ZeRO-Offload, by overlapping CPU optimizer computation with GPU forward/backward passes and utilizing optimized CPU-GPU data transfer patterns.

Usage Pattern

The typical environment setup flow is:

1. Install dependencies: pip install -r requirements.txt
2. Select the appropriate launch script for the target model and GPU count.
3. Choose the mode: superoffload or zerooffload (passed as the first argument).
4. Optionally override batch size (passed as the second argument, default 4).
5. Execute the script, which generates the DeepSpeed JSON config inline and launches training.

# Example: Fine-tune Llama 8B with SuperOffload on 1 GPU
bash finetune_llama-8b_1gpu.sh superoffload

# Example: Fine-tune Llama 70B with SuperOffload on 4 GPUs, batch size 8
bash finetune_llama-70b_4gpu.sh superoffload 8

# Example: Fall back to ZeRO-Offload
bash finetune_llama-8b_1gpu.sh zerooffload
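
When the scripts are driven from a Python wrapper rather than typed by hand, the argument convention shown in these examples (mode first, optional batch size second) can be assembled as follows. This is a hypothetical helper that only builds the argument list; it assumes the scripts accept exactly the two positional arguments described above:

```python
VALID_MODES = ("superoffload", "zerooffload")


def build_command(script: str, mode: str = "superoffload", batch_size=None) -> list:
    """Build the argv list for a launch script: bash <script> <mode> [batch_size]."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    command = ["bash", script, mode]
    if batch_size is not None:
        if int(batch_size) < 1:
            raise ValueError("batch_size must be a positive integer")
        command.append(str(int(batch_size)))
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the directory that contains the scripts. Omitting `batch_size` leaves the script's own default in effect.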

DeepSpeed Configuration Structure

The SuperOffload mode generates the following JSON config:

{
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true,
            "ratio": 0.90,
            "super_offload": true,
            "cpuadam_cores_perc": 0.90
        }
    },
    "wall_clock_breakdown": true
}
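
The same structure can be produced programmatically, which is convenient when the batch size or offload ratio varies between runs. A minimal sketch that reproduces the SuperOffload config shown above; the function name and its parameters are hypothetical, and the ZeRO-Offload variant is deliberately not modeled because its exact keys are not given here:

```python
import json


def build_superoffload_config(
    batch_size: int = 4,
    offload_ratio: float = 0.90,
    cpuadam_cores_perc: float = 0.90,
) -> dict:
    """Return a dict mirroring the SuperOffload DeepSpeed config shown above."""
    return {
        "train_batch_size": batch_size,
        "gradient_accumulation_steps": 1,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": False,
            "reduce_bucket_size": 4e8,
            "sub_group_size": 4e8,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
                "ratio": offload_ratio,
                "super_offload": True,
                "cpuadam_cores_perc": cpuadam_cores_perc,
            },
        },
        "wall_clock_breakdown": True,
    }


def to_json(config: dict) -> str:
    """Serialize the config with the same 4-space indentation as the example."""
    return json.dumps(config, indent=4)
```

With default arguments, parsing the output of `to_json(build_superoffload_config())` gives the same values as the JSON above; note that Python writes `4e8` as `400000000.0`, which is the same number.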

Key configuration parameters:

Parameter	Description	Typical Value
`stage`	ZeRO optimization stage	3
`overlap_comm`	Whether to overlap communication with computation	false
`reduce_bucket_size`	Size of gradient reduce buckets	4e8
`sub_group_size`	Sub-group size for parameter partitioning	4e8
`offload_optimizer.device`	Device for optimizer state offloading	cpu
`offload_optimizer.pin_memory`	Use pinned memory for CPU-GPU transfers	true
`offload_optimizer.ratio`	Fraction of optimizer work offloaded to CPU	0.80-0.90
`offload_optimizer.super_offload`	Enable SuperOffload engine	true
`offload_optimizer.cpuadam_cores_perc`	Fraction of CPU cores allocated to CPUAdam (0.90 = 90%)	0.90
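
A few sanity checks on these values can catch a mistyped config before a long job starts. The checks below are derived from the table, not from DeepSpeed's own validation, and the rounding used to turn cpuadam_cores_perc into a core count is an assumption made for illustration:

```python
import math


def check_offload_settings(zero_cfg: dict) -> list:
    """Return human-readable problems found in a zero_optimization section."""
    problems = []
    offload = zero_cfg.get("offload_optimizer", {})
    if zero_cfg.get("stage") != 3:
        problems.append("stage should be 3")
    if offload.get("device") != "cpu":
        problems.append("offload_optimizer.device should be 'cpu'")
    for key in ("ratio", "cpuadam_cores_perc"):
        value = offload.get(key)
        if not isinstance(value, (int, float)) or not 0 < value <= 1:
            problems.append(
                f"offload_optimizer.{key} should be a fraction in (0, 1], got {value!r}"
            )
    return problems


def cpuadam_core_count(total_cores: int, cores_perc: float) -> int:
    """Cores given to CPUAdam, assuming the fraction is rounded down (at least 1)."""
    return max(1, math.floor(total_cores * cores_perc))
```

Under the round-down assumption, a 72-core CPU with cpuadam_cores_perc set to 0.90 would dedicate 64 cores to CPUAdam and leave 8 for data loading and other host-side work.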
