Principle:Microsoft DeepSpeedExamples SuperOffload Environment

From Leeroopedia
Revision as of 17:24, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Microsoft_DeepSpeedExamples_SuperOffload_Environment.md)


Metadata

| Field | Value |
|---|---|
| Page Type | Principle |
| Title | SuperOffload_Environment |
| Repository | Microsoft/DeepSpeedExamples |
| Sources | Doc: DeepSpeed https://www.deepspeed.ai/tutorials/zero-offloading/ ; Blog: SuperOffload https://github.com/microsoft/DeepSpeedExamples/tree/master/training/DeepSpeed-SuperOffload |
| Domains | Infrastructure, Distributed_Training |
| Status | Active |
| Related Implementation | Implementation:Microsoft_DeepSpeedExamples_Launch_Scripts_SuperOffload |

Overview

A deployment methodology for configuring CPU-offloaded distributed training to fine-tune large models (8B-70B+) on limited GPU hardware.

Description

SuperOffload uses ZeRO Stage 3 with CPU offloading for both parameters and optimizer states. It is an optimized CPU offloading engine designed for full-parameter training on emerging "Superchips" such as NVIDIA GH200 / GB200 and AMD MI300A, which provide very high CPU-to-GPU bandwidth. The core environment setup involves:

  • ZeRO Stage 3 configuration -- Model parameters, gradients, and optimizer states are partitioned across the available GPUs, and CPU RAM holds the state that is offloaded from GPU memory.
  • NUMA binding via --bind_cores_to_rank -- Ensures optimal CPU-GPU affinity so that each GPU is paired with the CPU directly associated with it. This improves bandwidth and throughput.
  • Shell scripts for per-model settings -- Each model variant has a dedicated launch script that configures learning rate, batch size, sequence length, activation checkpointing, and the DeepSpeed JSON config inline.
  • SuperOffload-specific config flags -- The super_offload and cpuadam_cores_perc keys in the DeepSpeed JSON enable the optimized offloading engine and set the fraction of CPU cores allocated to the CPUAdam optimizer (0.90, i.e. 90%, in the config shown on this page).

The following models are supported with their respective GPU requirements:

| Model | GPUs Required | Example Script |
|---|---|---|
| GPT-OSS-20B | 1x GH200 | finetune_gpt-oss-20b_1gpu.sh |
| Qwen3-14B | 1x GH200 | finetune_qwen3-14b_1gpu.sh |
| Phi-4 | 1x GH200 | finetune_phi-4_1gpu.sh |
| Llama 8B | 1x GH200 | finetune_llama-8b_1gpu.sh |
| Seed-OSS-36B | 2x GH200 | finetune_seed-oss-36b_2gpu.sh |
| Qwen3-30B-A3B | 2x GH200 | finetune_qwen3-30b-a3b_2gpu.sh |
| Llama 70B | 4x GH200 | finetune_llama-70b_4gpu.sh |

Dependencies

The environment requires the following packages (from requirements.txt):

  • torch>=2.5.1
  • deepspeed>=0.17.0
  • datasets>=4.0.0
  • transformers>=4.56.1
  • numpy>=1.21.0
  • flash-attn>=2.0.0
  • wandb
  • packaging
  • psutil
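
Before launching a run, it can help to confirm that the installed packages meet the minimums listed above. The following is a minimal sketch using only the Python standard library; the helper names (version_tuple, find_problems, installed_versions) are hypothetical and not part of the repository, and the version comparison is deliberately simplified (pre-release and local suffixes such as "+cu124" are ignored).

```python
from importlib import metadata

# Minimum versions copied from the requirements list above.
MINIMUMS = {
    "torch": "2.5.1",
    "deepspeed": "0.17.0",
    "datasets": "4.0.0",
    "transformers": "4.56.1",
    "numpy": "1.21.0",
    "flash-attn": "2.0.0",
}


def version_tuple(version: str) -> tuple:
    """Turn '2.5.1+cu124' into (2, 5, 1); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def find_problems(installed: dict) -> list:
    """Return readable messages for packages that are missing or too old."""
    problems = []
    for name, minimum in MINIMUMS.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not installed (need >={minimum})")
        elif version_tuple(found) < version_tuple(minimum):
            problems.append(f"{name}: found {found}, need >={minimum}")
    return problems


def installed_versions() -> dict:
    """Look up the installed version of each required package, if present."""
    found = {}
    for name in MINIMUMS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


if __name__ == "__main__":
    report = find_problems(installed_versions())
    for line in report or ["all minimum versions satisfied"]:
        print(line)
```

For strict PEP 440 comparison, the packaging library (already in the requirements list) would be the more robust choice; the tuple comparison here only keeps the sketch dependency-free.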

Theoretical Basis

ZeRO-3 + CPU Offload moves parameters and optimizer states to CPU RAM, so GPU memory is needed mainly for activations and the working buffers of the layers currently being computed. This enables fine-tuning models whose model state (parameters, gradients, and optimizer states) far exceeds the available GPU memory.
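
To make the memory argument concrete, the sketch below estimates model-state size per rank. It assumes the usual mixed-precision Adam accounting from the ZeRO work (2 bytes per parameter for bf16 weights, 2 for gradients, 12 for fp32 optimizer state); these byte counts are an assumption brought in for illustration, not figures stated on this page, and activations are excluded. The function name model_state_gb is hypothetical.

```python
# Assumed bytes per parameter for mixed-precision Adam (see the note above).
BYTES_PARAMS_BF16 = 2
BYTES_GRADS_BF16 = 2
BYTES_OPTIMIZER_FP32 = 12  # fp32 master weights + momentum + variance


def model_state_gb(num_params: float, num_gpus: int = 1) -> dict:
    """Estimate per-rank model-state memory in GB (1 GB = 1e9 bytes).

    ZeRO Stage 3 partitions all three kinds of state evenly across ranks,
    so each per-rank figure is the total divided by num_gpus.
    """
    def per_rank(bytes_per_param: int) -> float:
        return num_params * bytes_per_param / num_gpus / 1e9

    estimate = {
        "params": per_rank(BYTES_PARAMS_BF16),
        "grads": per_rank(BYTES_GRADS_BF16),
        "optimizer": per_rank(BYTES_OPTIMIZER_FP32),
    }
    estimate["total"] = sum(estimate.values())
    return estimate


if __name__ == "__main__":
    for name, params, gpus in [("Llama 8B", 8e9, 1), ("Llama 70B", 70e9, 4)]:
        est = model_state_gb(params, gpus)
        print(f"{name} on {gpus} GPU(s): "
              f"{est['optimizer']:.0f} GB optimizer state, "
              f"{est['total']:.0f} GB total model state per rank")
```

Under these assumptions the optimizer state accounts for three quarters of the model state, which is why moving it to CPU RAM frees the largest share of GPU memory.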

NUMA binding prevents cross-socket memory access penalties. On multi-socket systems (such as dual-socket GH200 configurations), binding each training process to the CPU cores physically attached to its corresponding GPU ensures that memory accesses remain local to the NUMA domain. Without binding, memory traffic may cross the inter-socket link, incurring latency penalties of 2-3x and reducing effective bandwidth.
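
To see which cores belong to which NUMA node before relying on --bind_cores_to_rank, the topology can be read from Linux sysfs. The sketch below is only an inspection aid under the assumption of a Linux host exposing /sys/devices/system/node; it does not reproduce how the DeepSpeed launcher performs binding, and the helper names (parse_cpulist, numa_topology) are hypothetical.

```python
import glob
import os


def parse_cpulist(text: str) -> list:
    """Expand a Linux cpulist string such as '0-3,8,10-11' into core ids."""
    cores = []
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # nodes without CPUs report an empty list
        if "-" in chunk:
            first, last = chunk.split("-")
            cores.extend(range(int(first), int(last) + 1))
        else:
            cores.append(int(chunk))
    return cores


def numa_topology(sysfs_root: str = "/sys/devices/system/node") -> dict:
    """Map NUMA node id -> list of core ids by reading sysfs (Linux only)."""
    topology = {}
    pattern = os.path.join(sysfs_root, "node[0-9]*", "cpulist")
    for path in sorted(glob.glob(pattern)):
        node_dir = os.path.basename(os.path.dirname(path))
        node_id = int(node_dir[len("node"):])
        with open(path) as handle:
            topology[node_id] = parse_cpulist(handle.read())
    return topology


if __name__ == "__main__":
    for node_id, cores in sorted(numa_topology().items()):
        print(f"NUMA node {node_id}: {len(cores)} cores")
    # sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))
        print(f"this process may run on {len(allowed)} cores")
```

Running this inside a training process launched with and without binding gives a quick check that the allowed core set actually shrinks to a single NUMA node.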

Memory System Resource Partitioning and Monitoring (MPAM) is essential for optimal throughput. In SuperOffload, GPU execution is overlapped with CPU-based Adam execution. MPAM reduces interference between these two processes by partitioning cache and memory bandwidth resources, leading to smoother execution and better performance.

The SuperOffload engine achieves up to ~500 TFLOPS on GH200, approximately 50% higher throughput than standard ZeRO-Offload, by overlapping CPU optimizer computation with GPU forward/backward passes and utilizing optimized CPU-GPU data transfer patterns.

Usage Pattern

The typical environment setup flow is:

  1. Install dependencies: pip install -r requirements.txt
  2. Select the appropriate launch script for the target model and GPU count.
  3. Choose the mode: superoffload or zerooffload (passed as the first argument).
  4. Optionally override batch size (passed as the second argument, default 4).
  5. Execute the script, which generates the DeepSpeed JSON config inline and launches training.

# Example: Fine-tune Llama 8B with SuperOffload on 1 GPU
bash finetune_llama-8b_1gpu.sh superoffload

# Example: Fine-tune Llama 70B with SuperOffload on 4 GPUs, batch size 8
bash finetune_llama-70b_4gpu.sh superoffload 8

# Example: Fall back to ZeRO-Offload
bash finetune_llama-8b_1gpu.sh zerooffload
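
When the launch scripts are driven from a Python job runner, the same positional arguments can be assembled programmatically. This is a minimal sketch assuming the argument order described above (mode first, optional batch size second); build_command and launch are hypothetical helpers, not part of the repository.

```python
import subprocess

VALID_MODES = ("superoffload", "zerooffload")


def build_command(script: str, mode: str = "superoffload", batch_size=None) -> list:
    """Build the argv for a launch script: bash <script> <mode> [batch_size]."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    command = ["bash", script, mode]
    if batch_size is not None:
        if int(batch_size) < 1:
            raise ValueError("batch_size must be a positive integer")
        command.append(str(int(batch_size)))
    return command


def launch(script: str, mode: str = "superoffload", batch_size=None) -> int:
    """Run the launch script and return its exit code."""
    return subprocess.run(build_command(script, mode, batch_size)).returncode
```

For instance, build_command("finetune_llama-70b_4gpu.sh", "superoffload", 8) yields the same invocation as the second shell example; omitting batch_size leaves the script's own default in effect.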

DeepSpeed Configuration Structure

The SuperOffload mode generates the following JSON config:

{
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true,
            "ratio": 0.90,
            "super_offload": true,
            "cpuadam_cores_perc": 0.90
        }
    },
    "wall_clock_breakdown": true
}

Key configuration parameters:

| Parameter (under zero_optimization) | Description | Typical Value |
|---|---|---|
| stage | ZeRO optimization stage | 3 |
| overlap_comm | Whether to overlap communication with computation | false |
| reduce_bucket_size | Size of gradient reduce buckets | 4e8 |
| sub_group_size | Sub-group size for parameter partitioning | 4e8 |
| offload_optimizer.device | Device for optimizer state offloading | cpu |
| offload_optimizer.pin_memory | Use pinned memory for CPU-GPU transfers | true |
| offload_optimizer.ratio | Fraction of optimizer work offloaded to CPU | 0.80-0.90 |
| offload_optimizer.super_offload | Enable SuperOffload engine | true |
| offload_optimizer.cpuadam_cores_perc | Fraction of CPU cores used for CPUAdam (0.90 = 90%) | 0.90 |
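
The same configuration can also be produced from Python, which is convenient when sweeping batch size or offload ratio. The sketch below mirrors the JSON shown on this page; the function names are hypothetical, and the 0-1 range check is a sanity check added for illustration rather than a rule taken from the launch scripts.

```python
import json


def build_superoffload_config(train_batch_size: int = 4,
                              ratio: float = 0.90,
                              cpuadam_cores_perc: float = 0.90) -> dict:
    """Return a dict mirroring the SuperOffload JSON config on this page."""
    for name, value in (("ratio", ratio),
                        ("cpuadam_cores_perc", cpuadam_cores_perc)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction in [0, 1], got {value}")
    return {
        "train_batch_size": train_batch_size,
        "gradient_accumulation_steps": 1,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": False,
            "reduce_bucket_size": 4e8,
            "sub_group_size": 4e8,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
                "ratio": ratio,
                "super_offload": True,
                "cpuadam_cores_perc": cpuadam_cores_perc,
            },
        },
        "wall_clock_breakdown": True,
    }


def write_config(path: str, **overrides) -> dict:
    """Write the config as JSON to `path` and return the dict that was written."""
    config = build_superoffload_config(**overrides)
    with open(path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config


if __name__ == "__main__":
    print(json.dumps(build_superoffload_config(train_batch_size=8), indent=4))
```

Keeping the values in one function makes it harder for the two SuperOffload-specific keys to drift apart from the rest of the offload_optimizer block when experimenting.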
