Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Environment:Microsoft DeepSpeedExamples SuperOffload Runtime

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Infrastructure, Optimization
Last Updated 2026-02-07 13:00 GMT

Overview

Linux environment with PyTorch >= 2.5.1, DeepSpeed >= 0.17.0, Flash Attention >= 2.0.0, and NVIDIA GH200/GB200 Superchips with NUMA binding for high-throughput CPU-offloaded fine-tuning of 8B-70B parameter models.

Description

This environment is purpose-built for NVIDIA Grace Hopper (GH200) and Grace Blackwell (GB200) Superchips, which feature integrated high-bandwidth CPU-GPU interconnects. DeepSpeed SuperOffload exploits this architecture to achieve ~50% higher throughput than standard ZeRO-Offload by using NUMA-aware CPU core binding, pinned memory, and aggressive CPU Adam parallelism (90% of CPU cores). The environment requires recent versions of DeepSpeed (0.17.0+) and PyTorch (2.5.1+) with Flash Attention support.

Usage

Use this environment for fine-tuning large language models (8B-70B parameters) on NVIDIA Superchips (GH200/GB200) with CPU offloading. It is the mandatory prerequisite for the Launch_Scripts_SuperOffload, Load_And_Preprocess_Dataset, Load_Model_SuperOffload, DeepSpeed_Initialize_SuperOffload, Main_Training_Loop_SuperOffload, and DeepSpeed_Save_Checkpoint implementations.

System Requirements

Category Requirement Notes
OS Linux NUMA support required; tested on Ubuntu 22.04
Hardware (1 GPU) NVIDIA GH200 Superchip For models up to 20B (GPT-OSS-20B, Phi-4, Qwen3-14B)
Hardware (2 GPUs) 2x NVIDIA GH200 For models up to 36B (Seed-OSS-36B, Qwen3-30B-A3B)
Hardware (4 GPUs) 4x NVIDIA GH200 For models up to 70B (LLaMA-3.3-70B)
NUMA Required Must bind CPU cores to GPU rank for optimal performance
MPAM Recommended Memory System Resource Partitioning and Monitoring for throughput

Dependencies

System Packages

  • CUDA Toolkit 12.x
  • Flash Attention 2 compatible GPU driver
  • `numactl` (for NUMA binding)

Python Packages

  • `torch` >= 2.5.1
  • `deepspeed` >= 0.17.0
  • `transformers` >= 4.56.1
  • `datasets` >= 4.0.0
  • `numpy` >= 1.21.0
  • `flash-attn` >= 2.0.0
  • `wandb` (optional, for logging)
  • `packaging`
  • `psutil`

Credentials

The following environment variables are used:

  • `TOKENIZERS_PARALLELISM`: Set to `"false"` (required, hardcoded in code)
  • `WANDB_API_KEY`: Weights & Biases API key (optional, for training logging)

Quick Install

# Install all required packages
pip install "torch>=2.5.1" "deepspeed>=0.17.0" "transformers>=4.56.1" \
    "datasets>=4.0.0" "numpy>=1.21.0" "flash-attn>=2.0.0" \
    wandb packaging psutil

Code Evidence

Requirements from `training/DeepSpeed-SuperOffload/requirements.txt`:

torch>=2.5.1
deepspeed>=0.17.0
datasets>=4.0.0
transformers>=4.56.1
numpy>=1.21.0
flash-attn>=2.0.0
wandb
packaging
psutil

Environment variable setting from `training/DeepSpeed-SuperOffload/finetune_zero3.py:27`:

os.environ["TOKENIZERS_PARALLELISM"] = "false"

SuperOffload ZeRO-3 configuration from shell scripts:

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true,
        "ratio": 0.90,
        "super_offload": true,
        "cpuadam_cores_perc": 0.90
    }
}

Common Errors

Error Message Cause Solution
`Failed to initialize WandB` WandB not configured Set `WANDB_API_KEY` or install `wandb`; non-critical (logging disabled gracefully)
Poor training throughput NUMA binding not enabled Add `--bind_cores_to_rank` to DeepSpeed launch command
`flash_attn not found` Flash Attention not installed `pip install flash-attn>=2.0.0` (requires CUDA 11.6+)
`CUDA out of memory` Model too large for GPU count Increase number of GH200 GPUs or reduce model size

Compatibility Notes

  • NVIDIA GH200/GB200 Only: SuperOffload is specifically optimized for Superchips with integrated CPU-GPU interconnect; standard GPUs will not achieve the same performance benefits
  • NUMA Binding Required: Use `--bind_cores_to_rank` flag for DeepSpeed launcher; without it, performance degrades significantly
  • Flash Attention: Required for compute-efficient attention; GPU must support Flash Attention 2 (Ampere or newer)
  • WandB: Optional; failures are handled gracefully by disabling logging

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment