Environment:Microsoft DeepSpeedExamples SuperOffload Runtime

Knowledge Sources	DeepSpeed-SuperOffload SuperOffload README
Domains	Deep_Learning, Infrastructure, Optimization
Last Updated	2026-02-07 13:00 GMT

Overview

Linux environment with PyTorch >= 2.5.1, DeepSpeed >= 0.17.0, Flash Attention >= 2.0.0, and NVIDIA GH200/GB200 Superchips with NUMA binding for high-throughput CPU-offloaded fine-tuning of 8B-70B parameter models.

Description

This environment is purpose-built for NVIDIA Grace Hopper (GH200) and Grace Blackwell (GB200) Superchips, which feature integrated high-bandwidth CPU-GPU interconnects. DeepSpeed SuperOffload exploits this architecture to achieve ~50% higher throughput than standard ZeRO-Offload by combining NUMA-aware CPU core binding, pinned memory, and CPU Adam parallelism across 90% of the CPU cores (`cpuadam_cores_perc: 0.90`). The environment requires recent versions of DeepSpeed (0.17.0+) and PyTorch (2.5.1+) with Flash Attention support.

Usage

Use this environment for fine-tuning large language models (8B-70B parameters) on NVIDIA Superchips (GH200/GB200) with CPU offloading. It is the mandatory prerequisite for the Launch_Scripts_SuperOffload, Load_And_Preprocess_Dataset, Load_Model_SuperOffload, DeepSpeed_Initialize_SuperOffload, Main_Training_Loop_SuperOffload, and DeepSpeed_Save_Checkpoint implementations.

System Requirements

Category	Requirement	Notes
OS	Linux	NUMA support required; tested on Ubuntu 22.04
Hardware (1 GPU)	NVIDIA GH200 Superchip	For models up to 20B (GPT-OSS-20B, Phi-4, Qwen3-14B)
Hardware (2 GPUs)	2x NVIDIA GH200	For models up to 36B (Seed-OSS-36B, Qwen3-30B-A3B)
Hardware (4 GPUs)	4x NVIDIA GH200	For models up to 70B (LLaMA-3.3-70B)
NUMA	Required	Must bind CPU cores to GPU rank for optimal performance
MPAM	Recommended	Memory System Resource Partitioning and Monitoring for throughput
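
As a quick preflight aid, the sizing rows above can be turned into a lookup. This is a minimal sketch, not part of the SuperOffload code: the names `GH200_TIERS`, `gh200_count_for`, and `numactl_available` are hypothetical, and the thresholds are simply the "up to" sizes listed in the table.

```python
import shutil

# (max model size in billions of parameters, GH200 count), copied from the table above.
GH200_TIERS = [(20, 1), (36, 2), (70, 4)]


def gh200_count_for(model_size_b: float) -> int:
    """Return the smallest tabulated GH200 count for a model of the given size."""
    for max_size_b, num_gpus in GH200_TIERS:
        if model_size_b <= max_size_b:
            return num_gpus
    raise ValueError(f"{model_size_b}B exceeds the largest tabulated size (70B)")


def numactl_available() -> bool:
    """True if the numactl binary is on PATH (the page lists it for NUMA binding)."""
    return shutil.which("numactl") is not None


if __name__ == "__main__":
    print("GH200s for a 14B model:", gh200_count_for(14))
    print("numactl found:", numactl_available())
```

The table only gives tested upper bounds, so treat the result as a starting point; actual memory use also depends on sequence length and batch size.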

Dependencies

System Packages

CUDA Toolkit 12.x
Flash Attention 2 compatible GPU driver
`numactl` (for NUMA binding)

Python Packages

`torch` >= 2.5.1
`deepspeed` >= 0.17.0
`transformers` >= 4.56.1
`datasets` >= 4.0.0
`numpy` >= 1.21.0
`flash-attn` >= 2.0.0
`wandb` (optional, for logging)
`packaging`
`psutil`
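
The minimum versions above can be checked against what is actually installed. The following is a rough sketch using only the standard library; `MINIMUMS`, `release_tuple`, `meets_minimum`, and `check_environment` are hypothetical helper names. It compares only the leading numeric part of each version, which is an intentional simplification (a pre-release such as `2.5.1rc1` is treated as `2.5.1`); `packaging.version.Version` gives stricter comparisons if that matters.

```python
import re
from importlib import metadata

# Minimum versions copied from the package list above.
MINIMUMS = {
    "torch": "2.5.1",
    "deepspeed": "0.17.0",
    "transformers": "4.56.1",
    "datasets": "4.0.0",
    "numpy": "1.21.0",
    "flash-attn": "2.0.0",
}


def release_tuple(version: str) -> tuple:
    """Leading numeric release part of a version, e.g. '2.5.1+cu124' -> (2, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples after zero-padding them to the same length."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS) -> dict:
    """Map each package name to 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = "ok" if ok else f"too old ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```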

Credentials

The following environment variables are used:

`TOKENIZERS_PARALLELISM`: Set to `"false"` by `finetune_zero3.py` itself at startup (hardcoded), so no user action is needed
`WANDB_API_KEY`: Weights & Biases API key (optional, for training logging)

Quick Install

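# Install PyTorch first; the flash-attn build typically needs torch to be present
pip install "torch>=2.5.1"
# Install the remaining packages
pip install "deepspeed>=0.17.0" "transformers>=4.56.1" \
    "datasets>=4.0.0" "numpy>=1.21.0" wandb packaging psutil
# Install flash-attn last (the flash-attn project recommends --no-build-isolation)
pip install "flash-attn>=2.0.0" --no-build-isolation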

Code Evidence

Requirements from `training/DeepSpeed-SuperOffload/requirements.txt`:

torch>=2.5.1
deepspeed>=0.17.0
datasets>=4.0.0
transformers>=4.56.1
numpy>=1.21.0
flash-attn>=2.0.0
wandb
packaging
psutil

Environment variable setting from `training/DeepSpeed-SuperOffload/finetune_zero3.py:27`:

os.environ["TOKENIZERS_PARALLELISM"] = "false"

SuperOffload ZeRO-3 configuration from shell scripts:

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true,
        "ratio": 0.90,
        "super_offload": true,
        "cpuadam_cores_perc": 0.90
    }
}
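
For scripts that generate the config programmatically, the fragment above could be built as a Python dict and serialized to JSON. This is a sketch under assumptions: the function name `superoffload_zero_config` is hypothetical, and the range check on the two fractions is an assumption of this example rather than documented DeepSpeed behavior. A complete DeepSpeed config needs further keys (batch size, precision, and so on) that this page does not show.

```python
import json


def superoffload_zero_config(ratio: float = 0.90,
                             cpuadam_cores_perc: float = 0.90) -> dict:
    """Build the zero_optimization fragment shown above as a Python dict."""
    # Assumption of this sketch: both values are fractions in (0, 1].
    for name, value in (("ratio", ratio),
                        ("cpuadam_cores_perc", cpuadam_cores_perc)):
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    return {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
                "ratio": ratio,
                "super_offload": True,
                "cpuadam_cores_perc": cpuadam_cores_perc,
            },
        }
    }


if __name__ == "__main__":
    # json.dumps converts Python True to JSON true, matching the fragment above.
    print(json.dumps(superoffload_zero_config(), indent=4))
```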

Common Errors

Error Message	Cause	Solution
`Failed to initialize WandB`	WandB not configured	Set `WANDB_API_KEY` or install `wandb`; non-critical (logging disabled gracefully)
Poor training throughput	NUMA binding not enabled	Add `--bind_cores_to_rank` to DeepSpeed launch command
`flash_attn not found`	Flash Attention not installed	`pip install "flash-attn>=2.0.0"` (quote the specifier so the shell does not treat `>` as a redirect; requires CUDA 11.6+)
`CUDA out of memory`	Model too large for GPU count	Use more GH200s (per System Requirements: 1 for up to 20B, 2 for up to 36B, 4 for up to 70B) or choose a smaller model
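
To make the NUMA-binding fix concrete, here is a small sketch that assembles a DeepSpeed launcher command with `--bind_cores_to_rank`. The helper name `build_launch_command` and its `script_args` parameter are hypothetical; `--num_gpus` is the standard DeepSpeed launcher option, and the arguments that `finetune_zero3.py` itself accepts are not listed on this page, so none are filled in.

```python
import shlex


def build_launch_command(num_gpus: int,
                         script: str = "finetune_zero3.py",
                         script_args=()) -> list:
    """Assemble a deepspeed launcher command with NUMA core binding enabled."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "deepspeed",
        "--num_gpus", str(num_gpus),
        "--bind_cores_to_rank",   # bind each rank to its local CPU cores
        script,
        *script_args,             # script-specific arguments, if any
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for a 4x GH200 node.
    print(shlex.join(build_launch_command(4)))
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` without shell quoting issues.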

Compatibility Notes

NVIDIA GH200/GB200 Only: SuperOffload is specifically optimized for Superchips with integrated CPU-GPU interconnect; standard GPUs will not achieve the same performance benefits
NUMA Binding Required: Use `--bind_cores_to_rank` flag for DeepSpeed launcher; without it, performance degrades significantly
Flash Attention: Required for compute-efficient attention; GPU must support Flash Attention 2 (Ampere or newer)
WandB: Optional; failures are handled gracefully by disabling logging
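
The graceful WandB fallback described above might look roughly like the following. This is an illustrative sketch, not the code in `finetune_zero3.py`: the function names and the injectable `init_fn` parameter are hypothetical, and only `wandb.init(project=...)` and `run.log(...)` are real WandB calls.

```python
def _default_init(project: str):
    # Imported lazily so that a missing wandb package is caught by the caller.
    import wandb
    return wandb.init(project=project)


def init_wandb_or_disable(project: str, init_fn=_default_init):
    """Return a run object, or None if WandB could not be initialized."""
    try:
        return init_fn(project)
    except Exception as exc:
        # Broad on purpose: any failure should only disable logging,
        # never abort training.
        print(f"Failed to initialize WandB: {exc}. Logging disabled.")
        return None


def log_metrics(run, metrics: dict) -> bool:
    """Log metrics if a run exists; return whether anything was logged."""
    if run is None:
        return False
    run.log(metrics)
    return True
```

With this shape, the training loop can call `log_metrics` unconditionally and never needs to check whether WandB is available.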
