Environment:Microsoft DeepSpeedExamples SuperOffload Runtime
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Infrastructure, Optimization |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
Linux environment with PyTorch >= 2.5.1, DeepSpeed >= 0.17.0, Flash Attention >= 2.0.0, and NVIDIA GH200/GB200 Superchips with NUMA binding for high-throughput CPU-offloaded fine-tuning of 8B-70B parameter models.
Description
This environment is purpose-built for NVIDIA Grace Hopper (GH200) and Grace Blackwell (GB200) Superchips, which feature integrated high-bandwidth CPU-GPU interconnects. DeepSpeed SuperOffload exploits this architecture to achieve ~50% higher throughput than standard ZeRO-Offload by using NUMA-aware CPU core binding, pinned memory, and aggressive CPU Adam parallelism (90% of CPU cores). The environment requires recent versions of DeepSpeed (0.17.0+) and PyTorch (2.5.1+) with Flash Attention support.
Usage
Use this environment to fine-tune large language models (8B-70B parameters) on NVIDIA Superchips (GH200/GB200) with CPU offloading. It is a mandatory prerequisite for the Launch_Scripts_SuperOffload, Load_And_Preprocess_Dataset, Load_Model_SuperOffload, DeepSpeed_Initialize_SuperOffload, Main_Training_Loop_SuperOffload, and DeepSpeed_Save_Checkpoint implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | NUMA support required; tested on Ubuntu 22.04 |
| Hardware (1 GPU) | NVIDIA GH200 Superchip | For models up to 20B (GPT-OSS-20B, Phi-4, Qwen3-14B) |
| Hardware (2 GPUs) | 2x NVIDIA GH200 | For models up to 36B (Seed-OSS-36B, Qwen3-30B-A3B) |
| Hardware (4 GPUs) | 4x NVIDIA GH200 | For models up to 70B (LLaMA-3.3-70B) |
| NUMA | Required | Must bind CPU cores to GPU rank for optimal performance |
| MPAM | Recommended | Arm Memory System Resource Partitioning and Monitoring; enabling it is recommended for higher throughput |
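The GPU-count rows in the table can be read as a simple lookup. The sketch below encodes them in Python; the helper name `min_gh200_count` and the use of parameter count (in billions) as the only input are assumptions made for illustration. Actual memory needs also depend on sequence length, batch size, and precision, so treat the result as a starting point rather than a guarantee.

```python
# Hypothetical helper: maps a model size to the GH200 count listed in the
# System Requirements table. Thresholds are copied from that table.
GH200_TIERS = [(20, 1), (36, 2), (70, 4)]  # (max params in billions, GPU count)


def min_gh200_count(params_billions: float) -> int:
    """Return the smallest listed GH200 count that covers the model size."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    for max_params, gpu_count in GH200_TIERS:
        if params_billions <= max_params:
            return gpu_count
    raise ValueError(f"{params_billions}B exceeds the largest listed tier (70B)")


if __name__ == "__main__":
    for size in (14, 30, 70):
        print(f"{size}B -> {min_gh200_count(size)} x GH200")
```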
Dependencies
System Packages
- CUDA Toolkit 12.x
- Flash Attention 2 compatible GPU driver
- `numactl` (for NUMA binding)
Python Packages
- `torch` >= 2.5.1
- `deepspeed` >= 0.17.0
- `transformers` >= 4.56.1
- `datasets` >= 4.0.0
- `numpy` >= 1.21.0
- `flash-attn` >= 2.0.0
- `wandb` (optional, for logging)
- `packaging`
- `psutil`
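A quick preflight check can catch missing or outdated packages before a long job starts. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the version comparison is deliberately simplified (it looks only at the leading numeric release, so pre-release suffixes such as `rc1` are ignored). For stricter comparison, `packaging.version.Version` from the `packaging` package listed above would be the more robust choice.

```python
import re
from importlib import metadata

# Minimum versions copied from the Python Packages list.
MINIMUMS = {
    "torch": "2.5.1",
    "deepspeed": "0.17.0",
    "transformers": "4.56.1",
    "datasets": "4.0.0",
    "numpy": "1.21.0",
    "flash-attn": "2.0.0",
}


def release_tuple(version: str) -> tuple:
    """Parse the leading numeric release, e.g. '2.5.1+cu124' -> (2, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_older(found: str, minimum: str) -> bool:
    """Compare releases after zero-padding both to the same length."""
    a, b = release_tuple(found), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def installed_version(dist_name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def unmet_requirements(installed: dict) -> list:
    """List readable problems for packages that are missing or too old."""
    problems = []
    for name, minimum in MINIMUMS.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not installed (need >= {minimum})")
        elif is_older(found, minimum):
            problems.append(f"{name}: {found} < {minimum}")
    return problems


if __name__ == "__main__":
    current = {name: installed_version(name) for name in MINIMUMS}
    for line in unmet_requirements(current) or ["all minimum versions satisfied"]:
        print(line)
```

Passing the installed versions in as a dictionary keeps the comparison logic testable without any of the heavy packages being present.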
Credentials
The following environment variables are used:
- `TOKENIZERS_PARALLELISM`: Set to `"false"` by the training script itself (hardcoded in `finetune_zero3.py`), so no manual setup is needed
- `WANDB_API_KEY`: Weights & Biases API key (optional, for training logging)
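For a custom entry point that reuses this environment, the two variables can be handled in one place. This is a rough sketch, not the repository's code: `prepare_environment` is a hypothetical helper, and checking only `WANDB_API_KEY` is a simplification, since credentials stored by `wandb login` would not be detected this way.

```python
import os


def prepare_environment(environ=os.environ) -> bool:
    """Disable tokenizer parallelism and report whether a WandB key is set."""
    # Same value the training script sets. Doing this before tokenizers are
    # used avoids the fork-related parallelism warning from Hugging Face
    # tokenizers when data-loading worker processes are spawned.
    environ["TOKENIZERS_PARALLELISM"] = "false"
    # Simplified check: only the environment variable is inspected.
    return bool(environ.get("WANDB_API_KEY"))


if __name__ == "__main__":
    if prepare_environment():
        print("WANDB_API_KEY found; logging can be attempted")
    else:
        print("WANDB_API_KEY not set; continuing without WandB logging")
```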
Quick Install
# Install all required packages
pip install "torch>=2.5.1" "deepspeed>=0.17.0" "transformers>=4.56.1" \
"datasets>=4.0.0" "numpy>=1.21.0" "flash-attn>=2.0.0" \
wandb packaging psutil
Code Evidence
Requirements from `training/DeepSpeed-SuperOffload/requirements.txt`:
torch>=2.5.1
deepspeed>=0.17.0
datasets>=4.0.0
transformers>=4.56.1
numpy>=1.21.0
flash-attn>=2.0.0
wandb
packaging
psutil
Environment variable setting from `training/DeepSpeed-SuperOffload/finetune_zero3.py:27`:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
SuperOffload ZeRO-3 configuration from shell scripts:
"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true,
        "ratio": 0.90,
        "super_offload": true,
        "cpuadam_cores_perc": 0.90
    }
}
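When the configuration is generated from Python rather than a shell heredoc, the same fragment can be built as a dictionary. The sketch below reproduces only the keys shown above; the function name `build_zero3_superoffload` is hypothetical, and the range check on the two fractional values is an assumption added for safety rather than documented DeepSpeed behavior. A complete DeepSpeed config needs further sections (batch size, precision, and so on) that are not covered here.

```python
import json


def build_zero3_superoffload(ratio: float = 0.90,
                             cpuadam_cores_perc: float = 0.90) -> dict:
    """Return the ZeRO-3 SuperOffload fragment as a Python dict."""
    # Assumption: both values are fractions in (0, 1].
    for name, value in (("ratio", ratio),
                        ("cpuadam_cores_perc", cpuadam_cores_perc)):
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    return {
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
                "ratio": ratio,
                "super_offload": True,
                "cpuadam_cores_perc": cpuadam_cores_perc,
            },
        }
    }


if __name__ == "__main__":
    # Prints JSON that can be merged into a full DeepSpeed config file.
    print(json.dumps(build_zero3_superoffload(), indent=2))
```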
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Failed to initialize WandB` | WandB not configured | Set `WANDB_API_KEY` or install `wandb`; non-critical (logging disabled gracefully) |
| Poor training throughput | NUMA binding not enabled | Add `--bind_cores_to_rank` to DeepSpeed launch command |
| `flash_attn not found` | Flash Attention not installed | `pip install "flash-attn>=2.0.0"` (requires CUDA 11.6+; quote the specifier so the shell does not treat `>` as a redirect) |
| `CUDA out of memory` | Model too large for GPU count | Increase number of GH200 GPUs or reduce model size |
Compatibility Notes
- NVIDIA GH200/GB200 Only: SuperOffload is specifically optimized for Superchips with integrated CPU-GPU interconnect; standard GPUs will not achieve the same performance benefits
- NUMA Binding Required: Pass the `--bind_cores_to_rank` flag to the DeepSpeed launcher; without it, performance degrades significantly
- Flash Attention: Required for compute-efficient attention; GPU must support Flash Attention 2 (Ampere or newer)
- WandB: Optional; failures are handled gracefully by disabling logging
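To make the NUMA binding note concrete, the sketch below assembles a launcher command that always includes `--bind_cores_to_rank`. The helper name `build_launch_command` is hypothetical, script-specific arguments are left out because this page does not list them, and the `numactl` check reflects only that this page lists it as the package used for NUMA binding.

```python
import shlex
import shutil


def build_launch_command(num_gpus: int, script: str, script_args=()) -> list:
    """Build a `deepspeed` launcher argv with core binding enabled."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "deepspeed",
        "--num_gpus", str(num_gpus),
        "--bind_cores_to_rank",  # NUMA-aware CPU core binding per rank
        script,
        *script_args,
    ]


if __name__ == "__main__":
    if shutil.which("numactl") is None:
        print("warning: numactl not found on PATH")
    # Prints the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_launch_command(4, "finetune_zero3.py")))
```

Building the command as a list and printing it with `shlex.join` keeps quoting correct if arguments with spaces are added later.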
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Launch_Scripts_SuperOffload
- Implementation:Microsoft_DeepSpeedExamples_Load_And_Preprocess_Dataset
- Implementation:Microsoft_DeepSpeedExamples_Load_Model_SuperOffload
- Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Initialize_SuperOffload
- Implementation:Microsoft_DeepSpeedExamples_Main_Training_Loop_SuperOffload
- Implementation:Microsoft_DeepSpeedExamples_DeepSpeed_Save_Checkpoint