Heuristic:Microsoft DeepSpeedExamples SuperOffload NUMA Binding

Knowledge Sources	SuperOffload README, GH200 Benchmarks
Domains	Optimization, Infrastructure, Deep_Learning
Last Updated	2026-02-07 13:00 GMT

Overview

NUMA binding and explicit CPU core allocation are required, not optional, to reach SuperOffload's reported performance on GH200/GB200 Superchips: roughly 50% higher throughput than standard ZeRO-Offload.

Description

NVIDIA GH200 and GB200 Superchips have an integrated CPU-GPU interconnect where each GPU has directly associated CPU cores and memory banks (NUMA nodes). Without NUMA binding, the OS scheduler may assign CPU computation to cores that are not physically paired with the GPU, causing cross-NUMA memory accesses that dramatically increase latency. SuperOffload exploits the co-located CPU-GPU architecture by running DeepSpeedCPUAdam on the directly associated cores. Enabling NUMA binding with 90% CPU core allocation for Adam computation is critical for achieving the advertised ~500 TFLOPS on GH200.

Usage

Apply this heuristic whenever using DeepSpeed SuperOffload on GH200 or GB200 hardware. The binding must be configured at launch time with the DeepSpeed launcher flag `--bind_cores_to_rank`; it cannot be added to a job that is already running. If training throughput is significantly below ~500 TFLOPS on GH200, a NUMA binding misconfiguration is the first thing to check.
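One way to check whether binding took effect is to compare a rank's CPU affinity with the cores of its NUMA node. The sketch below is an illustration only and is not part of SuperOffload: `parse_cpulist`, `numa_node_cores`, and `is_bound_to_node` are hypothetical helper names, the code assumes a Linux host that exposes `/sys/devices/system/node/node<N>/cpulist`, and which NUMA node belongs to which rank is left to the caller.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8,10-11" into a set of core ids."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return cores


def numa_node_cores(node):
    """Read the core ids of one NUMA node from sysfs (Linux only)."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    return parse_cpulist(path.read_text())


def is_bound_to_node(node, affinity=None):
    """Return True if every core this process may run on belongs to `node`."""
    if affinity is None:
        affinity = os.sched_getaffinity(0)  # Linux only
    return set(affinity) <= numa_node_cores(node)
```

Called from inside a training process, `is_bound_to_node(n)` returns False when the process is still allowed to run on cores outside node `n`, which is the situation this heuristic is meant to rule out.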

The Insight (Rule of Thumb)

Action 1: Add `--bind_cores_to_rank` to the DeepSpeed launch command. This is required, not optional.
Action 2: Set `"cpuadam_cores_perc": 0.90` in the `offload_optimizer` block of the ZeRO configuration to allocate 90% of CPU cores to Adam computation.
Action 3: Set `"ratio": 0.90` for optimizer state offloading (offload 90% of optimizer states to CPU).
Action 4: Enable Memory System Resource Partitioning and Monitoring (MPAM) at the system level for optimal throughput.
Value: SuperOffload achieves ~500 TFLOPS on GH200 (roughly 50% higher than standard ZeRO-Offload).
Trade-off: Requires specific hardware (GH200/GB200 Superchips); standard GPUs will not achieve the same benefits from NUMA binding.
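Actions 1 to 3 can be collected in a small script. This is a minimal sketch, not the project's own tooling: the values mirror the configuration fragment and launch pattern quoted under Code Evidence, while `write_config`, `launch_command`, and the default file names are hypothetical. A real DeepSpeed config needs further keys (batch size, precision, and so on) that are omitted here.

```python
import json

# ZeRO fragment with the SuperOffload settings from Actions 2 and 3.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True,
            "ratio": 0.90,               # Action 3
            "super_offload": True,
            "cpuadam_cores_perc": 0.90,  # Action 2
        },
    }
}


def write_config(path="ds_config.json"):
    """Write the fragment to disk (file name is an assumption)."""
    with open(path, "w") as f:
        json.dump(ds_config, f, indent=4)
    return path


def launch_command(script="finetune_zero3.py", config="ds_config.json"):
    """Build the launcher argument list with core binding enabled (Action 1)."""
    return ["deepspeed", "--bind_cores_to_rank", script,
            "--deepspeed_config", config]
```

Building the command as a list keeps `--bind_cores_to_rank` ahead of the script name, so it is read by the launcher rather than passed through to the training script.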

Reasoning

Modern NUMA systems have non-uniform memory access latencies. Each GH200 Superchip pairs one Grace CPU with one Hopper GPU, so on a node with 4 GH200 Superchips each GPU has its own paired NUMA node. Without binding, a CPU thread computing Adam updates for GPU 0 might be scheduled on a core in GPU 3's NUMA node, resulting in:

Cross-NUMA memory access latency (2-3x slower than local access)
PCIe bandwidth contention
CPU cache thrashing

With `--bind_cores_to_rank`, DeepSpeed ensures each GPU rank's CPU threads run exclusively on the paired NUMA node's cores. Combined with `cpuadam_cores_perc=0.90`, this saturates the local memory bandwidth for Adam computation while leaving 10% of cores for system tasks.
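The arithmetic behind this split can be sketched as follows. The code illustrates the idea only and is not DeepSpeed's implementation: `cores_for_rank` and `split_adam_cores` are hypothetical names, core ids are assumed to be numbered contiguously per NUMA node, and rounding down is an assumption.

```python
def cores_for_rank(total_cores, num_ranks, rank):
    """Give each rank an equal, contiguous, non-overlapping slice of core ids."""
    if not 0 <= rank < num_ranks:
        raise ValueError("rank out of range")
    per_rank = total_cores // num_ranks
    start = rank * per_rank
    return list(range(start, start + per_rank))


def split_adam_cores(cores, adam_fraction=0.90):
    """Split a rank's cores into (Adam cores, remaining cores).

    Rounding down is an assumption; DeepSpeed may round differently.
    """
    n_adam = max(1, int(len(cores) * adam_fraction))
    return cores[:n_adam], cores[n_adam:]


# Example: 4 ranks on a node with 4 x 72 = 288 cores (72 per Grace CPU).
rank1_cores = cores_for_rank(total_cores=288, num_ranks=4, rank=1)
adam_cores, other_cores = split_adam_cores(rank1_cores)
```

Under these assumptions rank 1 owns cores 72 to 143, of which 64 go to Adam and 8 remain for data loading and system tasks.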

Code Evidence:

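Environment variable from `training/DeepSpeed-SuperOffload/finetune_zero3.py:27`, which turns off thread parallelism in the Hugging Face tokenizers library (presumably so tokenizer threads do not compete with the bound CPU cores):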

os.environ["TOKENIZERS_PARALLELISM"] = "false"

SuperOffload configuration from shell scripts:

"zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true,
        "ratio": 0.90,
        "super_offload": true,
        "cpuadam_cores_perc": 0.90
    }
}

Launch command pattern from `finetune_llama-8b_1gpu.sh`:

deepspeed --bind_cores_to_rank finetune_zero3.py \
    --model_name meta-llama/Llama-3.1-8B \
    --deepspeed_config ds_config.json
