Implementation:Microsoft DeepSpeedExamples Launch Scripts SuperOffload

Metadata

Field	Value
Page Type	Implementation
Title	Launch_Scripts_SuperOffload
Repository	Microsoft/DeepSpeedExamples
Type	External Tool Doc
Code Reference	File: `training/DeepSpeed-SuperOffload/finetune_*.sh`
Related Principle	Principle:Microsoft_DeepSpeedExamples_SuperOffload_Environment

Overview

Concrete tool for launching SuperOffload fine-tuning with model-specific configurations via shell scripts. Each script encapsulates the full DeepSpeed launch command, inline JSON config generation, and per-model hyperparameter settings.

Description

The SuperOffload launch scripts are Bash scripts located at training/DeepSpeed-SuperOffload/. Each script is tailored to a specific model and GPU count. They share a common structure:

1. Accept a mode argument (superoffload or zerooffload) and an optional batch size argument.
2. Define model-specific parameters (model name, output directory, learning rate, sequence length, etc.).
3. Generate a DeepSpeed JSON configuration file inline using a heredoc.
4. Invoke deepspeed finetune_zero3.py with all required arguments.

For multi-GPU scripts (2+ GPUs), the --bind_cores_to_rank flag is added to the DeepSpeed launcher to enable NUMA binding.

Available Scripts

Script	Model	GPUs	NUMA Binding
`finetune_llama-8b_1gpu.sh`	meta-llama/Llama-3.1-8B	1	No
`finetune_phi-4_1gpu.sh`	microsoft/phi-4	1	No
`finetune_qwen3-14b_1gpu.sh`	Qwen/Qwen3-14B	1	No
`finetune_gpt-oss-20b_1gpu.sh`	GPT-OSS-20B	1	No
`finetune_seed-oss-36b_2gpu.sh`	Seed-OSS-36B	2	Yes
`finetune_qwen3-30b-a3b_2gpu.sh`	Qwen/Qwen3-30B-A3B	2	Yes
`finetune_llama-70b_4gpu.sh`	meta-llama/Llama-3.3-70B-Instruct	4	Yes

Script Structure

Each launch script follows this common pattern:

#!/bin/bash
set -e

# MODE options: "superoffload" or "zerooffload" (required first argument)
MODE=$1
BATCH_SIZE=${2:-4}

SCRIPT_DIR=$(dirname "$0")
MODEL_NAME="meta-llama/Llama-3.1-8B"
OUTPUT_DIR="${SCRIPT_DIR}/llama-8b_${MODE}_output"
DS_CONFIG_JSON="${SCRIPT_DIR}/llama-8b_${MODE}_config.json"

# Quote the path so directories containing spaces are handled correctly
mkdir -p "$OUTPUT_DIR"

# Script argument parameters
ACTIVATION_CHECKPOINTING=true
SAVE_CHECKPOINT=false
MAX_LENGTH=4096
LOG_INTERVAL=1
DATASET_NAME="tatsu-lab/alpaca"
DATASET_PERCENTAGE=10.0
USE_WANDB=false
BENCH_STEPS=10
WARMUP_STEPS=20

EPOCHS=1
LR=1e-5
WARMUP=0.05
WEIGHT_DECAY=0.01
SEED=42
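The excerpt above does not show any validation of the mode argument, so an empty or misspelled MODE would silently end up in the output and config paths. A minimal sketch of a guard, assuming only the two documented modes are valid; validate_mode is a hypothetical helper name and is not taken from the scripts:

```bash
#!/bin/bash
# Hypothetical helper (not taken from the scripts): accept only the two
# modes the launch scripts document and reject anything else.
validate_mode() {
    case "$1" in
        superoffload|zerooffload)
            return 0
            ;;
        *)
            echo "Usage: $0 <superoffload|zerooffload> [batch_size]" >&2
            return 1
            ;;
    esac
}

# Example call with a known-good value; a real script would pass "$1"
# and stop on failure, e.g.: validate_mode "$1" || exit 1
validate_mode "superoffload" && echo "mode accepted"
```

Placing such a check directly after MODE=$1 would fail fast before any directory or config file is created.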

DeepSpeed JSON Configuration

The scripts generate two different JSON configs depending on the mode:

SuperOffload Mode

{
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true,
            "ratio": 0.90,
            "super_offload": true,
            "cpuadam_cores_perc": 0.90
        }
    },
    "wall_clock_breakdown": true
}

ZeRO-Offload Mode

{
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "wall_clock_breakdown": true
}

The two configs are otherwise identical. SuperOffload mode adds three keys to the offload_optimizer block: "super_offload": true, "ratio", and "cpuadam_cores_perc". The "ratio" value varies by model (see Model-Specific Offload Ratios).
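The inline generation step can be sketched with a heredoc that appends the SuperOffload-only keys when the mode calls for them. This is an illustration rather than the scripts' actual code: write_ds_config is a hypothetical helper name, and setting train_batch_size equal to the batch size argument is an assumption that matches the single-GPU example shown here.

```bash
#!/bin/bash
# Sketch: write a DeepSpeed config for the requested mode using a heredoc.
# write_ds_config is a hypothetical helper name, not taken from the scripts.
write_ds_config() {
    local mode="$1" batch_size="$2" out="$3"
    local extra=""
    if [ "$mode" = "superoffload" ]; then
        # Keys that appear only in the SuperOffload config.
        extra=',
            "ratio": 0.90,
            "super_offload": true,
            "cpuadam_cores_perc": 0.90'
    fi
    # Unquoted EOF so that ${batch_size} and ${extra} are expanded.
    cat > "$out" <<EOF
{
    "train_batch_size": ${batch_size},
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true${extra}
        }
    },
    "wall_clock_breakdown": true
}
EOF
}

# Example: write a SuperOffload config to a temporary file and print it.
DS_CONFIG_JSON=$(mktemp)
write_ds_config superoffload 4 "$DS_CONFIG_JSON"
cat "$DS_CONFIG_JSON"
```

Keeping both modes in one template means the shared ZeRO stage 3 settings cannot drift apart between the two configs.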

Launch Command

Single GPU (no NUMA binding)

deepspeed --num_gpus=1 finetune_zero3.py \
    --deepspeed_config=$DS_CONFIG_JSON \
    --model_name $MODEL_NAME \
    --num_train_epochs $EPOCHS \
    --lr $LR \
    --batch_size $BATCH_SIZE \
    --weight_decay $WEIGHT_DECAY \
    --output_dir $OUTPUT_DIR \
    --seed $SEED \
    --max_length $MAX_LENGTH \
    --log_interval $LOG_INTERVAL \
    --dataset_name $DATASET_NAME \
    --dataset_percentage $DATASET_PERCENTAGE \
    --bench_steps $BENCH_STEPS \
    --warmup_steps $WARMUP_STEPS \
    --activation_checkpointing

Multi-GPU (with NUMA binding)

deepspeed --num_gpus=4 --bind_cores_to_rank finetune_zero3.py \
    --deepspeed_config=$DS_CONFIG_JSON \
    --model_name $MODEL_NAME \
    --num_train_epochs $EPOCHS \
    --lr $LR \
    --batch_size $BATCH_SIZE \
    --weight_decay $WEIGHT_DECAY \
    --output_dir $OUTPUT_DIR \
    --seed $SEED \
    --max_length $MAX_LENGTH \
    --log_interval $LOG_INTERVAL \
    --dataset_name $DATASET_NAME \
    --dataset_percentage $DATASET_PERCENTAGE \
    --bench_steps $BENCH_STEPS \
    --warmup_steps $WARMUP_STEPS \
    --activation_checkpointing

The --bind_cores_to_rank flag enables NUMA binding: the launcher binds each rank to its own set of CPU cores, so the CPU-side work for a given GPU stays on nearby cores and memory, which is intended to improve memory bandwidth.
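Since the two launch commands differ only in the GPU count and the binding flag, the choice can be sketched as a conditional on the number of GPUs. The helper name build_launcher_args and the LAUNCHER_ARGS array are hypothetical and only illustrate the pattern; the real scripts hard-code the flag per script.

```bash
#!/bin/bash
# Sketch: assemble the launcher arguments in an array and add
# --bind_cores_to_rank only when more than one GPU is requested.
# build_launcher_args is a hypothetical helper, not part of the scripts.
build_launcher_args() {
    local num_gpus="$1"
    LAUNCHER_ARGS=(--num_gpus="${num_gpus}")
    if [ "${num_gpus}" -gt 1 ]; then
        LAUNCHER_ARGS+=(--bind_cores_to_rank)
    fi
}

build_launcher_args 4
# Print the command instead of running it, so the sketch is safe to try.
echo deepspeed "${LAUNCHER_ARGS[@]}" finetune_zero3.py
```

Using an array rather than a single string keeps each argument intact when the command is later expanded with "${LAUNCHER_ARGS[@]}".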

Default Hyperparameters

Parameter	Default Value	Description
`MODE`	(required)	`superoffload` or `zerooffload`
`BATCH_SIZE`	4	Training batch size per device
`EPOCHS`	1	Number of training epochs
`LR`	1e-5	Learning rate
`MAX_LENGTH`	4096	Maximum sequence length
`WEIGHT_DECAY`	0.01	Weight decay
`WARMUP`	0.05	Warmup ratio
`SEED`	42	Random seed
`DATASET_NAME`	tatsu-lab/alpaca	HuggingFace dataset
`DATASET_PERCENTAGE`	10.0	Percentage of dataset to use
`ACTIVATION_CHECKPOINTING`	true	Enable activation checkpointing
`BENCH_STEPS`	10	Number of benchmark steps
`WARMUP_STEPS`	20	Warmup steps for performance measurement

Usage Examples

# Fine-tune Llama 8B with SuperOffload on 1 GPU
bash finetune_llama-8b_1gpu.sh superoffload

# Fine-tune Llama 8B with ZeRO-Offload on 1 GPU
bash finetune_llama-8b_1gpu.sh zerooffload

# Fine-tune Llama 70B with SuperOffload on 4 GPUs, batch size 8
bash finetune_llama-70b_4gpu.sh superoffload 8

# Fine-tune Phi-4 with SuperOffload on 1 GPU
bash finetune_phi-4_1gpu.sh superoffload

Model-Specific Offload Ratios

Model	SuperOffload Ratio	Notes
Llama 8B	0.80	Smaller model allows lower offload ratio
Phi-4	0.90	Standard ratio for mid-size models
Qwen3-14B	0.90	Standard ratio
Llama 70B	0.90	Higher ratio needed for large models
Seed-OSS-36B	0.90	Standard ratio
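If a single wrapper were to drive several models, the table above could be encoded as a lookup. The following is a sketch only: the function name offload_ratio_for and the short model keys are hypothetical, and the values are copied from the table rather than read from the scripts.

```bash
#!/bin/bash
# Hypothetical lookup that mirrors the table above; the function name and
# the short model keys are illustrative, not taken from the scripts.
offload_ratio_for() {
    case "$1" in
        llama-8b)      echo "0.80" ;;
        phi-4)         echo "0.90" ;;
        qwen3-14b)     echo "0.90" ;;
        llama-70b)     echo "0.90" ;;
        seed-oss-36b)  echo "0.90" ;;
        *)
            # Models without a table entry are reported instead of guessed.
            echo "no ratio listed for model: $1" >&2
            return 1
            ;;
    esac
}

offload_ratio_for llama-8b
```

Returning a non-zero status for unlisted models avoids silently applying a ratio that the table does not document.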
