Implementation:Microsoft DeepSpeedExamples Launch Scripts SuperOffload

From Leeroopedia


Metadata

Field | Value
Page Type | Implementation
Title | Launch_Scripts_SuperOffload
Repository | Microsoft/DeepSpeedExamples
Type | External Tool Doc
Code Reference | File: training/DeepSpeed-SuperOffload/finetune_*.sh
Related Principle | Principle:Microsoft_DeepSpeedExamples_SuperOffload_Environment

Overview

Concrete tool for launching SuperOffload fine-tuning with model-specific configurations via shell scripts. Each script encapsulates the full DeepSpeed launch command, inline JSON config generation, and per-model hyperparameter settings.

Description

The SuperOffload launch scripts are Bash scripts located at training/DeepSpeed-SuperOffload/. Each script is tailored to a specific model and GPU count. They share a common structure:

  1. Accept a mode argument (superoffload or zerooffload) and an optional batch size argument.
  2. Define model-specific parameters (model name, output directory, learning rate, sequence length, etc.).
  3. Generate a DeepSpeed JSON configuration file inline using a heredoc.
  4. Invoke deepspeed finetune_zero3.py with all required arguments.

For multi-GPU scripts (2+ GPUs), the --bind_cores_to_rank flag is added to the DeepSpeed launcher to enable NUMA binding.

Available Scripts

Script | Model | GPUs | NUMA Binding
finetune_llama-8b_1gpu.sh | meta-llama/Llama-3.1-8B | 1 | No
finetune_phi-4_1gpu.sh | microsoft/phi-4 | 1 | No
finetune_qwen3-14b_1gpu.sh | Qwen/Qwen3-14B | 1 | No
finetune_gpt-oss-20b_1gpu.sh | GPT-OSS-20B | 1 | No
finetune_seed-oss-36b_2gpu.sh | Seed-OSS-36B | 2 | Yes
finetune_qwen3-30b-a3b_2gpu.sh | Qwen/Qwen3-30B-A3B | 2 | Yes
finetune_llama-70b_4gpu.sh | meta-llama/Llama-3.3-70B-Instruct | 4 | Yes

Script Structure

Each launch script follows this common pattern:

#!/bin/bash
set -e

# MODE options: "superoffload" or "zerooffload"
MODE=$1
BATCH_SIZE=${2:-4}

SCRIPT_DIR=$(dirname "$0")
MODEL_NAME="meta-llama/Llama-3.1-8B"
OUTPUT_DIR="${SCRIPT_DIR}/llama-8b_${MODE}_output"
DS_CONFIG_JSON="${SCRIPT_DIR}/llama-8b_${MODE}_config.json"

mkdir -p "$OUTPUT_DIR"

# Script argument parameters
ACTIVATION_CHECKPOINTING=true
SAVE_CHECKPOINT=false
MAX_LENGTH=4096
LOG_INTERVAL=1
DATASET_NAME="tatsu-lab/alpaca"
DATASET_PERCENTAGE=10.0
USE_WANDB=false
BENCH_STEPS=10
WARMUP_STEPS=20

EPOCHS=1
LR=1e-5
WARMUP=0.05
WEIGHT_DECAY=0.01
SEED=42
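
The pattern above reads MODE from the first positional argument without checking it. As a minimal sketch, an early check could reject anything other than the two documented modes. The validate_mode helper is hypothetical and is not part of the repository scripts.

```shell
#!/bin/bash
# Hypothetical helper (not in the original scripts): accept only the two
# documented modes and print a usage hint for anything else.
validate_mode() {
    case "$1" in
        superoffload|zerooffload)
            return 0
            ;;
        *)
            echo "Usage: $0 <superoffload|zerooffload> [batch_size]" >&2
            return 1
            ;;
    esac
}

# Demonstration with fixed candidates instead of real command-line arguments.
for candidate in superoffload zerooffload typo; do
    if validate_mode "$candidate" 2>/dev/null; then
        echo "${candidate}: ok"
    else
        echo "${candidate}: rejected"
    fi
done
```

In a real launch script the check would run on "$1" right after MODE is assigned, so a mistyped mode fails before any config file is written.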

DeepSpeed JSON Configuration

Each script generates one of two JSON configs, depending on the mode:

SuperOffload Mode

{
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true,
            "ratio": 0.90,
            "super_offload": true,
            "cpuadam_cores_perc": 0.90
        }
    },
    "wall_clock_breakdown": true
}

ZeRO-Offload Mode

{
    "train_batch_size": 4,
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "wall_clock_breakdown": true
}

The two configs are otherwise identical. SuperOffload mode adds three keys to the offload_optimizer block: "ratio", "super_offload": true, and "cpuadam_cores_perc".
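
The mode-dependent config generation described above could be written with a heredoc roughly as follows. This is a sketch under assumptions: write_ds_config is a hypothetical helper name, the field values are copied from the two configs shown on this page, and passing the batch size straight into "train_batch_size" is an assumption about how the real scripts fill that field.

```shell
#!/bin/bash
# Hypothetical helper: write a DeepSpeed config for one mode to a file.
# Arguments: mode, train batch size, output path.
write_ds_config() {
    local mode="$1" train_batch_size="$2" out="$3"
    local extra=""
    # SuperOffload mode adds three keys to the offload_optimizer block.
    if [ "$mode" = "superoffload" ]; then
        extra=',
            "ratio": 0.90,
            "super_offload": true,
            "cpuadam_cores_perc": 0.90'
    fi
    # Unquoted EOF so that ${train_batch_size} and ${extra} are expanded.
    cat > "$out" <<EOF
{
    "train_batch_size": ${train_batch_size},
    "gradient_accumulation_steps": 1,
    "bf16": { "enabled": true },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": false,
        "reduce_bucket_size": 4e8,
        "sub_group_size": 4e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true${extra}
        }
    },
    "wall_clock_breakdown": true
}
EOF
}

# Demonstration: write the SuperOffload variant to a temporary file and show it.
CONFIG_PATH="$(mktemp)"
write_ds_config superoffload 4 "$CONFIG_PATH"
cat "$CONFIG_PATH"
rm -f "$CONFIG_PATH"
```

Keeping the shared fields in a single heredoc means the two modes cannot drift apart by accident; only the three SuperOffload keys are conditional.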

Launch Command

Single GPU (no NUMA binding)

deepspeed --num_gpus=1 finetune_zero3.py \
    --deepspeed_config=$DS_CONFIG_JSON \
    --model_name $MODEL_NAME \
    --num_train_epochs $EPOCHS \
    --lr $LR \
    --batch_size $BATCH_SIZE \
    --weight_decay $WEIGHT_DECAY \
    --output_dir $OUTPUT_DIR \
    --seed $SEED \
    --max_length $MAX_LENGTH \
    --log_interval $LOG_INTERVAL \
    --dataset_name $DATASET_NAME \
    --dataset_percentage $DATASET_PERCENTAGE \
    --bench_steps $BENCH_STEPS \
    --warmup_steps $WARMUP_STEPS \
    --activation_checkpointing

Multi-GPU (with NUMA binding)

deepspeed --num_gpus=4 --bind_cores_to_rank finetune_zero3.py \
    --deepspeed_config=$DS_CONFIG_JSON \
    --model_name $MODEL_NAME \
    --num_train_epochs $EPOCHS \
    --lr $LR \
    --batch_size $BATCH_SIZE \
    --weight_decay $WEIGHT_DECAY \
    --output_dir $OUTPUT_DIR \
    --seed $SEED \
    --max_length $MAX_LENGTH \
    --log_interval $LOG_INTERVAL \
    --dataset_name $DATASET_NAME \
    --dataset_percentage $DATASET_PERCENTAGE \
    --bench_steps $BENCH_STEPS \
    --warmup_steps $WARMUP_STEPS \
    --activation_checkpointing

The --bind_cores_to_rank flag enables NUMA binding: the launcher binds each rank to its own set of CPU cores, so the CPU-side work for a GPU runs on cores and memory close to it, which improves memory bandwidth.
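
The only difference between the two launch commands is the launcher portion. A minimal sketch of choosing it from the GPU count is shown below; build_launcher_args is a hypothetical helper, and the rule "bind cores when more than one GPU is used" is taken from the table of available scripts rather than from the script source.

```shell
#!/bin/bash
# Hypothetical helper: print the DeepSpeed launcher arguments for a GPU count.
build_launcher_args() {
    local num_gpus="$1"
    local args=("--num_gpus=${num_gpus}")
    # The 2-GPU and 4-GPU scripts add NUMA binding; the 1-GPU scripts do not.
    if [ "$num_gpus" -gt 1 ]; then
        args+=("--bind_cores_to_rank")
    fi
    echo "${args[*]}"
}

# Dry run: print the start of each command instead of launching training.
echo "deepspeed $(build_launcher_args 1) finetune_zero3.py [training arguments]"
echo "deepspeed $(build_launcher_args 4) finetune_zero3.py [training arguments]"
```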

Default Hyperparameters

Parameter | Default Value | Description
MODE | (required) | superoffload or zerooffload
BATCH_SIZE | 4 | Training batch size per device
EPOCHS | 1 | Number of training epochs
LR | 1e-5 | Learning rate
MAX_LENGTH | 4096 | Maximum sequence length
WEIGHT_DECAY | 0.01 | Weight decay
WARMUP | 0.05 | Warmup ratio
SEED | 42 | Random seed
DATASET_NAME | tatsu-lab/alpaca | HuggingFace dataset
DATASET_PERCENTAGE | 10.0 | Percentage of dataset to use
ACTIVATION_CHECKPOINTING | true | Enable activation checkpointing
BENCH_STEPS | 10 | Number of benchmark steps
WARMUP_STEPS | 20 | Warmup steps for performance measurement

Usage Examples

# Fine-tune Llama 8B with SuperOffload on 1 GPU
bash finetune_llama-8b_1gpu.sh superoffload

# Fine-tune Llama 8B with ZeRO-Offload on 1 GPU
bash finetune_llama-8b_1gpu.sh zerooffload

# Fine-tune Llama 70B with SuperOffload on 4 GPUs, batch size 8
bash finetune_llama-70b_4gpu.sh superoffload 8

# Fine-tune Phi-4 with SuperOffload on 1 GPU
bash finetune_phi-4_1gpu.sh superoffload
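
Because every script accepts both modes, the same model can be run once per mode and the logs compared. The wrapper below is a sketch only: run_mode and the DRY_RUN variable are hypothetical and not part of the repository, and with DRY_RUN=1 (the default here) the commands are printed rather than executed.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of the repository): run or print one launch.
# DRY_RUN=1 only prints the command, so the sketch is safe without GPUs.
DRY_RUN="${DRY_RUN:-1}"

run_mode() {
    local script="$1" mode="$2" batch_size="${3:-4}"
    local cmd=(bash "$script" "$mode" "$batch_size")
    if [ "$DRY_RUN" = "1" ]; then
        echo "${cmd[*]}"
    else
        # Keep one log per mode so the two runs can be compared afterwards.
        "${cmd[@]}" 2>&1 | tee "${mode}.log"
    fi
}

# Same script and batch size for both modes, so only the offload mode differs.
for mode in superoffload zerooffload; do
    run_mode finetune_llama-8b_1gpu.sh "$mode" 4
done
```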

Model-Specific Offload Ratios

Model | SuperOffload Ratio | Notes
Llama 8B | 0.80 | Smaller model allows lower offload ratio
Phi-4 | 0.90 | Standard ratio for mid-size models
Qwen3-14B | 0.90 | Standard ratio
Llama 70B | 0.90 | Higher ratio needed for large models
Seed-OSS-36B | 0.90 | Standard ratio