Implementation:Microsoft DeepSpeedExamples Launch Scripts SuperOffload
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Launch_Scripts_SuperOffload |
| Repository | Microsoft/DeepSpeedExamples |
| Type | External Tool Doc |
| Code Reference | File: training/DeepSpeed-SuperOffload/finetune_*.sh |
| Related Principle | Principle:Microsoft_DeepSpeedExamples_SuperOffload_Environment |
Overview
Shell scripts that launch SuperOffload fine-tuning with model-specific configurations. Each script encapsulates the full DeepSpeed launch command, inline JSON config generation, and per-model hyperparameter settings.
Description
The SuperOffload launch scripts are Bash scripts located at training/DeepSpeed-SuperOffload/. Each script is tailored to a specific model and GPU count. They share a common structure:
- Accept a mode argument (superoffload or zerooffload) and an optional batch size argument.
- Define model-specific parameters (model name, output directory, learning rate, sequence length, etc.).
- Generate a DeepSpeed JSON configuration file inline using a heredoc.
- Invoke deepspeed finetune_zero3.py with all required arguments.
For multi-GPU scripts (2+ GPUs), the --bind_cores_to_rank flag is added to the DeepSpeed launcher to enable NUMA binding.
Available Scripts
| Script | Model | GPUs | NUMA Binding |
|---|---|---|---|
| finetune_llama-8b_1gpu.sh | meta-llama/Llama-3.1-8B | 1 | No |
| finetune_phi-4_1gpu.sh | microsoft/phi-4 | 1 | No |
| finetune_qwen3-14b_1gpu.sh | Qwen/Qwen3-14B | 1 | No |
| finetune_gpt-oss-20b_1gpu.sh | GPT-OSS-20B | 1 | No |
| finetune_seed-oss-36b_2gpu.sh | Seed-OSS-36B | 2 | Yes |
| finetune_qwen3-30b-a3b_2gpu.sh | Qwen/Qwen3-30B-A3B | 2 | Yes |
| finetune_llama-70b_4gpu.sh | meta-llama/Llama-3.3-70B-Instruct | 4 | Yes |
Script Structure
Each launch script follows this common pattern:
#!/bin/bash
set -e
# MODE=Options: "superoffload" or "zerooffload"
MODE=$1
BATCH_SIZE=${2:-4}
SCRIPT_DIR=$(dirname "$0")
MODEL_NAME="meta-llama/Llama-3.1-8B"
OUTPUT_DIR="${SCRIPT_DIR}/llama-8b_${MODE}_output"
DS_CONFIG_JSON="${SCRIPT_DIR}/llama-8b_${MODE}_config.json"
mkdir -p $OUTPUT_DIR
# Script argument parameters
ACTIVATION_CHECKPOINTING=true
SAVE_CHECKPOINT=false
MAX_LENGTH=4096
LOG_INTERVAL=1
DATASET_NAME="tatsu-lab/alpaca"
DATASET_PERCENTAGE=10.0
USE_WANDB=false
BENCH_STEPS=10
WARMUP_STEPS=20
EPOCHS=1
LR=1e-5
WARMUP=0.05
WEIGHT_DECAY=0.01
SEED=42
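The excerpt above reads MODE from the first argument without checking it. A minimal sketch of how the two positional arguments could be validated is shown below. The helper names validate_mode and resolve_batch_size are hypothetical and do not appear in the shipped scripts. Only the accepted mode values and the default batch size of 4 come from the script itself.

```shell
# Hypothetical helper (not part of the shipped scripts): accept only the two
# modes the launch scripts document, and report usage otherwise.
validate_mode() {
    case "$1" in
        superoffload|zerooffload)
            return 0
            ;;
        *)
            echo "usage: <script> <superoffload|zerooffload> [batch_size]" >&2
            return 1
            ;;
    esac
}

# Hypothetical helper: mirror BATCH_SIZE=${2:-4} by falling back to 4
# when no batch size is supplied.
resolve_batch_size() {
    echo "${1:-4}"
}

# Example: validate a mode and resolve the batch size for a run.
MODE="superoffload"
validate_mode "$MODE" && echo "mode ok: $MODE, batch size: $(resolve_batch_size)"
```

Failing early on an unknown mode avoids writing an output directory and config file named after a mistyped argument.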
DeepSpeed JSON Configuration
The scripts generate one of two JSON configs, depending on the mode:
SuperOffload Mode
{
"train_batch_size": 4,
"gradient_accumulation_steps": 1,
"bf16": { "enabled": true },
"zero_optimization": {
"stage": 3,
"overlap_comm": false,
"reduce_bucket_size": 4e8,
"sub_group_size": 4e8,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true,
"ratio": 0.90,
"super_offload": true,
"cpuadam_cores_perc": 0.90
}
},
"wall_clock_breakdown": true
}
ZeRO-Offload Mode
{
"train_batch_size": 4,
"gradient_accumulation_steps": 1,
"bf16": { "enabled": true },
"zero_optimization": {
"stage": 3,
"overlap_comm": false,
"reduce_bucket_size": 4e8,
"sub_group_size": 4e8,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"wall_clock_breakdown": true
}
The key difference is that SuperOffload mode adds "super_offload": true, "ratio", and "cpuadam_cores_perc" to the offload_optimizer block.
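The mode-dependent generation described above can be sketched with a heredoc as follows. This is an illustrative reconstruction, not the scripts' actual code: the function name write_ds_config is hypothetical, and substituting the batch size into train_batch_size is an assumption (the configs shown above simply contain the value 4). The JSON keys and values are copied from the two configs above.

```shell
# Hypothetical helper: write a DeepSpeed config for the given mode.
# Any mode other than "superoffload" yields the plain ZeRO-Offload config.
write_ds_config() {
    mode="$1"
    batch_size="$2"
    out="$3"

    # SuperOffload adds three keys to the offload_optimizer block.
    extra=""
    if [ "$mode" = "superoffload" ]; then
        extra=',
      "ratio": 0.90,
      "super_offload": true,
      "cpuadam_cores_perc": 0.90'
    fi

    # Assumption: the batch size argument is substituted into the config.
    cat > "$out" <<EOF
{
  "train_batch_size": ${batch_size},
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "reduce_bucket_size": 4e8,
    "sub_group_size": 4e8,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true${extra}
    }
  },
  "wall_clock_breakdown": true
}
EOF
}

# Example: write both variants into a temporary directory.
CONFIG_DIR=$(mktemp -d)
write_ds_config superoffload 4 "${CONFIG_DIR}/superoffload_config.json"
write_ds_config zerooffload 4 "${CONFIG_DIR}/zerooffload_config.json"
```

Comparing the two generated files with diff should show only the three SuperOffload keys, which makes it easy to confirm which mode a given run used.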
Launch Command
Single GPU (no NUMA binding)
deepspeed --num_gpus=1 finetune_zero3.py \
--deepspeed_config=$DS_CONFIG_JSON \
--model_name $MODEL_NAME \
--num_train_epochs $EPOCHS \
--lr $LR \
--batch_size $BATCH_SIZE \
--weight_decay $WEIGHT_DECAY \
--output_dir $OUTPUT_DIR \
--seed $SEED \
--max_length $MAX_LENGTH \
--log_interval $LOG_INTERVAL \
--dataset_name $DATASET_NAME \
--dataset_percentage $DATASET_PERCENTAGE \
--bench_steps $BENCH_STEPS \
--warmup_steps $WARMUP_STEPS \
--activation_checkpointing
Multi-GPU (with NUMA binding)
deepspeed --num_gpus=4 --bind_cores_to_rank finetune_zero3.py \
--deepspeed_config=$DS_CONFIG_JSON \
--model_name $MODEL_NAME \
--num_train_epochs $EPOCHS \
--lr $LR \
--batch_size $BATCH_SIZE \
--weight_decay $WEIGHT_DECAY \
--output_dir $OUTPUT_DIR \
--seed $SEED \
--max_length $MAX_LENGTH \
--log_interval $LOG_INTERVAL \
--dataset_name $DATASET_NAME \
--dataset_percentage $DATASET_PERCENTAGE \
--bench_steps $BENCH_STEPS \
--warmup_steps $WARMUP_STEPS \
--activation_checkpointing
The --bind_cores_to_rank flag enables NUMA binding: each rank is bound to its own set of CPU cores, so that each GPU's host-side work runs on nearby cores and memory instead of crossing NUMA nodes, which improves memory bandwidth.
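The only difference between the two launch commands is the launcher prefix. A small sketch of that rule, assuming the 2+ GPU threshold stated in the Description, is given below. The function launcher_prefix is hypothetical, and it only prints the prefix rather than running deepspeed, so it can be tried on a machine without GPUs.

```shell
# Hypothetical helper: print the launcher prefix for a given GPU count.
# The NUMA binding flag is added only for multi-GPU runs (2 or more GPUs).
launcher_prefix() {
    num_gpus="$1"
    if [ "$num_gpus" -ge 2 ]; then
        echo "deepspeed --num_gpus=${num_gpus} --bind_cores_to_rank finetune_zero3.py"
    else
        echo "deepspeed --num_gpus=${num_gpus} finetune_zero3.py"
    fi
}

# Example: dry-run the prefixes used by the 1-GPU and 4-GPU scripts.
launcher_prefix 1
launcher_prefix 4
```

The training arguments that follow the prefix (--deepspeed_config, --model_name, and so on) are identical in both cases.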
Default Hyperparameters
| Parameter | Default Value | Description |
|---|---|---|
| MODE | (required) | superoffload or zerooffload |
| BATCH_SIZE | 4 | Training batch size per device |
| EPOCHS | 1 | Number of training epochs |
| LR | 1e-5 | Learning rate |
| MAX_LENGTH | 4096 | Maximum sequence length |
| WEIGHT_DECAY | 0.01 | Weight decay |
| WARMUP | 0.05 | Warmup ratio |
| SEED | 42 | Random seed |
| DATASET_NAME | tatsu-lab/alpaca | HuggingFace dataset |
| DATASET_PERCENTAGE | 10.0 | Percentage of dataset to use |
| ACTIVATION_CHECKPOINTING | true | Enable activation checkpointing |
| BENCH_STEPS | 10 | Number of benchmark steps |
| WARMUP_STEPS | 20 | Warmup steps for performance measurement |
Usage Examples
# Fine-tune Llama 8B with SuperOffload on 1 GPU
bash finetune_llama-8b_1gpu.sh superoffload
# Fine-tune Llama 8B with ZeRO-Offload on 1 GPU
bash finetune_llama-8b_1gpu.sh zerooffload
# Fine-tune Llama 70B with SuperOffload on 4 GPUs, batch size 8
bash finetune_llama-70b_4gpu.sh superoffload 8
# Fine-tune Phi-4 with SuperOffload on 1 GPU
bash finetune_phi-4_1gpu.sh superoffload
Model-Specific Offload Ratios
| Model | SuperOffload Ratio | Notes |
|---|---|---|
| Llama 8B | 0.80 | Smaller model allows lower offload ratio |
| Phi-4 | 0.90 | Standard ratio for mid-size models |
| Qwen3-14B | 0.90 | Standard ratio |
| Llama 70B | 0.90 | Higher ratio needed for large models |
| Seed-OSS-36B | 0.90 | Standard ratio |
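For reference, the table can be expressed as a lookup. The sketch below is illustrative only: the function offload_ratio_for and its short model keys (taken from the script file names) are hypothetical, and in the shipped scripts the ratio is presumably set directly in each script's inline JSON config. The ratio values are the ones listed in the table.

```shell
# Hypothetical helper: return the SuperOffload ratio listed for a model.
# Keys are short names derived from the launch script file names.
offload_ratio_for() {
    case "$1" in
        llama-8b)
            echo "0.80"
            ;;
        phi-4|qwen3-14b|llama-70b|seed-oss-36b)
            echo "0.90"
            ;;
        *)
            echo "no ratio listed for model key: $1" >&2
            return 1
            ;;
    esac
}

# Example: look up the ratio for the Llama 8B script.
echo "llama-8b ratio: $(offload_ratio_for llama-8b)"
```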
Related Pages
- Principle:Microsoft_DeepSpeedExamples_SuperOffload_Environment
- Implementation:Microsoft_DeepSpeedExamples_Load_And_Preprocess_Dataset
- Implementation:Microsoft_DeepSpeedExamples_Load_Model_SuperOffload
- Environment:Microsoft_DeepSpeedExamples_SuperOffload_Runtime
- Heuristic:Microsoft_DeepSpeedExamples_SuperOffload_NUMA_Binding