Implementation:Microsoft DeepSpeedExamples DeepSpeed Initialize SuperOffload
Metadata
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | DeepSpeed_Initialize_SuperOffload |
| Repository | Microsoft/DeepSpeedExamples |
| Type | Wrapper Doc (wraps deepspeed.initialize) |
| Code Reference | File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 253-259 |
| Import | import deepspeed, from deepspeed.ops.adam import DeepSpeedCPUAdam |
| Related Principle | Principle:Microsoft_DeepSpeedExamples_ZeRO3_CPU_Offload_Training |
Overview
This page documents the concrete use of deepspeed.initialize() with ZeRO-3 CPU offloading and DeepSpeedCPUAdam for SuperOffload fine-tuning. It covers two pieces of the script: the creation of the CPU-optimized Adam optimizer and the standard DeepSpeed initialization call that consumes it.
Function: create_optimizer
Signature
def create_optimizer(model: AutoModelForCausalLM) -> Any:
Code Reference: File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 161-168
Description
Creates a DeepSpeedCPUAdam optimizer instance over all model parameters. This optimizer is designed for CPU-offloaded training: it runs the Adam update on the CPU using SIMD-vectorized kernels.
Implementation
def create_optimizer(model: AutoModelForCausalLM) -> Any:
    from deepspeed.ops.adam import DeepSpeedCPUAdam
    optimizer = DeepSpeedCPUAdam(
        model.parameters(),
        lr=DEFAULT_OPTIMIZER_LR,
        betas=DEFAULT_OPTIMIZER_BETAS
    )
    return optimizer
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| model | AutoModelForCausalLM | The loaded HuggingFace model whose parameters will be optimized |
Returns: DeepSpeedCPUAdam optimizer instance.
Constants used:
- DEFAULT_OPTIMIZER_LR = 0.001 (overridden by DeepSpeed config at runtime)
- DEFAULT_OPTIMIZER_BETAS = (0.9, 0.999)
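To make the update rule concrete, the following is a minimal scalar sketch of a single Adam step using the constants above. It is a plain-Python reference under stated assumptions (no weight decay, bias correction enabled, default eps of 1e-8); it is not the DeepSpeedCPUAdam kernel, and the helper name adam_step is hypothetical.

```python
import math


def adam_step(param, grad, m, v, step, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
    """One scalar Adam update with bias correction (assumes no weight decay)."""
    beta1, beta2 = betas
    m = beta1 * m + (1.0 - beta1) * grad         # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second moment (variance)
    m_hat = m / (1.0 - beta1 ** step)            # bias-corrected momentum
    v_hat = v / (1.0 - beta2 ** step)            # bias-corrected variance
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Illustrative values only: one parameter, one gradient, first step.
new_param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, step=1)
```

The two running values m and v are the per-parameter optimizer states that ZeRO-3 CPU offloading keeps in host memory.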
DeepSpeed Initialization Call
Code Reference
File: training/DeepSpeed-SuperOffload/finetune_zero3.py, Lines 253-259
Implementation
# Initialize DeepSpeed
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    training_data=tokenized_dataset,
    collate_fn=default_data_collator
)
I/O Contract
Inputs:
| Parameter | Type | Description |
|---|---|---|
| args | argparse.Namespace | Parsed command-line arguments (must include --deepspeed_config pointing to the JSON config file) |
| model | AutoModelForCausalLM | The loaded and configured model (with gradient checkpointing enabled) |
| optimizer | DeepSpeedCPUAdam | The CPU-optimized Adam optimizer |
| training_data | Dataset | The tokenized HuggingFace Dataset (used by DeepSpeed for distributed sampling) |
| collate_fn | Callable | Data collation function (default_data_collator from transformers) |
Outputs:
| Return Value | Type | Description |
|---|---|---|
| model_engine | DeepSpeedEngine | The wrapped model with distributed training capabilities |
| optimizer | DeepSpeedCPUAdam | The optimizer (potentially wrapped by DeepSpeed) |
| train_dataloader | DataLoader | The distributed-aware DataLoader (potentially modified by DeepSpeed for distributed sampling) |
| _ | LRScheduler | Learning rate scheduler (unused in this implementation, discarded) |
DeepSpeed JSON Config Structure
The deepspeed.initialize() call reads the configuration from the JSON file specified by --deepspeed_config. For SuperOffload, the config has this structure:
{
  "train_batch_size": 4,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "reduce_bucket_size": 4e8,
    "sub_group_size": 4e8,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "ratio": 0.90,
      "super_offload": true,
      "cpuadam_cores_perc": 0.90
    }
  },
  "wall_clock_breakdown": true
}
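The same structure can be generated from Python, which helps keep train_batch_size consistent with the launch setup. DeepSpeed expects train_batch_size to equal the per-GPU micro-batch size times gradient_accumulation_steps times the number of GPUs. The sketch below assumes a micro-batch of 4 on a single GPU, which reproduces the values above; those two numbers are assumptions, and build_superoffload_config is a hypothetical helper.

```python
import json


def build_superoffload_config(micro_batch_per_gpu, grad_accum_steps, world_size):
    """Return a config dict mirroring the JSON structure shown above."""
    return {
        # DeepSpeed requires: train_batch_size = micro_batch * grad_accum * world_size
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": False,
            "reduce_bucket_size": 4e8,
            "sub_group_size": 4e8,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
                "ratio": 0.90,
                "super_offload": True,
                "cpuadam_cores_perc": 0.90,
            },
        },
        "wall_clock_breakdown": True,
    }


# Assumed launch setup: micro-batch 4, no accumulation, one GPU.
config = build_superoffload_config(micro_batch_per_gpu=4, grad_accum_steps=1, world_size=1)
config_text = json.dumps(config, indent=2)  # text to write to the --deepspeed_config file
```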
What deepspeed.initialize() Does
The deepspeed.initialize() call performs the following operations:
- Distributed initialization -- Sets up the distributed process group (NCCL backend) if not already initialized.
- Model wrapping -- Wraps the model in a DeepSpeedEngine that intercepts forward, backward, and step calls.
- Parameter partitioning -- Partitions all model parameters across GPUs according to ZeRO Stage 3 rules. Each GPU retains only 1/N of each parameter tensor.
- Optimizer state offloading -- Moves optimizer states (momentum, variance for Adam) to CPU RAM with pinned memory for efficient transfers.
- DataLoader creation -- Creates a distributed-aware DataLoader with the DistributedSampler to ensure each GPU processes different data.
- Communication setup -- Configures reduce buckets and sub-groups for efficient gradient reduction.
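The 1/N partitioning and the offloaded optimizer states can be approximated with simple arithmetic. The sketch below assumes ceiling division for the per-GPU shard and fp32 (4-byte) momentum and variance values. DeepSpeed's actual padding, alignment, and any additional master-weight copies are not modeled, and both helper names are hypothetical.

```python
def shard_numel(numel, world_size):
    """Elements of one parameter tensor held per GPU (ceiling division, padding assumed)."""
    return -(-numel // world_size)


def adam_state_bytes(num_params, bytes_per_value=4):
    """Host memory for Adam momentum and variance, assuming fp32 states."""
    states_per_param = 2  # momentum and variance
    return states_per_param * bytes_per_value * num_params


# Illustrative numbers only: a 1B-parameter model split across 4 GPUs.
per_gpu_elements = shard_numel(1_000_000_000, 4)
total_state_gib = adam_state_bytes(1_000_000_000) / 2**30
```

Estimates of this kind are useful for checking that the host has enough RAM before enabling offload_optimizer with device set to cpu.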
Full Initialization Sequence in main()
The complete initialization sequence as it appears in the main() function (Lines 239-263):
# Step 1: Load tokenizer and model
tokenizer = load_tokenizer(args.model_name, logger)
model = load_model(args.model_name, args.attn_implementation, logger)
# Step 2: Optional MoE leaf module configuration
if args.leaf_module:
    from deepspeed.utils import set_z3_leaf_modules
    logger.debug(f"Setting leaf_module to: {args.leaf_module}")
    set_z3_leaf_modules(model, [args.leaf_module])
# Step 3: Configure model for training
setup_model_training(model, args.activation_checkpointing, logger)
# Step 4: Create CPU-optimized optimizer
optimizer = create_optimizer(model)
# Step 5: Load and preprocess dataset
tokenized_dataset, train_dataloader = load_and_preprocess_dataset(
    args.dataset_name, args.dataset_percentage, tokenizer, args.max_length, logger
)
# Step 6: Initialize DeepSpeed engine
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    training_data=tokenized_dataset,
    collate_fn=default_data_collator
)
# Step 7: Re-initialize logger with distributed rank
logger = setup_logger(rank=dist.get_rank(), log_level=args.log_level)
Usage Example
import argparse
import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam
from transformers import AutoModelForCausalLM, default_data_collator
# Assume model and tokenized_dataset are already loaded
optimizer = DeepSpeedCPUAdam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
# args must contain --deepspeed_config pointing to the JSON config
model_engine, optimizer, train_dataloader, _ = deepspeed.initialize(
    args=args,
    model=model,
    optimizer=optimizer,
    training_data=tokenized_dataset,
    collate_fn=default_data_collator
)
# model_engine is now ready for training
model_engine.train()
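Once the engine is returned, a training pass typically goes through the engine's own backward() and step() methods rather than calling the optimizer directly. The following is a minimal sketch, not the script's actual loop (see the Main_Training_Loop_SuperOffload page for that). It assumes each collated batch is a dict of tensors that includes labels, so the model output exposes a loss attribute. The function name train_one_epoch is hypothetical.

```python
def train_one_epoch(model_engine, train_dataloader):
    """Run one pass over the dataloader using the DeepSpeed engine interface."""
    model_engine.train()
    losses = []
    for batch in train_dataloader:
        # Move every tensor in the collated batch to the engine's device.
        batch = {name: tensor.to(model_engine.device) for name, tensor in batch.items()}
        outputs = model_engine(**batch)  # forward pass through the wrapped model
        loss = outputs.loss              # assumes labels were part of the batch
        model_engine.backward(loss)      # backward pass managed by the engine
        model_engine.step()              # optimizer update managed by the engine
        losses.append(float(loss))
    return losses
```

The function takes the engine and dataloader as arguments so it can be called directly with the model_engine and train_dataloader values returned by deepspeed.initialize().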
Related Pages
- Principle:Microsoft_DeepSpeedExamples_ZeRO3_CPU_Offload_Training
- Implementation:Microsoft_DeepSpeedExamples_Load_Model_SuperOffload
- Implementation:Microsoft_DeepSpeedExamples_Main_Training_Loop_SuperOffload
- Environment:Microsoft_DeepSpeedExamples_SuperOffload_Runtime
- Heuristic:Microsoft_DeepSpeedExamples_SuperOffload_NUMA_Binding