Workflow:Microsoft DeepSpeedExamples SuperOffload Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Memory_Optimization, Distributed_Training |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
End-to-end process for fine-tuning large language models (8B to 70B+ parameters) on limited GPU hardware using DeepSpeed ZeRO Stage 3 with SuperOffload CPU offloading optimization.
Description
This workflow enables fine-tuning of large language models that would not normally fit on available GPU hardware. It uses DeepSpeed ZeRO Stage 3 to partition model states (parameters, gradients, optimizer states) across GPUs and CPU memory, with SuperOffload providing optimized CPU offloading for NVIDIA Superchip architectures (GH200/GB200) and AMD MI300A.
Goal: A fine-tuned language model adapted to a specific domain or instruction-following task, trained from a pretrained base model on custom data.
Scope: Covers model loading, dataset preparation (Alpaca format), DeepSpeed ZeRO-3 configuration with CPU offloading, training loop execution, and checkpoint saving.
Strategy: Uses ZeRO Stage 3 to partition all model states across available devices, with SuperOffload optimizing CPU-GPU data transfer. Combines gradient checkpointing for activation memory reduction with DeepSpeedCPUAdam for efficient CPU-based optimizer computation.
Usage
Execute this workflow when you need to fine-tune a large language model (8B to 70B+ parameters) but have limited GPU memory. It is designed for setups with 1-4 GPUs where the model states would not fit in GPU memory without partitioning and offloading. It is particularly optimized for NVIDIA GH200/GB200 Superchips with unified CPU-GPU memory, but it works on any hardware with sufficient combined CPU and GPU memory.
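A rough estimate of model-state memory shows why models in this size range may not fit on a single GPU. The sketch below assumes the commonly cited figure of 16 bytes per parameter for mixed-precision Adam training (2 for 16-bit weights, 2 for 16-bit gradients, 12 for fp32 master weights, momentum, and variance). It ignores activations and temporary buffers, and the function name is illustrative rather than part of any library.

```python
# Assumption: mixed-precision Adam keeps roughly 16 bytes of state per parameter.
BYTES_PER_PARAM = 2 + 2 + 12  # 16-bit weights + 16-bit grads + fp32 optimizer states


def estimate_model_state_gb(num_params: int) -> float:
    """Return approximate model-state memory in GB (10^9 bytes), excluding activations."""
    return num_params * BYTES_PER_PARAM / 1e9


for name, n in [("8B", 8_000_000_000), ("70B", 70_000_000_000)]:
    print(f"{name}: ~{estimate_model_state_gb(n):.0f} GB of model states")
```

Under these assumptions an 8B model needs about 128 GB for model states alone, and a 70B model about 1120 GB, which is why the workflow partitions states with ZeRO Stage 3 and moves part of them to CPU memory.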
Execution Steps
Step 1: Environment Setup
Configure the training environment with DeepSpeed, appropriate CUDA settings, and NUMA bindings for optimal CPU offloading performance.
Key considerations:
- Install DeepSpeed with CPU Adam support
- For GH200/GB200 systems, configure NUMA bindings for optimal memory access patterns
- Set CUDA memory allocator options (for example through the PYTORCH_CUDA_ALLOC_CONF environment variable) to reduce fragmentation
- Configure WandB (Weights & Biases) for experiment tracking if desired
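The environment settings above could be collected in a small helper before launching training. This is a minimal sketch: `build_training_env` and the default project name are hypothetical, and the specific allocator option (`expandable_segments:True`) is one plausible choice rather than the value the workflow necessarily uses. NUMA binding is left to the launcher and is not shown here.

```python
import os


def build_training_env(base_env=None, use_wandb=False,
                       wandb_project="superoffload-finetune"):
    """Return a copy of the environment with training-related defaults applied.

    The project name is a hypothetical placeholder.
    """
    env = dict(os.environ if base_env is None else base_env)
    # PyTorch allocator option intended to reduce fragmentation; kept if already set.
    env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    if use_wandb:
        env.setdefault("WANDB_PROJECT", wandb_project)
    else:
        # Turns Weights & Biases logging off when tracking is not wanted.
        env["WANDB_MODE"] = "disabled"
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the training script, which keeps the settings out of the parent shell.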
Step 2: Dataset Preparation
Load and preprocess the training dataset into the instruction-following format expected by the model.
What happens:
- Load the Alpaca instruction-following dataset from HuggingFace
- Format examples into instruction/input/output structure
- Tokenize with a maximum sequence length of 2048 tokens
- Create a DataLoader with distributed sampling for multi-GPU training
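The formatting step can be sketched as follows. The templates are the widely used Stanford Alpaca prompt strings; whether this workflow uses exactly these strings is an assumption, and `format_alpaca_example` is an illustrative name. Tokenization and the distributed DataLoader are omitted.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def format_alpaca_example(example: dict) -> dict:
    """Build the prompt and the full training text for one Alpaca-style record."""
    extra_input = (example.get("input") or "").strip()
    template = PROMPT_WITH_INPUT if extra_input else PROMPT_NO_INPUT
    prompt = template.format(instruction=example["instruction"], input=extra_input)
    # The full text (prompt followed by the reference output) is what gets tokenized.
    return {"prompt": prompt, "text": prompt + example["output"]}
```

Returning the prompt separately from the full text makes it possible to mask prompt tokens out of the loss later, if that behavior is wanted.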
Step 3: Model Loading
Load the pretrained model with ZeRO Stage 3 configuration for memory-efficient initialization.
What happens:
- Load tokenizer from HuggingFace model hub
- Load the pretrained model (e.g., LLaMA-3.1-8B, Phi-4, Qwen3-14B) in the dtype used for training
- Enable gradient checkpointing to reduce activation memory by recomputing activations during the backward pass
- Configure DeepSpeedCPUAdam optimizer for efficient CPU-based parameter updates
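A minimal sketch of these steps, assuming the Hugging Face `transformers` API and DeepSpeed's `DeepSpeedCPUAdam`, might look like this. The function name, the learning rate default, and the choice of bf16 are assumptions for illustration. The heavy imports are deferred into the function so that the definition itself does not require the libraries to be installed.

```python
def load_model_and_optimizer(model_name, lr=1e-5):
    """Load tokenizer, model, and a CPU Adam optimizer (illustrative defaults)."""
    # Imported lazily: these packages are only needed when the function is called.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        # Many causal LM tokenizers ship without a pad token; reuse EOS for padding.
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    # Recompute activations in the backward pass instead of storing them.
    model.gradient_checkpointing_enable()
    # The generation KV cache is not useful during training with checkpointing.
    model.config.use_cache = False

    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=lr)
    return tokenizer, model, optimizer
```

For very large models, loading the full weights on every rank before partitioning may itself exhaust memory; that case needs a ZeRO-3-aware initialization path, which this sketch does not cover.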
Step 4: DeepSpeed Initialization
Initialize the DeepSpeed engine with ZeRO Stage 3 and SuperOffload configuration.
What happens:
- Configure ZeRO Stage 3 with parameter, gradient, and optimizer state partitioning
- Enable CPU offloading for optimizer states (90% offload ratio) and parameters
- Set pin_memory for faster CPU-GPU transfers
- Initialize DeepSpeed engine wrapping the model, optimizer, and data loader
- Configure mixed precision training (bf16)
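The configuration described above can be expressed as a Python dictionary. This is a sketch under assumptions: the `super_offload` key and the placement of `ratio` under `offload_optimizer` follow the author's understanding of DeepSpeed's configuration schema and should be checked against the installed DeepSpeed version. The batch-size defaults and the function name are hypothetical.

```python
import json


def build_zero3_superoffload_config(micro_batch_size=1, grad_accum_steps=1,
                                    offload_ratio=0.9):
    """Return a DeepSpeed-style config dict for ZeRO-3 with CPU offloading."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
                "ratio": offload_ratio,   # fraction of optimizer states kept on CPU
                "super_offload": True,    # assumed key name; verify before use
            },
            "offload_param": {"device": "cpu", "pin_memory": True},
        },
    }


def dump_config(path, config):
    """Write the config as JSON so it can be passed to a launcher by file path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
```

A dictionary like this is typically passed as the `config` argument of `deepspeed.initialize`, together with the model and optimizer, which returns the engine used in the training loop.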
Step 5: Training Loop
Execute the fine-tuning training loop with loss computation and gradient updates.
What happens:
- Iterate over the dataset for the configured number of epochs
- Compute causal language modeling loss on instruction-response sequences
- DeepSpeed handles gradient accumulation, communication, and optimizer steps automatically
- Log training metrics (loss, throughput in TFLOPS) at regular intervals
- Optionally sync metrics with WandB for experiment tracking
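The loop above can be sketched against the DeepSpeed engine interface, where `engine(**batch)` runs the forward pass, `engine.backward(loss)` runs the backward pass, and `engine.step()` applies the optimizer update at accumulation boundaries. The helper names are illustrative. The throughput estimate assumes roughly 8 FLOPs per parameter per token (6 for forward and backward, plus about 2 for the forward recomputation caused by gradient checkpointing) and ignores attention cost, so it is only an approximation.

```python
import time


def estimate_tflops(num_params, tokens_per_step, step_seconds,
                    flops_per_param_token=8):
    """Approximate achieved TFLOPS for one training step (rough rule of thumb)."""
    return flops_per_param_token * num_params * tokens_per_step / step_seconds / 1e12


def train_one_epoch(engine, batches, log_every=10, log_fn=print):
    """Run one pass over `batches` using a DeepSpeed-style engine; return the losses."""
    losses = []
    start = time.perf_counter()
    for step, batch in enumerate(batches, start=1):
        outputs = engine(**batch)   # batch is expected to include labels
        loss = outputs.loss         # causal language modeling loss
        engine.backward(loss)       # engine handles scaling and communication
        engine.step()               # optimizer step at accumulation boundaries
        losses.append(float(loss))
        if step % log_every == 0:
            elapsed = time.perf_counter() - start
            log_fn(f"step {step}: loss={losses[-1]:.4f} ({elapsed / step:.3f} s/step)")
    return losses
```

Passing `log_fn` as a parameter keeps the loop independent of the logging backend, so the same function could forward messages to WandB or to a plain logger.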
Step 6: Checkpoint Saving
Save the fine-tuned model checkpoint for later inference or further training.
What happens:
- Use DeepSpeed's checkpoint saving, which handles ZeRO-3 state consolidation
- Save model weights, optimizer states, and training configuration
- Support resuming training from saved checkpoints
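Saving and resuming can be sketched with the engine's `save_checkpoint` and `load_checkpoint` methods. The wrapper names, the `step_<n>` tag format, and the contents of `client_state` are assumptions made for illustration.

```python
def save_training_state(engine, save_dir, step):
    """Save a DeepSpeed checkpoint tagged with the current step; return the tag."""
    tag = f"step_{step}"  # hypothetical naming scheme
    # Must be called on every rank, because each rank writes its own ZeRO partition.
    engine.save_checkpoint(save_dir, tag=tag, client_state={"step": step})
    return tag


def resume_training_state(engine, load_dir, tag=None):
    """Load a checkpoint if one exists and return the step to resume from."""
    # With tag=None, DeepSpeed looks up the most recently saved tag in load_dir.
    load_path, client_state = engine.load_checkpoint(load_dir, tag=tag)
    if load_path is None:
        return 0  # nothing to resume from; start at the beginning
    return (client_state or {}).get("step", 0)
```

Storing the step counter in `client_state` lets a resumed run continue its logging and learning-rate schedule from the right point without a separate metadata file.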