Workflow:Microsoft DeepSpeedExamples SuperOffload Finetuning

Knowledge Sources	DeepSpeedExamples DeepSpeed Docs
Domains	LLMs, Fine_Tuning, Memory_Optimization, Distributed_Training
Last Updated	2026-02-07 13:00 GMT

Overview

End-to-end process for fine-tuning large language models (8B to 70B+ parameters) on limited GPU hardware, using DeepSpeed ZeRO Stage 3 with the SuperOffload CPU-offloading optimization.

Description

This workflow enables fine-tuning of large language models that would not normally fit on available GPU hardware. It uses DeepSpeed ZeRO Stage 3 to partition model states (parameters, gradients, optimizer states) across GPUs and CPU memory, with SuperOffload providing optimized CPU offloading for NVIDIA Superchip architectures (GH200/GB200) and AMD MI300A.

Goal: A fine-tuned language model adapted to a specific domain or instruction-following task, trained from a pretrained base model on custom data.

Scope: Covers model loading, dataset preparation (Alpaca format), DeepSpeed ZeRO-3 configuration with CPU offloading, training loop execution, and checkpoint saving.

Strategy: Uses ZeRO Stage 3 to partition all model states across available devices, with SuperOffload optimizing CPU-GPU data transfer. Combines gradient checkpointing for activation memory reduction with DeepSpeedCPUAdam for efficient CPU-based optimizer computation.
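
To make the memory pressure concrete, the sketch below applies the usual mixed-precision Adam accounting from the ZeRO paper: 2 bytes per parameter for 16-bit weights, 2 for gradients, and 12 for the fp32 master weights plus the two Adam moments, 16 bytes in total. The function name and the even split across devices are illustrative assumptions; activations, temporary buffers, and fragmentation are ignored.

```python
def model_state_gib(num_params: float, num_devices: int = 1) -> float:
    """Approximate per-device model-state memory in GiB under an even ZeRO-3 split."""
    # 2 (bf16 weights) + 2 (bf16 gradients) + 12 (fp32 master weights, momentum, variance)
    bytes_per_param = 2 + 2 + 12
    return num_params * bytes_per_param / num_devices / 2**30


if __name__ == "__main__":
    for name, n in [("8B", 8e9), ("70B", 70e9)]:
        total = model_state_gib(n)
        per_device = model_state_gib(n, num_devices=4)
        print(f"{name}: {total:.0f} GiB total, {per_device:.0f} GiB per device on 4 devices")
```

Under this accounting an 8B model already needs about 119 GiB for model states alone, which is why the optimizer states are the first thing moved to CPU memory.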

Usage

Use this workflow when you need to fine-tune a large language model (8B to 70B+ parameters) but have limited GPU memory. It is designed for setups with 1-4 GPUs where the model would not fit using standard approaches. It is particularly optimized for NVIDIA GH200/GB200 Superchips with unified CPU-GPU memory, but it works on any hardware with sufficient combined CPU and GPU memory.

Execution Steps

Step 1: Environment Setup

Configure the training environment with DeepSpeed, appropriate CUDA settings, and NUMA bindings for optimal CPU offloading performance.

Key considerations:

Install DeepSpeed with CPU Adam support
For GH200/GB200 systems, configure NUMA bindings for optimal memory access patterns
Set appropriate CUDA memory allocation settings to prevent fragmentation
Configure WandB (Weights & Biases) for experiment tracking if desired
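
The considerations above can be sketched as a small setup helper. The specific values are assumptions, not settings taken from the repository: `expandable_segments:True` is one PyTorch allocator option commonly used against fragmentation, `--bind_cores_to_rank` is the DeepSpeed launcher flag for core/NUMA binding, and `finetune.py` in the usage note is a hypothetical script name. CPU Adam can be pre-built at install time with `DS_BUILD_CPU_ADAM=1 pip install deepspeed`.

```python
import os
from typing import Dict, List


def configure_environment(use_wandb: bool = False) -> Dict[str, str]:
    """Set process-level options; call this before torch initializes CUDA."""
    settings = {
        # Assumed allocator setting to reduce CUDA memory fragmentation.
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
        # WandB stays off unless experiment tracking is requested.
        "WANDB_MODE": "online" if use_wandb else "disabled",
    }
    os.environ.update(settings)
    return settings


def launch_command(script: str, num_gpus: int) -> List[str]:
    """Build a DeepSpeed launcher command that binds CPU cores to each rank."""
    return ["deepspeed", f"--num_gpus={num_gpus}", "--bind_cores_to_rank", script]
```

For example, `launch_command("finetune.py", 2)` yields the argument list for a two-GPU run that can be passed to `subprocess.run`.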

Step 2: Dataset Preparation

Load and preprocess the training dataset into the instruction-following format expected by the model.

What happens:

Load the Alpaca instruction-following dataset from HuggingFace
Format examples into instruction/input/output structure
Tokenize with a maximum sequence length of 2048 tokens
Create a DataLoader with distributed sampling for multi-GPU training
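
The formatting step can be sketched as follows. The templates are the widely used Stanford Alpaca prompts; whether the repository uses exactly this wording is an assumption, and `format_example` is a hypothetical helper name.

```python
from typing import Dict

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def format_example(example: Dict[str, str]) -> Dict[str, str]:
    """Turn one Alpaca record into a prompt and the full training text."""
    context = example.get("input", "")
    # Records without an input field use the shorter template.
    template = PROMPT_WITH_INPUT if context.strip() else PROMPT_NO_INPUT
    prompt = template.format(instruction=example["instruction"], input=context)
    return {"prompt": prompt, "text": prompt + example["output"]}
```

The resulting `text` field would then be tokenized, for instance with a HuggingFace tokenizer call such as `tokenizer(text, max_length=2048, truncation=True)`.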

Step 3: Model Loading

Load the pretrained model with ZeRO Stage 3 configuration for memory-efficient initialization.

What happens:

Load tokenizer from HuggingFace model hub
Load the pretrained model (e.g., LLaMA-3.1-8B, Phi-4, Qwen3-14B) in the training dtype (bf16, matching the mixed-precision setting in Step 4)
Enable gradient checkpointing to reduce activation memory by recomputing activations during backward pass
Configure DeepSpeedCPUAdam optimizer for efficient CPU-based parameter updates
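
A minimal sketch of this step, assuming the `transformers` and `deepspeed` packages are installed. The function name, the learning rate, and the pad-token fallback are illustrative choices rather than the repository's settings; imports are deferred so the module can be imported on a machine without the GPU stack.

```python
def load_model_and_optimizer(model_name: str, lr: float = 1e-5):
    """Load tokenizer and bf16 model, enable checkpointing, build a CPU Adam optimizer."""
    import torch
    from deepspeed.ops.adam import DeepSpeedCPUAdam
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        # Many causal LMs ship without a pad token; reuse EOS for padding.
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    # Trade compute for memory: activations are recomputed in the backward pass.
    model.gradient_checkpointing_enable()
    # The generation KV cache is not needed for training.
    model.config.use_cache = False

    # Optimizer whose update step runs on the CPU, where offloaded states live.
    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=lr)
    return tokenizer, model, optimizer
```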

Step 4: DeepSpeed Initialization

Initialize the DeepSpeed engine with ZeRO Stage 3 and SuperOffload configuration.

What happens:

Configure ZeRO Stage 3 with parameter, gradient, and optimizer state partitioning
Enable CPU offloading for optimizer states (90% offload ratio) and parameters
Set pin_memory for faster CPU-GPU transfers
Initialize DeepSpeed engine wrapping the model, optimizer, and data loader
Configure mixed precision training (bf16)
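
The settings listed above map onto a DeepSpeed configuration roughly as sketched below. The batch-size values are placeholders, and the `super_offload` key reflects how SuperOffload is switched on as far as this sketch assumes; check it against the DeepSpeed version in use.

```python
import json


def build_ds_config(micro_batch_size: int = 1, grad_accum_steps: int = 1) -> dict:
    """Assemble a ZeRO-3 config with CPU offloading and bf16 mixed precision."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,  # placeholder value
        "gradient_accumulation_steps": grad_accum_steps,     # placeholder value
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,     # page-locked host memory for faster transfers
                "ratio": 0.90,          # 90% offload ratio for optimizer states
                "super_offload": True,  # assumed key name for enabling SuperOffload
            },
            "offload_param": {"device": "cpu", "pin_memory": True},
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_ds_config(), indent=2))

# With DeepSpeed installed, the engine would then be created roughly as:
#   import deepspeed
#   engine, optimizer, dataloader, _ = deepspeed.initialize(
#       model=model, optimizer=optimizer, training_data=dataset, config=build_ds_config())
```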

Step 5: Training Loop

Execute the fine-tuning training loop with loss computation and gradient updates.

What happens:

Iterate over the dataset for the configured number of epochs
Compute causal language modeling loss on instruction-response sequences
DeepSpeed handles gradient accumulation, communication, and optimizer steps automatically
Log training metrics (loss, throughput in TFLOPS) at regular intervals
Optionally sync metrics with WandB for experiment tracking
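
The loop can be sketched against the DeepSpeed engine interface (`engine(...)`, `engine.backward`, `engine.step`). The sketch assumes batches are dictionaries already placed on the engine's device and that the model output exposes a `.loss` attribute, as HuggingFace causal LM outputs do. The throughput helper uses the common approximation of 6 FLOPs per parameter per token (8 with full activation recomputation); how the repository computes its TFLOPS figure is not stated here, so treat the formula as an assumption.

```python
from typing import Any, Dict, Iterable, List


def estimate_tflops(num_params: float, tokens: int, seconds: float,
                    recompute: bool = True) -> float:
    """Approximate achieved TFLOPS for one step, ignoring attention FLOPs."""
    flops_per_param_token = 8 if recompute else 6
    return flops_per_param_token * num_params * tokens / seconds / 1e12


def train(engine: Any, dataloader: Iterable[Dict[str, Any]],
          num_epochs: int = 1, log_every: int = 10) -> List[float]:
    """Run the fine-tuning loop and return the per-step loss values."""
    losses: List[float] = []
    step = 0
    for epoch in range(num_epochs):
        for batch in dataloader:
            loss = engine(**batch).loss  # causal LM loss computed by the model
            engine.backward(loss)        # DeepSpeed handles gradient reduction
            engine.step()                # optimizer step at accumulation boundaries
            step += 1
            losses.append(float(loss))
            if step % log_every == 0:
                print(f"epoch {epoch} step {step} loss {losses[-1]:.4f}")
    return losses
```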

Step 6: Checkpoint Saving

Save the fine-tuned model checkpoint for later inference or further training.

What happens:

Use DeepSpeed checkpoint saving, which handles ZeRO-3 state consolidation
Save model weights, optimizer states, and training configuration
Support resuming training from saved checkpoints
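
Saving and resuming can be sketched with the engine's `save_checkpoint` and `load_checkpoint` methods. The tag format and the contents of `client_state` are illustrative assumptions; under ZeRO-3 every rank must call `save_checkpoint`, not only rank 0.

```python
from typing import Any, Dict, Optional


def save_checkpoint(engine: Any, save_dir: str, epoch: int, step: int) -> str:
    """Save engine state plus training progress under a descriptive tag."""
    tag = f"epoch{epoch}_step{step}"  # hypothetical naming scheme
    engine.save_checkpoint(save_dir, tag=tag,
                           client_state={"epoch": epoch, "step": step})
    return tag


def resume(engine: Any, load_dir: str, tag: Optional[str] = None) -> Dict[str, int]:
    """Restore engine state and return the saved progress, or zeros if none exists."""
    load_path, client_state = engine.load_checkpoint(load_dir, tag=tag)
    if load_path is None:
        # Nothing was loaded: start from scratch.
        return {"epoch": 0, "step": 0}
    return {"epoch": client_state.get("epoch", 0),
            "step": client_state.get("step", 0)}
```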
