Workflow:Microsoft DeepSpeedExamples SuperOffload Finetuning

From Leeroopedia


Knowledge Sources
Domains: LLMs, Fine_Tuning, Memory_Optimization, Distributed_Training
Last Updated: 2026-02-07 13:00 GMT

Overview

End-to-end process for fine-tuning large language models (8B to 70B+ parameters) on limited GPU hardware using DeepSpeed ZeRO Stage 3 with SuperOffload CPU offloading optimization.

Description

This workflow enables fine-tuning of large language models that would not normally fit on available GPU hardware. It uses DeepSpeed ZeRO Stage 3 to partition model states (parameters, gradients, optimizer states) across GPUs and CPU memory, with SuperOffload providing optimized CPU offloading for NVIDIA Superchip architectures (GH200/GB200) and AMD MI300A.

Goal: A fine-tuned language model adapted to a specific domain or instruction-following task, trained from a pretrained base model on custom data.

Scope: Covers model loading, dataset preparation (Alpaca format), DeepSpeed ZeRO-3 configuration with CPU offloading, training loop execution, and checkpoint saving.

Strategy: Uses ZeRO Stage 3 to partition all model states across available devices, with SuperOffload optimizing CPU-GPU data transfer. Combines gradient checkpointing for activation memory reduction with DeepSpeedCPUAdam for efficient CPU-based optimizer computation.

Usage

Use this workflow when you need to fine-tune a large language model (8B to 70B+ parameters) but have limited GPU memory. It is designed for setups with 1 to 4 GPUs where the model would not fit in GPU memory with standard, non-offloaded training. It is particularly optimized for NVIDIA GH200/GB200 Superchips with unified CPU-GPU memory, but it works on any hardware with sufficient combined CPU and GPU memory.
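A rough back-of-the-envelope estimate shows why such models do not fit on a single GPU. The sketch below assumes mixed-precision Adam training, where model states take about 16 bytes per parameter (2 for 16-bit weights, 2 for 16-bit gradients, and 12 for fp32 master weights, momentum, and variance). Activations and temporary buffers are ignored, and the function name is illustrative only.

```python
# Bytes per parameter under mixed-precision Adam (an assumption of this sketch):
# 16-bit weights + 16-bit gradients + fp32 master weights, momentum, variance.
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb(num_params: float) -> float:
    """Approximate model-state memory in GB (10^9 bytes), excluding activations."""
    return num_params * BYTES_PER_PARAM / 1e9


for name, n in [("8B", 8e9), ("70B", 70e9)]:
    print(f"{name}: ~{model_state_gb(n):.0f} GB of model states")
```

Under these assumptions an 8B model needs on the order of 128 GB for model states alone, which is the gap that ZeRO-3 partitioning and CPU offloading are meant to close.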

Execution Steps

Step 1: Environment Setup

Configure the training environment with DeepSpeed, appropriate CUDA settings, and NUMA bindings for optimal CPU offloading performance.

Key considerations:

  • Install DeepSpeed with CPU Adam support
  • For GH200/GB200 systems, configure NUMA bindings for optimal memory access patterns
  • Set appropriate CUDA memory allocation settings to prevent fragmentation
  • Configure WandB (Weights & Biases) for experiment tracking if desired
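The considerations above could be captured in a small launch helper along the following lines. This is a sketch under assumptions, not the workflow's actual setup script: the script name finetune.py is hypothetical, expandable_segments is one allocator setting commonly used against fragmentation, and --bind_cores_to_rank is the DeepSpeed launcher flag assumed here for core binding. Check each against your installed versions.

```python
import os

# Installation note (run in a shell, not here): `pip install deepspeed`.
# Setting DS_BUILD_CPU_ADAM=1 at install time pre-builds the CPU Adam op
# instead of compiling it on first use.


def build_training_env(use_wandb: bool = False) -> dict:
    """Return a copy of the current environment with training-related settings."""
    env = dict(os.environ)
    # Let PyTorch's caching allocator grow segments to limit fragmentation.
    env["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    if not use_wandb:
        # Disable Weights & Biases logging when experiment tracking is not wanted.
        env["WANDB_MODE"] = "disabled"
    return env


def build_launch_command(script: str, num_gpus: int = 1) -> list:
    """Build a `deepspeed` launcher command that binds CPU cores per rank."""
    return ["deepspeed", "--num_gpus", str(num_gpus), "--bind_cores_to_rank", script]


print(" ".join(build_launch_command("finetune.py")))
```

The command list and environment can be passed to subprocess.run(cmd, env=env). On multi-socket or Superchip systems, confirm with numactl --hardware that the binding matches the memory layout you expect.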

Step 2: Dataset Preparation

Load and preprocess the training dataset into the instruction-following format expected by the model.

What happens:

  • Load the Alpaca instruction-following dataset from HuggingFace
  • Format examples into instruction/input/output structure
  • Tokenize with a maximum sequence length of 2048 tokens
  • Create a DataLoader with distributed sampling for multi-GPU training
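The formatting step can be illustrated with the prompt template published with the original Stanford Alpaca dataset. Whether this workflow uses exactly this template is an assumption, and format_alpaca is an illustrative name.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def format_alpaca(example: dict) -> str:
    """Turn one Alpaca record into a single prompt-plus-response training string."""
    # Records with an empty or missing "input" field use the shorter template.
    template = PROMPT_WITH_INPUT if example.get("input") else PROMPT_NO_INPUT
    return template.format(**example) + example["output"]


sample = {"instruction": "Add the numbers.", "input": "1 2", "output": "3"}
print(format_alpaca(sample))

# Next steps, not run here because they need external libraries:
#   tokenizer(text, truncation=True, max_length=2048)
#   torch.utils.data.distributed.DistributedSampler(dataset) for multi-GPU runs
```

The formatted strings would then be tokenized and wrapped in a DataLoader. With a distributed sampler, each GPU sees a disjoint shard of the dataset in every epoch.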

Step 3: Model Loading

Load the pretrained model with ZeRO Stage 3 configuration for memory-efficient initialization.

What happens:

  • Load tokenizer from HuggingFace model hub
  • Load pretrained model (e.g., LLaMA-3.1-8B, Phi-4, Qwen3-14B) with appropriate dtype
  • Enable gradient checkpointing, which reduces activation memory by recomputing activations during the backward pass
  • Configure DeepSpeedCPUAdam optimizer for efficient CPU-based parameter updates
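A minimal sketch of these loading steps, assuming the Hugging Face transformers API and DeepSpeed's DeepSpeedCPUAdam. The function name and the learning rate are placeholders, not values taken from the workflow. The heavy imports sit inside the function so the snippet can be imported on a machine without a GPU stack.

```python
def load_model_and_optimizer(model_name: str, lr: float = 1e-5):
    """Load tokenizer, bf16 model, and a DeepSpeedCPUAdam optimizer."""
    # Imported lazily: these packages are large and need a CUDA-capable setup.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from deepspeed.ops.adam import DeepSpeedCPUAdam

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        # Many causal LM tokenizers ship without a pad token; reuse EOS.
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model.gradient_checkpointing_enable()  # recompute activations in backward
    model.config.use_cache = False         # the KV cache is not needed for training

    # Optimizer step runs on the CPU, matching the CPU-offloaded optimizer states.
    optimizer = DeepSpeedCPUAdam(model.parameters(), lr=lr)
    return tokenizer, model, optimizer
```

A call such as load_model_and_optimizer("meta-llama/Llama-3.1-8B") would return the three objects that are later handed to the DeepSpeed engine. The model identifier is shown only as an example.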

Step 4: DeepSpeed Initialization

Initialize the DeepSpeed engine with ZeRO Stage 3 and SuperOffload configuration.

What happens:

  • Configure ZeRO Stage 3 with parameter, gradient, and optimizer state partitioning
  • Enable CPU offloading for optimizer states (90% offload ratio) and parameters
  • Set pin_memory for faster CPU-GPU transfers
  • Configure mixed precision training (bf16)
  • Initialize the DeepSpeed engine, wrapping the model, optimizer, and data loader
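The settings listed above map onto a DeepSpeed configuration roughly as follows. This is a sketch: the batch-size values are placeholders, and the super_offload key is the switch assumed here for enabling SuperOffload, so verify it against the documentation for your DeepSpeed version.

```python
def build_ds_config(micro_batch_size: int = 1, grad_accum_steps: int = 1) -> dict:
    """Build a ZeRO-3 config with CPU offloading and bf16 (values are placeholders)."""
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,  # partition parameters, gradients, and optimizer states
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,     # page-locked host memory for faster copies
                "ratio": 0.90,          # 90% offload ratio, as stated in the step
                "super_offload": True,  # assumed key name; check your version
            },
            "offload_param": {"device": "cpu", "pin_memory": True},
        },
    }


config = build_ds_config()
print(config["zero_optimization"]["offload_optimizer"])

# Engine creation, not run here because it needs DeepSpeed and a GPU:
#   engine, optimizer, loader, _ = deepspeed.initialize(
#       model=model, optimizer=optimizer, training_data=dataset, config=config)
```

Building the configuration as a Python dict rather than a JSON file keeps the offload ratio and batch sizes easy to override from command-line arguments.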

Step 5: Training Loop

Execute the fine-tuning training loop with loss computation and gradient updates.

What happens:

  • Iterate over the dataset for the configured number of epochs
  • Compute causal language modeling loss on instruction-response sequences
  • DeepSpeed handles gradient accumulation, communication, and optimizer steps automatically
  • Log training metrics (loss, throughput in TFLOPS) at regular intervals
  • Optionally sync metrics with WandB for experiment tracking
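The loop can be sketched against the DeepSpeed engine interface (forward call, backward, step). The version below is an illustration rather than the workflow's actual script. It assumes batches are dicts of tensors already on the right device and that the model returns an object with a loss attribute, as Hugging Face causal LMs do when labels are supplied. The throughput helper uses the common approximation of about 6 FLOPs per parameter per token for forward plus backward. With full activation recomputation the factor is often taken as 8.

```python
def estimate_tflops(num_params: float, tokens: int, seconds: float,
                    flops_per_param_token: float = 6.0) -> float:
    """Approximate achieved TFLOPS from model size, tokens processed, and time."""
    return flops_per_param_token * num_params * tokens / seconds / 1e12


def train(engine, dataloader, num_epochs: int = 1, log_every: int = 10) -> list:
    """Run the fine-tuning loop and return the losses recorded at log steps."""
    logged = []
    step = 0
    for epoch in range(num_epochs):
        for batch in dataloader:
            outputs = engine(**batch)  # forward pass through the wrapped model
            loss = outputs.loss        # causal LM loss over the sequence
            engine.backward(loss)      # DeepSpeed scales and reduces gradients
            engine.step()              # optimizer step at accumulation boundaries
            step += 1
            if step % log_every == 0:
                logged.append(float(loss))
                print(f"epoch {epoch} step {step} loss {logged[-1]:.4f}")
    return logged
```

Because gradient accumulation is handled inside engine.step(), the loop calls it after every micro-batch and does not track accumulation boundaries itself.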

Step 6: Checkpoint Saving

Save the fine-tuned model checkpoint for later inference or further training.

What happens:

  • Use DeepSpeed's checkpoint saving, which handles the partitioned ZeRO-3 states (each rank writes its own shard, and the weights can be consolidated afterwards)
  • Save model weights, optimizer states, and training configuration
  • Support resuming training from saved checkpoints
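Saving and resuming can be sketched with the engine's save_checkpoint and load_checkpoint methods. The helper names, the tag format, and the choice to store the step counter in client_state are illustrative assumptions.

```python
def save_step_checkpoint(engine, save_dir: str, step: int) -> str:
    """Save a DeepSpeed checkpoint tagged by step; every rank must call this."""
    tag = f"step_{step}"
    # client_state carries user data that is returned again on load.
    engine.save_checkpoint(save_dir, tag=tag, client_state={"step": step})
    return tag


def resume_step(engine, load_dir: str, tag=None) -> int:
    """Load a checkpoint and return the step to resume from (0 if none found)."""
    load_path, client_state = engine.load_checkpoint(load_dir, tag=tag)
    if load_path is None:
        return 0  # nothing to resume; start from scratch
    return client_state.get("step", 0)
```

Under ZeRO-3 each rank writes its own partition, so the save call has to run on all ranks rather than only on rank 0. With tag=None the load call picks up the most recently saved tag in the directory.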
