
Workflow:Deepspeedai DeepSpeed Pipeline Parallel Training

From Leeroopedia


Knowledge Sources
Domains Distributed_Training, Model_Parallelism, LLMs
Last Updated 2026-02-09 00:00 GMT

Overview

End-to-end process for training large models using DeepSpeed's pipeline parallelism, partitioning model layers across GPUs to enable training of models that exceed single-GPU memory capacity.

Description

This workflow covers training deep learning models with pipeline parallelism, where the model's layers are distributed across multiple GPUs in a staged pipeline. Each GPU processes a different subset of layers, and micro-batches flow through the pipeline stages. DeepSpeed's PipelineEngine implements an efficient 1F1B (one-forward, one-backward) schedule to maximize GPU utilization and minimize pipeline bubbles. The workflow walks through wrapping a sequential model into a PipelineModule, configuring the pipeline topology, and running the distributed training loop. Pipeline parallelism can be combined with data parallelism and ZeRO optimization for maximum scalability.

Usage

Execute this workflow when your model has a naturally sequential layer structure and is too large to fit on a single GPU even with ZeRO Stage 3, or when you want to combine model parallelism with data parallelism for better throughput. Pipeline parallelism is especially effective for transformer-based models where layers can be cleanly partitioned. Use this when you have multiple GPUs and want to train models with tens of billions of parameters.

Execution Steps

Step 1: Model Architecture Design

Design the model as a sequence of layers that can be cleanly partitioned across pipeline stages. The model should be expressible as a torch.nn.Sequential or a list of layers. Each layer becomes a unit that can be assigned to a different GPU. Consider the compute and memory balance across stages to minimize pipeline bubbles.

Key considerations:

  • Layers should have uniform compute cost for balanced pipeline stages
  • Each layer must have compatible input/output tensor shapes for sequential execution
  • Embedding layers and output heads count as pipeline layers and must be included in the partition
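A minimal sketch of this layer-list structure, using an illustrative small transformer (the sizes and the `build_layers` helper are hypothetical, not part of the workflow):

```python
import torch
import torch.nn as nn

def build_layers(vocab_size=1000, hidden=128, n_blocks=4):
    """Return the model as a flat list of layers, one unit per pipeline layer."""
    layers = [nn.Embedding(vocab_size, hidden)]  # input embedding is a layer too
    for _ in range(n_blocks):
        # Uniform blocks keep compute balanced across pipeline stages.
        layers.append(nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                                 batch_first=True))
    layers.append(nn.Linear(hidden, vocab_size))  # output head is the final layer
    return layers

layers = build_layers()
```

Because each layer's output feeds directly into the next layer's input, the list can be executed sequentially, which is exactly the contract PipelineModule relies on.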

Step 2: PipelineModule Construction

Wrap the sequential model layers into a DeepSpeed PipelineModule, specifying the number of pipeline stages. The PipelineModule handles automatic layer-to-stage assignment and manages the communication of activations between stages.

Key considerations:

  • PipelineModule accepts layers as a list or torch.nn.Sequential
  • num_stages determines how many GPUs are used for model parallelism
  • The module automatically assigns layers to stages based on compute balance
  • Custom partition schemes can be provided for non-uniform models
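A sketch of the wrapping step, assuming a layer list like the one from Step 1 (the sizes and stage count are illustrative; running this requires DeepSpeed and an initialized distributed environment):

```python
import torch.nn as nn
from deepspeed.pipe import PipelineModule

# Illustrative layer list; in practice, use your model's layers from Step 1.
hidden, vocab = 128, 1000
layers = ([nn.Embedding(vocab, hidden)]
          + [nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
             for _ in range(4)]
          + [nn.Linear(hidden, vocab)])

net = PipelineModule(
    layers=layers,                   # list of nn.Module units
    num_stages=2,                    # GPUs used for the pipeline dimension
    loss_fn=nn.CrossEntropyLoss(),   # evaluated only on the last stage
    partition_method="parameters",   # balance stages by parameter count
)
```

The `partition_method` argument controls automatic layer-to-stage assignment; `"parameters"` balances by parameter count, while `"uniform"` splits by layer count.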

Step 3: Configuration and Initialization

Create a DeepSpeed configuration with pipeline-specific settings and call deepspeed.initialize() with the PipelineModule. When a PipelineModule is detected, DeepSpeed automatically creates a PipelineEngine instead of the standard DeepSpeedEngine. The engine manages the pipeline schedule, micro-batch routing, and gradient accumulation across stages.

Key considerations:

  • Do not pass mpu when using PipelineModule (the module creates its own)
  • Gradient accumulation steps determine the number of micro-batches in the pipeline
  • ZeRO Stage 0 or 1 is typically used with pipeline parallelism
  • The PipelineEngine handles its own training schedule internally
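A configuration and initialization sketch, assuming `net` is the PipelineModule from Step 2 and an 8-GPU job with 2 pipeline stages (so data-parallel degree 4; all numeric values are illustrative):

```python
import deepspeed

ds_config = {
    # train_batch_size = micro_batch * gradient_accumulation_steps * dp_degree
    # Here: 4 * 16 * 4 = 256.
    "train_batch_size": 256,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 16,   # number of in-flight micro-batches
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},   # stage 0 or 1 with pipeline parallelism
}

# DeepSpeed detects the PipelineModule and returns a PipelineEngine.
engine, _, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=[p for p in net.parameters() if p.requires_grad],
    config=ds_config,
)
```

Note that `gradient_accumulation_steps` directly sets how many micro-batches are in flight in the pipeline, so it should be large enough (relative to the stage count) to keep bubbles small.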

Step 4: Pipeline Training Loop

Execute training using the PipelineEngine's train_batch() method, which orchestrates the entire pipeline schedule for one global batch. The engine handles micro-batch splitting, forward passes through all stages, backward passes, and gradient synchronization automatically. Each GPU only computes its assigned pipeline stage.

Key considerations:

  • train_batch() handles the full forward-backward pipeline schedule
  • Communication of activations between stages happens automatically
  • Only the last stage computes the loss
  • Gradient accumulation across micro-batches is handled by the engine
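A minimal training-loop sketch, assuming `engine` is the PipelineEngine from Step 3 and `train_iter` is an iterator yielding `(inputs, labels)` tuples (`num_steps` is a placeholder):

```python
for step in range(num_steps):
    # train_batch() pulls gradient_accumulation_steps micro-batches from the
    # iterator and runs the full 1F1B forward/backward schedule, returning
    # the mean loss for the global batch.
    loss = engine.train_batch(data_iter=train_iter)

    if engine.global_rank == 0 and step % 10 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```

There is no separate `loss.backward()` or `optimizer.step()` call: the engine owns the schedule, so one `train_batch()` call covers the entire forward, backward, and update for a global batch.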

Step 5: Checkpoint and Evaluation

Save checkpoints using the engine's save_checkpoint() method, which handles the per-stage model shards. For evaluation, the pipeline engine supports an inference schedule that performs only forward passes through the pipeline stages.

Key considerations:

  • Each pipeline stage saves its own layer weights
  • Pipeline checkpoints can be converted to standard format for deployment
  • Inference mode uses a separate schedule optimized for forward-only passes
  • Resuming training from checkpoints preserves the pipeline topology
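A checkpoint-and-evaluation sketch, assuming the same `engine` and an `eval_iter` data iterator (paths and tags are illustrative):

```python
# Each pipeline stage writes only its own layer shards under the save directory.
engine.save_checkpoint("checkpoints/run1", tag="step1000")

# eval_batch() runs the forward-only inference schedule through all stages
# and returns the loss computed on the last stage.
eval_loss = engine.eval_batch(data_iter=eval_iter)

# Resuming requires the same pipeline topology (num_stages and partition)
# that produced the checkpoint.
engine.load_checkpoint("checkpoints/run1", tag="step1000")
```

Because the checkpoint is sharded per stage, deploying the model elsewhere typically involves a conversion step that merges the shards back into a standard single-file state dict.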

Execution Diagram

GitHub URL

Workflow Repository