
Workflow:Deepspeedai DeepSpeed Pipeline Parallel Training

From Leeroopedia


Knowledge Sources
Domains Distributed_Training, Model_Parallelism, LLMs
Last Updated 2026-02-09 00:00 GMT

Overview

End-to-end process for training large models using DeepSpeed's pipeline parallelism, partitioning model layers across GPUs to enable training of models that exceed single-GPU memory capacity.

Description

This workflow covers training deep learning models with pipeline parallelism, where the model's layers are distributed across multiple GPUs in a staged pipeline. Each GPU processes a different subset of layers, and micro-batches flow through the pipeline stages. DeepSpeed's PipelineEngine implements an efficient 1F1B (one-forward, one-backward) schedule to maximize GPU utilization and minimize pipeline bubbles. The workflow walks through wrapping a sequential model into a PipelineModule, configuring the pipeline topology, and running the distributed training loop. Pipeline parallelism can be combined with data parallelism and ZeRO optimization for maximum scalability.

Usage

Execute this workflow when your model has a naturally sequential layer structure and is too large to fit on a single GPU even with ZeRO Stage 3, or when you want to combine model parallelism with data parallelism for better throughput. Pipeline parallelism is especially effective for transformer-based models where layers can be cleanly partitioned. Use this when you have multiple GPUs and want to train models with tens of billions of parameters.

Execution Steps

Step 1: Model Architecture Design

Design the model as a sequence of layers that can be cleanly partitioned across pipeline stages. The model should be expressible as a torch.nn.Sequential or a list of layers. Each layer becomes a unit that can be assigned to a different GPU. Consider the compute and memory balance across stages to minimize pipeline bubbles.

Key considerations:

  • Layers should have uniform compute cost for balanced pipeline stages
  • Each layer must have compatible input/output tensor shapes for sequential execution
  • Embedding layers and output heads count as pipeline layers and must be included in the partition
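A minimal sketch of this layer-list structure, using an illustrative small transformer (the sizes and the `build_layers` helper are hypothetical, not part of the workflow):

```python
import torch
import torch.nn as nn

def build_layers(vocab_size=1000, hidden=128, n_blocks=4):
    """Return the model as a flat list of layers, one unit per pipeline layer."""
    layers = [nn.Embedding(vocab_size, hidden)]  # input embedding is a layer too
    for _ in range(n_blocks):
        # Uniform blocks keep compute balanced across pipeline stages.
        layers.append(nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                                 batch_first=True))
    layers.append(nn.Linear(hidden, vocab_size))  # output head is the final layer
    return layers

layers = build_layers()
```

Because each layer's output feeds directly into the next layer's input, the list can be executed sequentially, which is exactly the contract PipelineModule relies on.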

Step 2: PipelineModule Construction

Wrap the sequential model layers into a DeepSpeed PipelineModule, specifying the number of pipeline stages. The PipelineModule handles automatic layer-to-stage assignment and manages the communication of activations between stages.

Key considerations:

  • PipelineModule accepts layers as a list or torch.nn.Sequential
  • num_stages determines how many GPUs are used for model parallelism
  • The module automatically assigns layers to stages based on compute balance
  • Custom partition schemes can be provided for non-uniform models
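A sketch of the wrapping step, assuming a layer list like the one from Step 1 (the sizes and stage count are illustrative; running this requires DeepSpeed and an initialized distributed environment):

```python
import torch.nn as nn
from deepspeed.pipe import PipelineModule

# Illustrative layer list; in practice, use your model's layers from Step 1.
hidden, vocab = 128, 1000
layers = ([nn.Embedding(vocab, hidden)]
          + [nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
             for _ in range(4)]
          + [nn.Linear(hidden, vocab)])

net = PipelineModule(
    layers=layers,                   # list of nn.Module units
    num_stages=2,                    # GPUs used for the pipeline dimension
    loss_fn=nn.CrossEntropyLoss(),   # evaluated only on the last stage
    partition_method="parameters",   # balance stages by parameter count
)
```

The `partition_method` argument controls automatic layer-to-stage assignment; `"parameters"` balances by parameter count, while `"uniform"` splits by layer count.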

Step 3: Configuration and Initialization

Create a DeepSpeed configuration with pipeline-specific settings and call deepspeed.initialize() with the PipelineModule. When a PipelineModule is detected, DeepSpeed automatically creates a PipelineEngine instead of the standard DeepSpeedEngine. The engine manages the pipeline schedule, micro-batch routing, and gradient accumulation across stages.

Key considerations:

  • Do not pass mpu when using PipelineModule (the module creates its own)
  • Gradient accumulation steps determine the number of micro-batches in the pipeline
  • ZeRO Stage 0 or 1 is typically used with pipeline parallelism
  • The PipelineEngine handles its own training schedule internally
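A configuration and initialization sketch, assuming `net` is the PipelineModule from Step 2 and an 8-GPU job with 2 pipeline stages (so data-parallel degree 4; all numeric values are illustrative):

```python
import deepspeed

ds_config = {
    # train_batch_size = micro_batch * gradient_accumulation_steps * dp_degree
    # Here: 4 * 16 * 4 = 256.
    "train_batch_size": 256,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 16,   # number of in-flight micro-batches
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},   # stage 0 or 1 with pipeline parallelism
}

# DeepSpeed detects the PipelineModule and returns a PipelineEngine.
engine, _, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=[p for p in net.parameters() if p.requires_grad],
    config=ds_config,
)
```

Note that `gradient_accumulation_steps` directly sets how many micro-batches are in flight in the pipeline, so it should be large enough (relative to the stage count) to keep bubbles small.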

Step 4: Pipeline Training Loop

Execute training using the PipelineEngine's train_batch() method, which orchestrates the entire pipeline schedule for one global batch. The engine handles micro-batch splitting, forward passes through all stages, backward passes, and gradient synchronization automatically. Each GPU only computes its assigned pipeline stage.

Key considerations:

  • train_batch() handles the full forward-backward pipeline schedule
  • Communication of activations between stages happens automatically
  • Only the last stage computes the loss
  • Gradient accumulation across micro-batches is handled by the engine
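A minimal training-loop sketch, assuming `engine` is the PipelineEngine from Step 3 and `train_iter` is an iterator yielding `(inputs, labels)` tuples (`num_steps` is a placeholder):

```python
for step in range(num_steps):
    # train_batch() pulls gradient_accumulation_steps micro-batches from the
    # iterator and runs the full 1F1B forward/backward schedule, returning
    # the mean loss for the global batch.
    loss = engine.train_batch(data_iter=train_iter)

    if engine.global_rank == 0 and step % 10 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```

There is no separate `loss.backward()` or `optimizer.step()` call: the engine owns the schedule, so one `train_batch()` call covers the entire forward, backward, and update for a global batch.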

Step 5: Checkpoint and Evaluation

Save checkpoints using the engine's save_checkpoint() method, which handles the per-stage model shards. For evaluation, the pipeline engine supports an inference schedule that performs only forward passes through the pipeline stages.

Key considerations:

  • Each pipeline stage saves its own layer weights
  • Pipeline checkpoints can be converted to standard format for deployment
  • Inference mode uses a separate schedule optimized for forward-only passes
  • Resuming training from checkpoints preserves the pipeline topology
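A checkpoint-and-evaluation sketch, assuming the same `engine` and an `eval_iter` data iterator (paths and tags are illustrative):

```python
# Each pipeline stage writes only its own layer shards under the save directory.
engine.save_checkpoint("checkpoints/run1", tag="step1000")

# eval_batch() runs the forward-only inference schedule through all stages
# and returns the loss computed on the last stage.
eval_loss = engine.eval_batch(data_iter=eval_iter)

# Resuming requires the same pipeline topology (num_stages and partition)
# that produced the checkpoint.
engine.load_checkpoint("checkpoints/run1", tag="step1000")
```

Because the checkpoint is sharded per stage, deploying the model elsewhere typically involves a conversion step that merges the shards back into a standard single-file state dict.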

Execution Diagram

GitHub URL

Workflow Repository