
Principle:Pytorch Serve Pipeline Parallelism

From Leeroopedia
Field Value
Page Type Principle
Title Pipeline Parallelism
Domains Distributed_Computing, Model_Serving
Knowledge Sources TorchServe
Last Updated 2026-02-13 00:00 GMT

Overview

Pipeline parallelism is a model parallelism strategy that splits a neural network into sequential stages, with each stage assigned to a different GPU. During inference, input data flows through the pipeline: GPU 0 processes stage 0 and passes the intermediate activations to GPU 1, which processes stage 1, and so on. This enables serving models whose total parameter size exceeds the memory capacity of any single GPU. TorchServe integrates PiPPy (Pipeline Parallelism for PyTorch), a PyTorch-native pipeline parallelism library, to provide this capability.

Description

In pipeline parallelism, a model with N layers is partitioned into K stages (where K is the number of available GPUs). Each stage contains approximately N/K layers. The stages form a sequential chain:

  1. Stage 0 (GPU 0) receives the original input, processes it through its assigned layers, and produces intermediate activations.
  2. Stage 1 (GPU 1) receives the activations from Stage 0, processes them, and passes the result forward.
  3. This continues until the final stage produces the output.
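The staged dataflow above can be sketched with plain Python functions standing in for per-GPU stages (the placeholder layers and the even split are illustrative only; PiPPy derives the split automatically via FX tracing):

```python
# Toy pipeline: a "model" of 6 layers split into K = 3 stages of 2 layers each.
# Each layer is a simple function; in real pipeline parallelism each stage's
# layers live on a different GPU and activations are transferred between devices.
layers = [lambda x, i=i: x + i for i in range(6)]  # placeholder layers

K = 3
per_stage = len(layers) // K  # approximately N/K layers per stage
stages = [layers[s * per_stage:(s + 1) * per_stage] for s in range(K)]

def run_pipeline(x):
    # Stage s receives the previous stage's activations and passes its
    # result to stage s + 1, mirroring GPU 0 -> GPU 1 -> GPU 2.
    for stage in stages:
        for layer in stage:
            x = layer(x)
    return x

print(run_pipeline(0))  # 0 + (0+1+2+3+4+5) = 15
```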

PiPPy (Pipeline Parallelism for PyTorch) automates this process for TorchServe:

  • It uses split_into_equal_size(world_size) to partition the model into equal-sized stages.
  • It employs FX tracing to analyze the model graph and determine the partition points.
  • For HuggingFace models, it uses PiPPyHFTracer to handle HuggingFace-specific model structures.
  • It uses pippy.all_compile() to compile the pipeline, distributing stages across the available ranks.
  • Inter-process communication is handled via PyTorch RPC (torch.distributed.rpc).

Microbatching is used to improve pipeline utilization. A batch of size B is split into C chunks (microbatches), which are fed into the pipeline in sequence. While Stage 1 processes chunk 1, Stage 2 can already process chunk 0, reducing pipeline idle time (the bubble).
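The splitting step can be illustrated with plain Python lists standing in for input tensors (PiPPy performs the equivalent split on tensors internally):

```python
# Split a batch of B samples into C microbatches (chunks) before feeding
# them through the pipeline. Plain lists stand in for tensors here.
B, C = 8, 4
batch = list(range(B))             # placeholder for a batch of 8 inputs
chunk_size = B // C                # microbatch_size = batch_size / chunks = 2
microbatches = [batch[i * chunk_size:(i + 1) * chunk_size] for i in range(C)]
print(microbatches)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```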

TorchServe uses torchrun to launch one process per GPU when parallelType: "pp" is configured. Each process initializes its RPC worker, and the pipeline driver coordinates execution across all stages.

Usage

To use pipeline parallelism in TorchServe:

  1. Create a custom handler that inherits from BasePippyHandler.
  2. In the handler's initialize() method, load the model and call get_pipeline_driver(model, world_size, ctx) to compile the pipeline.
  3. Configure model-config.yaml with parallelType: "pp" and set PiPPy-specific parameters under the pippy: section.
  4. Package the model with torch-model-archiver and deploy.
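A handler following steps 1 and 2 might look like the sketch below. It is modeled on TorchServe's PiPPy examples, so treat the import paths and signatures (`BasePippyHandler`, `get_pipeline_driver`) as assumptions to verify against your TorchServe version; `load_my_model` is a hypothetical loader:

```python
# Sketch of a pipeline-parallel custom handler (assumes TorchServe's PiPPy
# integration; import paths may differ between TorchServe releases).
import torch
from ts.torch_handler.distributed.base_pippy_handler import BasePippyHandler
from ts.handler_utils.distributed.pt_pippy import get_pipeline_driver

class PipelineHandler(BasePippyHandler):
    def initialize(self, ctx):
        super().initialize(ctx)
        model = load_my_model(ctx)  # hypothetical: load weights from the model dir
        # Compile the pipeline: partitions the model into world_size stages
        # and returns a driver that coordinates execution across ranks.
        self.model = get_pipeline_driver(model, self.world_size, ctx)

    def inference(self, input_batch):
        with torch.no_grad():
            return self.model(**input_batch)
```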

The key configuration parameters include:

  • pippy.rpc_timeout: RPC timeout for inter-process communication (in seconds).
  • pippy.chunks: Number of microbatches (microbatch_size = batch_size / chunks).
  • pippy.model_type: Set to "HF" for HuggingFace models to use the specialized tracer.
  • pippy.input_names: Input argument names for FX tracing (e.g., ["input_ids"]).
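Put together, a model-config.yaml using these parameters might look like the following sketch (values are illustrative, not recommendations; check key names against your TorchServe release):

```yaml
# Illustrative model-config.yaml for pipeline parallelism (example values)
minWorkers: 1
maxWorkers: 1
parallelType: "pp"          # enable pipeline parallelism
deviceType: "gpu"
torchrun:
  nproc-per-node: 4         # one process per GPU / pipeline stage
pippy:
  rpc_timeout: 1800         # RPC timeout in seconds
  chunks: 4                 # number of microbatches per batch
  model_type: "HF"          # use the HuggingFace-specific tracer
  input_names: ["input_ids"]
```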

Theoretical Basis

Pipeline parallelism is rooted in the observation that deep neural networks are fundamentally sequential compositions of layers. By partitioning the layer sequence across devices, each device only needs to hold a fraction of the model parameters.

The key theoretical concept is the pipeline bubble. In a pipeline with K stages, the first K-1 microbatches experience partial pipeline utilization as stages wait for upstream results. The bubble ratio is approximately (K-1) / (K-1+C) where C is the number of microbatches. Increasing C (more microbatches) reduces the bubble but increases memory usage for stored activations.
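A quick calculation makes the trade-off concrete:

```python
# Bubble ratio for a K-stage pipeline fed C microbatches:
#   bubble ≈ (K - 1) / (K - 1 + C)
def bubble_ratio(K, C):
    return (K - 1) / (K - 1 + C)

# With 4 stages, going from 4 to 16 microbatches shrinks the idle fraction:
print(round(bubble_ratio(4, 4), 3))   # 3/7  ≈ 0.429
print(round(bubble_ratio(4, 16), 3))  # 3/19 ≈ 0.158
```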

The FillDrain schedule used by PiPPy is the simplest pipeline schedule:

  • Fill phase: Microbatches enter the pipeline one at a time, progressively activating more stages.
  • Drain phase: After all microbatches have entered, stages complete processing and the pipeline drains.
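The fill and drain phases are visible in a tick-by-tick simulation of the schedule (a toy model in which every stage takes exactly one time step per microbatch, an assumption for illustration):

```python
# Simulate a FillDrain schedule: stage s processes microbatch c at tick s + c.
def fill_drain_schedule(K, C):
    ticks = K + C - 1                       # total steps: fill + steady + drain
    timeline = []
    for t in range(ticks):
        # At tick t, stage s is busy iff 0 <= t - s < C.
        busy = [(s, t - s) for s in range(K) if 0 <= t - s < C]
        timeline.append(busy)
    return timeline

for t, busy in enumerate(fill_drain_schedule(3, 4)):
    print(f"tick {t}: " + ", ".join(f"stage{s}<-mb{c}" for s, c in busy))
# Early ticks show the fill phase (stages activating one by one);
# the last K - 1 ticks show the drain phase (stages going idle).
```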

Communication in PiPPy uses PyTorch RPC (Remote Procedure Calls) with the TensorPipe backend. Each worker is assigned a rank and communicates activations to the next rank. The set_device_map configuration ensures that tensors are automatically transferred between the correct GPU devices during RPC calls.

For HuggingFace models, the PiPPyHFTracer extends FX tracing to handle model-specific patterns such as conditional branches, optional arguments, and custom attention implementations that standard FX tracing cannot handle.
