
Principle:Pytorch Serve Pipeline Parallelism

From Leeroopedia
Field Value
Page Type Principle
Title Pipeline Parallelism
Domains Distributed_Computing, Model_Serving
Knowledge Sources TorchServe
Last Updated 2026-02-13 00:00 GMT

Overview

Pipeline parallelism is a model parallelism strategy that splits a neural network into sequential stages, with each stage assigned to a different GPU. During inference, input data flows through the pipeline: GPU 0 processes stage 0 and passes the intermediate activations to GPU 1, which processes stage 1, and so on. This enables serving models whose total parameter size exceeds the memory capacity of any single GPU. TorchServe integrates PiPPy (Pipeline Parallelism for PyTorch), a PyTorch-native pipeline parallelism library, to provide this capability.

Description

In pipeline parallelism, a model with N layers is partitioned into K stages (where K is the number of available GPUs). Each stage contains approximately N/K layers. The stages form a sequential chain:

  1. Stage 0 (GPU 0) receives the original input, processes it through its assigned layers, and produces intermediate activations.
  2. Stage 1 (GPU 1) receives the activations from Stage 0, processes them, and passes the result forward.
  3. This continues until the final stage produces the output.
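The staged dataflow above can be sketched with plain Python functions standing in for per-GPU stages (the placeholder layers and the even split are illustrative only; PiPPy derives the split automatically via FX tracing):

```python
# Toy pipeline: a "model" of 6 layers split into K = 3 stages of 2 layers each.
# Each layer is a simple function; in real pipeline parallelism each stage's
# layers live on a different GPU and activations are transferred between devices.
layers = [lambda x, i=i: x + i for i in range(6)]  # placeholder layers

K = 3
per_stage = len(layers) // K  # approximately N/K layers per stage
stages = [layers[s * per_stage:(s + 1) * per_stage] for s in range(K)]

def run_pipeline(x):
    # Stage s receives the previous stage's activations and passes its
    # result to stage s + 1, mirroring GPU 0 -> GPU 1 -> GPU 2.
    for stage in stages:
        for layer in stage:
            x = layer(x)
    return x

print(run_pipeline(0))  # 0 + (0+1+2+3+4+5) = 15
```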

PiPPy (Pipeline Parallelism for PyTorch) automates this process for TorchServe:

  • It uses split_into_equal_size(world_size) to partition the model into equal-sized stages.
  • It employs FX tracing to analyze the model graph and determine the partition points.
  • For HuggingFace models, it uses PiPPyHFTracer to handle HuggingFace-specific model structures.
  • It uses pippy.all_compile() to compile the pipeline, distributing stages across the available ranks.
  • Inter-process communication is handled via PyTorch RPC (torch.distributed.rpc).

Microbatching is used to improve pipeline utilization. A batch of size B is split into C chunks (microbatches), which are fed into the pipeline in sequence. While Stage 1 processes chunk 1, Stage 2 can already process chunk 0, reducing pipeline idle time (the bubble).
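The splitting step can be illustrated with plain Python lists standing in for input tensors (PiPPy performs the equivalent split on tensors internally):

```python
# Split a batch of B samples into C microbatches (chunks) before feeding
# them through the pipeline. Plain lists stand in for tensors here.
B, C = 8, 4
batch = list(range(B))             # placeholder for a batch of 8 inputs
chunk_size = B // C                # microbatch_size = batch_size / chunks = 2
microbatches = [batch[i * chunk_size:(i + 1) * chunk_size] for i in range(C)]
print(microbatches)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```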

TorchServe uses torchrun to launch one process per GPU when parallelType: "pp" is configured. Each process initializes its RPC worker, and the pipeline driver coordinates execution across all stages.

Usage

To use pipeline parallelism in TorchServe:

  1. Create a custom handler that inherits from BasePippyHandler.
  2. In the handler's initialize() method, load the model and call get_pipeline_driver(model, world_size, ctx) to compile the pipeline.
  3. Configure model-config.yaml with parallelType: "pp" and set PiPPy-specific parameters under the pippy: section.
  4. Package the model with torch-model-archiver and deploy.
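A handler following steps 1 and 2 might look like the sketch below. It is modeled on TorchServe's PiPPy examples, so treat the import paths and signatures (`BasePippyHandler`, `get_pipeline_driver`) as assumptions to verify against your TorchServe version; `load_my_model` is a hypothetical loader:

```python
# Sketch of a pipeline-parallel custom handler (assumes TorchServe's PiPPy
# integration; import paths may differ between TorchServe releases).
import torch
from ts.torch_handler.distributed.base_pippy_handler import BasePippyHandler
from ts.handler_utils.distributed.pt_pippy import get_pipeline_driver

class PipelineHandler(BasePippyHandler):
    def initialize(self, ctx):
        super().initialize(ctx)
        model = load_my_model(ctx)  # hypothetical: load weights from the model dir
        # Compile the pipeline: partitions the model into world_size stages
        # and returns a driver that coordinates execution across ranks.
        self.model = get_pipeline_driver(model, self.world_size, ctx)

    def inference(self, input_batch):
        with torch.no_grad():
            return self.model(**input_batch)
```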

The key configuration parameters include:

  • pippy.rpc_timeout: RPC timeout for inter-process communication (in seconds).
  • pippy.chunks: Number of microbatches (microbatch_size = batch_size / chunks).
  • pippy.model_type: Set to "HF" for HuggingFace models to use the specialized tracer.
  • pippy.input_names: Input argument names for FX tracing (e.g., ["input_ids"]).
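Put together, a model-config.yaml using these parameters might look like the following sketch (values are illustrative, not recommendations; check key names against your TorchServe release):

```yaml
# Illustrative model-config.yaml for pipeline parallelism (example values)
minWorkers: 1
maxWorkers: 1
parallelType: "pp"          # enable pipeline parallelism
deviceType: "gpu"
torchrun:
  nproc-per-node: 4         # one process per GPU / pipeline stage
pippy:
  rpc_timeout: 1800         # RPC timeout in seconds
  chunks: 4                 # number of microbatches per batch
  model_type: "HF"          # use the HuggingFace-specific tracer
  input_names: ["input_ids"]
```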

Theoretical Basis

Pipeline parallelism is rooted in the observation that deep neural networks are fundamentally sequential compositions of layers. By partitioning the layer sequence across devices, each device only needs to hold a fraction of the model parameters.

The key theoretical concept is the pipeline bubble. In a pipeline with K stages, the first K-1 microbatches experience partial pipeline utilization as stages wait for upstream results. The bubble ratio is approximately (K-1) / (K-1+C) where C is the number of microbatches. Increasing C (more microbatches) reduces the bubble but increases memory usage for stored activations.
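A quick calculation makes the trade-off concrete:

```python
# Bubble ratio for a K-stage pipeline fed C microbatches:
#   bubble ≈ (K - 1) / (K - 1 + C)
def bubble_ratio(K, C):
    return (K - 1) / (K - 1 + C)

# With 4 stages, going from 4 to 16 microbatches shrinks the idle fraction:
print(round(bubble_ratio(4, 4), 3))   # 3/7  ≈ 0.429
print(round(bubble_ratio(4, 16), 3))  # 3/19 ≈ 0.158
```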

The FillDrain schedule used by PiPPy is the simplest pipeline schedule:

  • Fill phase: Microbatches enter the pipeline one at a time, progressively activating more stages.
  • Drain phase: After all microbatches have entered, stages complete processing and the pipeline drains.
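The fill and drain phases are visible in a tick-by-tick simulation of the schedule (a toy model in which every stage takes exactly one time step per microbatch, an assumption for illustration):

```python
# Simulate a FillDrain schedule: stage s processes microbatch c at tick s + c.
def fill_drain_schedule(K, C):
    ticks = K + C - 1                       # total steps: fill + steady + drain
    timeline = []
    for t in range(ticks):
        # At tick t, stage s is busy iff 0 <= t - s < C.
        busy = [(s, t - s) for s in range(K) if 0 <= t - s < C]
        timeline.append(busy)
    return timeline

for t, busy in enumerate(fill_drain_schedule(3, 4)):
    print(f"tick {t}: " + ", ".join(f"stage{s}<-mb{c}" for s, c in busy))
# Early ticks show the fill phase (stages activating one by one);
# the last K - 1 ticks show the drain phase (stages going idle).
```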

Communication in PiPPy uses PyTorch RPC (Remote Procedure Calls) with the TensorPipe backend. Each worker is assigned a rank and communicates activations to the next rank. The set_device_map configuration ensures that tensors are automatically transferred between the correct GPU devices during RPC calls.

For HuggingFace models, the PiPPyHFTracer extends FX tracing to handle model-specific patterns such as conditional branches, optional arguments, and custom attention implementations that standard FX tracing cannot handle.
