Principle:Pytorch Serve Parallelism Strategy

From Leeroopedia
Page Type: Principle
Title: Parallelism Strategy Selection
Domains: Distributed_Computing, Model_Serving
Knowledge Sources: TorchServe
Last Updated: 2026-02-13 00:00 GMT

Overview

This principle covers choosing the right distributed inference strategy for large model serving in TorchServe. When a model is too large to fit on a single GPU, it must be partitioned across multiple GPUs using one of several parallelism strategies. The choice of strategy depends on the model architecture, the available GPU count, memory constraints, and latency requirements. TorchServe supports pipeline parallelism (PiPPy), tensor parallelism (DeepSpeed, PyTorch native TP), combined pipeline and tensor parallelism, and automatic device mapping (HuggingFace Accelerate).

Description

Large model inference requires splitting a model across multiple GPUs because the model parameters exceed the memory capacity of a single device. TorchServe provides a flexible configuration-driven approach to selecting the parallelism strategy through the parallelType field in the model configuration YAML.
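As an illustrative sketch (field values here are assumptions, not recommendations; only parallelType is the strategy selector discussed in this section), a minimal model-config.yaml might look like:

```yaml
# model-config.yaml -- illustrative sketch
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 1200
parallelType: "tp"   # one of "pp", "tp", "pptp", or "custom"
```

The remaining worker settings (minWorkers, maxWorkers, and so on) are ordinary TorchServe model configuration and are independent of the parallelism strategy.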

The four primary strategies are:

Pipeline Parallelism (PP): The model is split into sequential stages, with each GPU responsible for one stage. Activations flow from one GPU to the next in a pipeline fashion. This approach works well for models with clear sequential layer structures and is implemented via PiPPy. TorchServe uses torchrun to spawn one process per GPU, and RPC is used for inter-process communication.

Tensor Parallelism (TP): Individual layers are sharded across multiple GPUs, with each GPU holding a fraction of the weight matrices. All GPUs compute in parallel on the same layer, then synchronize results. This is implemented via DeepSpeed inference or PyTorch native tensor parallelism. TorchServe uses torchrun to manage the distributed processes.

Combined Pipeline and Tensor Parallelism (PPTP): A hybrid approach that applies both pipeline and tensor parallelism. This is suitable for extremely large models where neither strategy alone provides sufficient distribution. TorchServe also uses torchrun for this mode.

Custom / Automatic Device Mapping: A single-process approach where a library such as HuggingFace Accelerate automatically distributes model layers across available GPUs (and optionally CPU or disk) based on memory constraints. This does not require torchrun and operates within a single worker process.
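For the single-process case, a hedged configuration sketch might look like the following (the handler code itself is not shown; the comment describes the common HuggingFace Accelerate pattern):

```yaml
# Illustrative: single-process automatic device mapping.
parallelType: "custom"   # or omit the field entirely
# The custom handler (not shown) loads the model itself, e.g. via
# HuggingFace Accelerate:
#   AutoModelForCausalLM.from_pretrained(..., device_map="auto")
# Accelerate then spreads layers across GPUs (and optionally CPU/disk)
# based on available memory; torchrun is not involved.
```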

The key trade-offs involve:

  • Pipeline parallelism introduces pipeline latency (sequential stage execution) but has low communication overhead between stages.
  • Tensor parallelism requires all-reduce synchronization at every layer boundary but allows all GPUs to work concurrently on each token.
  • Automatic device mapping is the simplest to configure but may not achieve optimal throughput since it runs in a single process.

Usage

Select the appropriate parallelism strategy based on these considerations:

  1. Model fits on a single GPU: no parallelism is needed; use a standard single-GPU handler.
  2. Model exceeds single GPU memory, clear layer structure: Use pipeline parallelism (parallelType: "pp") with PiPPy.
  3. Model exceeds single GPU memory, transformer architecture: Use tensor parallelism (parallelType: "tp") with DeepSpeed or PyTorch TP.
  4. Extremely large model requiring both strategies: Use combined parallelism (parallelType: "pptp").
  5. Simplest setup with HuggingFace models: Use custom/Accelerate (parallelType: "custom" or omit) with device_map="auto".

The parallelType setting in model-config.yaml determines whether TorchServe uses torchrun (for "pp", "tp", "pptp") or a single process (for "custom"). The number of GPUs is controlled by torchrun.nproc-per-node or parallelLevel, but these two parameters must not be set simultaneously.
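A sketch of a tensor-parallel configuration across four GPUs, illustrating the mutual exclusivity noted above (values are assumptions for illustration):

```yaml
# Illustrative: tensor parallelism across 4 GPUs via torchrun.
parallelType: "tp"
torchrun:
  nproc-per-node: 4   # torchrun spawns one process per GPU
# parallelLevel: 4    # do NOT set this together with nproc-per-node
```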

Theoretical Basis

Distributed model inference is grounded in the partitioning of neural network computation across multiple processing units. The two fundamental dimensions of model parallelism are:

Data parallelism replicates the full model on each device and partitions the input batch. This does not help when a single model copy exceeds GPU memory.

Model parallelism partitions the model itself. Pipeline parallelism partitions along the depth (layer) dimension, while tensor parallelism partitions along the width (parameter) dimension of individual layers.

Pipeline parallelism introduces pipeline bubbles -- idle time at the start and end of processing when not all stages are active. Microbatching (splitting batches into smaller chunks) helps mitigate this by keeping more stages active simultaneously.
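Following the standard GPipe-style analysis, with p pipeline stages and m microbatches, the fraction of device time lost to the bubble can be written as:

```latex
\text{bubble fraction} = \frac{p - 1}{m + p - 1}
```

For example, with p = 4 stages and m = 16 microbatches the bubble fraction is 3/19, roughly 16%; increasing m relative to p drives it toward zero, which is why microbatching helps.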

Tensor parallelism requires collective communication (typically all-reduce) at each layer boundary. The communication cost scales with the hidden dimension size and the number of GPUs, making it most efficient on high-bandwidth interconnects (e.g., NVLink).
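As a rough model of that cost, under a common ring all-reduce implementation each of the n GPUs transmits approximately the following per collective, where S is the payload size (symbols b, s, and h below are the batch size, sequence length, and hidden dimension, used here for illustration):

```latex
T_{\text{comm}} \approx 2\,\frac{n-1}{n}\,S,
\qquad S = b \cdot s \cdot h \cdot (\text{bytes per element})
```

Since S grows linearly with the hidden dimension and the collective runs at every layer boundary, interconnect bandwidth quickly becomes the limiting factor on slower links.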

The choice between strategies depends on the model architecture (some models shard more naturally along one dimension), the GPU interconnect bandwidth, and the inference latency requirements.
