
Implementation:Pytorch Serve ParallelType Config

From Leeroopedia
Field | Value
Page Type | Implementation (Pattern Doc)
Title | ParallelType Config
Implements | Principle:Pytorch_Serve_Parallelism_Strategy
Source | docs/large_model_inference.md:L27-36
Repository | TorchServe
Last Updated | 2026-02-13 00:00 GMT

Overview

The parallelType configuration field in TorchServe's model-config.yaml determines which distributed inference strategy is used when serving a large model. This single parameter controls whether TorchServe uses torchrun to spawn multiple processes (for pipeline and tensor parallelism) or assigns all GPUs to a single process (for custom/Accelerate-based parallelism). It is the primary entry point for selecting a parallelism strategy.

Description

TorchServe supports four values for the parallelType field in the model configuration YAML:

  • "pp" -- Pipeline Parallelism. Uses PiPPy to split the model into sequential stages across GPUs. TorchServe launches one process per GPU via torchrun. Suitable for models with clear sequential layer structures.
  • "tp" -- Tensor Parallelism. Uses DeepSpeed or PyTorch native tensor parallelism to shard individual layers across GPUs. TorchServe launches one process per GPU via torchrun. Suitable for transformer models on high-bandwidth interconnects.
  • "pptp" -- Combined Pipeline and Tensor Parallelism. Applies both strategies simultaneously. TorchServe uses torchrun for process management.
  • "custom" -- Custom / Single-Process. Leaves parallelization to the user or a library like HuggingFace Accelerate. GPUs are assigned to a single process without torchrun.
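The selection logic above can be sketched as a simple dispatch on the configured value. This is an illustration only, not TorchServe's actual frontend code; the function and names are hypothetical:

```python
# Hypothetical sketch of dispatching on parallelType.
# Illustrative only; not TorchServe's implementation.

TORCHRUN_TYPES = {"pp", "tp", "pptp"}
VALID_TYPES = TORCHRUN_TYPES | {"custom"}

def launch_mode(parallel_type: str) -> str:
    """Return how worker process(es) would be launched for a given parallelType."""
    if parallel_type not in VALID_TYPES:
        raise ValueError(f"unknown parallelType: {parallel_type!r}")
    # "pp", "tp", "pptp": torchrun spawns one process per GPU.
    # "custom": a single process sees all assigned GPUs.
    return "torchrun" if parallel_type in TORCHRUN_TYPES else "single-process"
```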

The number of GPUs allocated to a worker is controlled by either torchrun.nproc-per-node or parallelLevel. These two parameters are mutually exclusive and must not both be set.

GPU assignment follows a round-robin algorithm by default. With deviceIds, users can explicitly specify which GPUs a worker should use.
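The mutual-exclusivity rule can be expressed as a small validation step. The following is a sketch under the assumption that the config is already parsed into a dict; the default of 1 GPU when neither key is present is an illustrative choice, not documented TorchServe behavior:

```python
# Illustrative resolution of the per-worker GPU count from a parsed
# model-config dict. Not TorchServe's actual validation code.

def gpu_count(config: dict) -> int:
    """Resolve the number of GPUs per worker, enforcing mutual exclusivity."""
    nproc = config.get("torchrun", {}).get("nproc-per-node")
    level = config.get("parallelLevel")
    if nproc is not None and level is not None:
        # The two parameters must not both be set.
        raise ValueError(
            "set either torchrun.nproc-per-node or parallelLevel, not both")
    if nproc is None and level is None:
        return 1  # Assumed fallback for illustration.
    return nproc if nproc is not None else level
```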

Usage

Code Reference

Source Location: docs/large_model_inference.md lines 27-36

Configuration Schema (model-config.yaml):

# frontend settings
minWorkers: 1
maxWorkers: 1
parallelType: "pp"     # Options: "pp", "tp", "pptp", "custom"
deviceType: "gpu"
torchrun:
    nproc-per-node: 4  # Number of GPUs / processes per worker

Alternative GPU specification using parallelLevel and deviceIds:

minWorkers: 1
maxWorkers: 1
parallelType: "tp"
deviceType: "gpu"
parallelLevel: 4        # Number of GPUs (mutually exclusive with nproc-per-node)
deviceIds: [2, 3, 4, 5] # Explicit GPU IDs (optional)

Key Parameters

Parameter | Type | Description
parallelType | string | Parallelism strategy: "pp", "tp", "pptp", or "custom"
parallelLevel | int | Number of GPUs allocated to a worker; mutually exclusive with torchrun.nproc-per-node
torchrun.nproc-per-node | int | Number of processes torchrun starts per worker; mutually exclusive with parallelLevel
deviceIds | list[int] | Explicit GPU device IDs to assign to workers
deviceType | string | Device type, typically "gpu" for large model inference
minWorkers | int | Minimum number of workers; typically 1 for large models
maxWorkers | int | Maximum number of workers; typically 1 for large models

I/O Contract

Input: A model-config.yaml file included in the model archive (MAR) via torch-model-archiver --config-file model-config.yaml.

Output: TorchServe reads the configuration at model load time and:

  • For "pp", "tp", or "pptp": launches the worker via torchrun with the specified number of processes, setting environment variables LOCAL_RANK, WORLD_SIZE, RANK, and LOCAL_WORLD_SIZE.
  • For "custom": launches a single worker process with all assigned GPUs visible via CUDA_VISIBLE_DEVICES.
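Inside a custom handler, the torchrun-set variables can be read to bind each process to its GPU. A minimal sketch, assuming the environment-variable names listed above; the helper function itself is hypothetical:

```python
import os

def resolve_device() -> str:
    """Pick this process's CUDA device from torchrun's environment variables.

    Falls back to local rank 0 when LOCAL_RANK is absent (e.g. under
    parallelType "custom", where a single process sees all assigned GPUs).
    """
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    # Each torchrun-spawned process binds to the GPU matching its local rank.
    return f"cuda:{local_rank}"
```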

GPU Assignment Example:

Given 8 GPUs on a node, nproc-per-node=4, and minWorkers=2:

  • Worker 1 receives CUDA_VISIBLE_DEVICES="0,1,2,3"
  • Worker 2 receives CUDA_VISIBLE_DEVICES="4,5,6,7"

Given deviceIds: [2,3,4,5] and nproc-per-node=2:

  • Worker 1 receives CUDA_VISIBLE_DEVICES="2,3"
  • Worker 2 receives CUDA_VISIBLE_DEVICES="4,5"
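Both assignments above can be reproduced with a short sketch of the round-robin slicing. This is an illustration of the described behavior, not TorchServe's internal scheduler:

```python
# Illustrative round-robin GPU assignment: hand each worker a consecutive
# slice of the device pool, as described in the examples above.

def assign_gpus(num_workers: int, gpus_per_worker: int, device_ids=None):
    """Return the CUDA_VISIBLE_DEVICES string for each worker."""
    pool = device_ids if device_ids is not None else list(
        range(num_workers * gpus_per_worker))
    return [
        ",".join(str(g) for g in pool[i * gpus_per_worker:(i + 1) * gpus_per_worker])
        for i in range(num_workers)
    ]
```

For the first example, `assign_gpus(2, 4)` yields the two slices `"0,1,2,3"` and `"4,5,6,7"`; passing `device_ids=[2, 3, 4, 5]` with 2 GPUs per worker yields `"2,3"` and `"4,5"`.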

Usage Examples

Pipeline parallelism with PiPPy on 4 GPUs:

minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "pp"
deviceType: "gpu"
torchrun:
    nproc-per-node: 4
pippy:
    rpc_timeout: 1800
    model_type: "HF"
    chunks: 1
    input_names: ["input_ids"]
handler:
    model_path: "/path/to/model"
    max_length: 50

Tensor parallelism with DeepSpeed on 4 GPUs:

minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 120
parallelType: "tp"
deviceType: "gpu"
torchrun:
    nproc-per-node: 4
deepspeed:
    config: ds-config.json
handler:
    model_path: "/path/to/model"
    max_length: 80

Packaging a model with the config:

torch-model-archiver --model-name my_model \
    --version 1.0 \
    --handler custom_handler.py \
    --extra-files /path/to/checkpoints \
    -r requirements.txt \
    --config-file model-config.yaml \
    --archive-format tgz
