Implementation:Pytorch Serve ParallelType Config
| Field | Value |
|---|---|
| Page Type | Implementation (Pattern Doc) |
| Title | ParallelType Config |
| Implements | Principle:Pytorch_Serve_Parallelism_Strategy |
| Source | docs/large_model_inference.md:L27-36 |
| Repository | TorchServe |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
The parallelType configuration field in TorchServe's model-config.yaml determines which distributed inference strategy is used when serving a large model. This single parameter controls whether TorchServe uses torchrun to spawn multiple processes (for pipeline and tensor parallelism) or assigns all GPUs to a single process (for custom/Accelerate-based parallelism). It is the primary entry point for selecting a parallelism strategy.
Description
TorchServe supports four values for the parallelType field in the model configuration YAML:
"pp"-- Pipeline Parallelism. Uses PiPPy to split the model into sequential stages across GPUs. TorchServe launches one process per GPU viatorchrun. Suitable for models with clear sequential layer structures."tp"-- Tensor Parallelism. Uses DeepSpeed or PyTorch native tensor parallelism to shard individual layers across GPUs. TorchServe launches one process per GPU viatorchrun. Suitable for transformer models on high-bandwidth interconnects."pptp"-- Combined Pipeline and Tensor Parallelism. Applies both strategies simultaneously. TorchServe usestorchrunfor process management."custom"-- Custom / Single-Process. Leaves parallelization to the user or a library like HuggingFace Accelerate. GPUs are assigned to a single process withouttorchrun.
The number of GPUs allocated to a worker is controlled by either `torchrun.nproc-per-node` or `parallelLevel`. These two parameters are mutually exclusive and must not both be set.
GPU assignment follows a round-robin algorithm by default. With `deviceIds`, users can explicitly specify which GPUs a worker should use.
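The mutual-exclusivity rule can be sketched as a validation helper. The function `gpus_per_worker` is hypothetical; the field names mirror model-config.yaml, but this is not TorchServe code:

```python
# Hypothetical validation sketch for the mutual-exclusivity rule above;
# field names follow model-config.yaml, but the helper is illustrative.

def gpus_per_worker(config: dict) -> int:
    """Resolve the number of GPUs per worker from a model-config dict."""
    nproc = (config.get("torchrun") or {}).get("nproc-per-node")
    level = config.get("parallelLevel")
    if nproc is not None and level is not None:
        raise ValueError(
            "torchrun.nproc-per-node and parallelLevel are mutually exclusive"
        )
    if nproc is None and level is None:
        raise ValueError(
            "one of torchrun.nproc-per-node or parallelLevel must be set"
        )
    return nproc if nproc is not None else level
```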
Usage
Code Reference
Source Location: docs/large_model_inference.md lines 27-36
Configuration Schema (model-config.yaml):
# frontend settings
minWorkers: 1
maxWorkers: 1
parallelType: "pp" # Options: "pp", "tp", "pptp", "custom"
deviceType: "gpu"
torchrun:
  nproc-per-node: 4 # Number of GPUs / processes per worker
Alternative GPU specification using `parallelLevel` and `deviceIds`:
minWorkers: 1
maxWorkers: 1
parallelType: "tp"
deviceType: "gpu"
parallelLevel: 4 # Number of GPUs (mutually exclusive with nproc-per-node)
deviceIds: [2, 3, 4, 5] # Explicit GPU IDs (optional)
Key Parameters
| Parameter | Type | Description |
|---|---|---|
| `parallelType` | string | Parallelism strategy: `"pp"`, `"tp"`, `"pptp"`, or `"custom"` |
| `parallelLevel` | int | Number of GPUs allocated to a worker. Mutually exclusive with `nproc-per-node`. |
| `torchrun.nproc-per-node` | int | Number of processes torchrun starts per worker. Mutually exclusive with `parallelLevel`. |
| `deviceIds` | list[int] | Explicit GPU device IDs to assign to workers. |
| `deviceType` | string | Device type; typically `"gpu"` for large model inference. |
| `minWorkers` | int | Minimum number of workers. Typically 1 for large models. |
| `maxWorkers` | int | Maximum number of workers. Typically 1 for large models. |
I/O Contract
Input: A `model-config.yaml` file included in the model archive (MAR) via `torch-model-archiver --config-file model-config.yaml`.
Output: TorchServe reads the configuration at model load time and:
- For `"pp"`, `"tp"`, or `"pptp"`: launches the worker via `torchrun` with the specified number of processes, setting the environment variables `LOCAL_RANK`, `WORLD_SIZE`, `RANK`, and `LOCAL_WORLD_SIZE`.
- For `"custom"`: launches a single worker process with all assigned GPUs visible via `CUDA_VISIBLE_DEVICES`.
GPU Assignment Example:
Given 8 GPUs on a node, `nproc-per-node=4`, and `minWorkers=2`:
- Worker 1 receives `CUDA_VISIBLE_DEVICES="0,1,2,3"`
- Worker 2 receives `CUDA_VISIBLE_DEVICES="4,5,6,7"`
Given `deviceIds: [2,3,4,5]` and `nproc-per-node=2`:
- Worker 1 receives `CUDA_VISIBLE_DEVICES="2,3"`
- Worker 2 receives `CUDA_VISIBLE_DEVICES="4,5"`
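The round-robin assignment shown in both examples can be sketched as a slicing function. This is an illustrative model of the behavior described above, not TorchServe's actual scheduler; the function name `assign_devices` is hypothetical:

```python
# Sketch of round-robin GPU assignment as described above (illustrative,
# not TorchServe's scheduler). Each worker gets the next contiguous slice
# of the device pool.

def assign_devices(device_ids, gpus_per_worker, num_workers):
    """Return the CUDA_VISIBLE_DEVICES string for each worker."""
    assignments = []
    for w in range(num_workers):
        start = w * gpus_per_worker
        chunk = device_ids[start:start + gpus_per_worker]
        assignments.append(",".join(str(d) for d in chunk))
    return assignments

# Reproduces the two examples above:
assign_devices(list(range(8)), 4, 2)   # -> ["0,1,2,3", "4,5,6,7"]
assign_devices([2, 3, 4, 5], 2, 2)     # -> ["2,3", "4,5"]
```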
Usage Examples
Pipeline parallelism with PiPPy on 4 GPUs:
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "pp"
deviceType: "gpu"
torchrun:
  nproc-per-node: 4
pippy:
  rpc_timeout: 1800
  model_type: "HF"
  chunks: 1
  input_names: ["input_ids"]
handler:
  model_path: "/path/to/model"
  max_length: 50
Tensor parallelism with DeepSpeed on 4 GPUs:
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 120
parallelType: "tp"
deviceType: "gpu"
torchrun:
  nproc-per-node: 4
deepspeed:
  config: ds-config.json
handler:
  model_path: "/path/to/model"
  max_length: 80
Packaging a model with the config:
torch-model-archiver --model-name my_model \
    --version 1.0 \
    --handler custom_handler.py \
    --extra-files /path/to/checkpoints \
    -r requirements.txt \
    --config-file model-config.yaml \
    --archive-format tgz
Related Pages
- Principle:Pytorch_Serve_Parallelism_Strategy - Theory behind choosing a parallelism strategy
- Pytorch_Serve_BasePippyHandler - Handler for pipeline parallelism
- Pytorch_Serve_BaseDeepSpeedHandler - Handler for DeepSpeed tensor parallelism
- Pytorch_Serve_Accelerate_Handler - Handler for HuggingFace Accelerate
- Pytorch_Serve_Parallelism_Model_Config - Detailed model-config.yaml examples
- Pytorch_Serve_TorchModelServiceWorker - Worker process management for distributed inference
- Environment:Pytorch_Serve_Distributed_Training_Environment - Distributed env vars for parallelism
- Configuration
- Distributed_Computing