
Implementation:Pytorch Serve ParallelType Config

From Leeroopedia
Field | Value
Page Type | Implementation (Pattern Doc)
Title | ParallelType Config
Implements | Principle:Pytorch_Serve_Parallelism_Strategy
Source | docs/large_model_inference.md:L27-36
Repository | TorchServe
Last Updated | 2026-02-13 00:00 GMT

Overview

The parallelType configuration field in TorchServe's model-config.yaml determines which distributed inference strategy is used when serving a large model. This single parameter controls whether TorchServe uses torchrun to spawn multiple processes (for pipeline and tensor parallelism) or assigns all GPUs to a single process (for custom/Accelerate-based parallelism). It is the primary entry point for selecting a parallelism strategy.

Description

TorchServe supports four values for the parallelType field in the model configuration YAML:

  • "pp" -- Pipeline Parallelism. Uses PiPPy to split the model into sequential stages across GPUs. TorchServe launches one process per GPU via torchrun. Suitable for models with clear sequential layer structures.
  • "tp" -- Tensor Parallelism. Uses DeepSpeed or PyTorch native tensor parallelism to shard individual layers across GPUs. TorchServe launches one process per GPU via torchrun. Suitable for transformer models on high-bandwidth interconnects.
  • "pptp" -- Combined Pipeline and Tensor Parallelism. Applies both strategies simultaneously. TorchServe uses torchrun for process management.
  • "custom" -- Custom / Single-Process. Leaves parallelization to the user or a library like HuggingFace Accelerate. GPUs are assigned to a single process without torchrun.
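The selection logic above can be sketched as a simple dispatch on the configured value. This is an illustration only, not TorchServe's actual frontend code; the function and names are hypothetical:

```python
# Hypothetical sketch of dispatching on parallelType.
# Illustrative only; not TorchServe's implementation.

TORCHRUN_TYPES = {"pp", "tp", "pptp"}
VALID_TYPES = TORCHRUN_TYPES | {"custom"}

def launch_mode(parallel_type: str) -> str:
    """Return how worker process(es) would be launched for a given parallelType."""
    if parallel_type not in VALID_TYPES:
        raise ValueError(f"unknown parallelType: {parallel_type!r}")
    # "pp", "tp", "pptp": torchrun spawns one process per GPU.
    # "custom": a single process sees all assigned GPUs.
    return "torchrun" if parallel_type in TORCHRUN_TYPES else "single-process"
```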

The number of GPUs allocated to a worker is controlled by either torchrun.nproc-per-node or parallelLevel. These two parameters are mutually exclusive and must not both be set.

GPU assignment follows a round-robin algorithm by default. With deviceIds, users can explicitly specify which GPUs a worker should use.
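The mutual-exclusivity rule can be expressed as a small validation step. The following is a sketch under the assumption that the config is already parsed into a dict; the default of 1 GPU when neither key is present is an illustrative choice, not documented TorchServe behavior:

```python
# Illustrative resolution of the per-worker GPU count from a parsed
# model-config dict. Not TorchServe's actual validation code.

def gpu_count(config: dict) -> int:
    """Resolve the number of GPUs per worker, enforcing mutual exclusivity."""
    nproc = config.get("torchrun", {}).get("nproc-per-node")
    level = config.get("parallelLevel")
    if nproc is not None and level is not None:
        # The two parameters must not both be set.
        raise ValueError(
            "set either torchrun.nproc-per-node or parallelLevel, not both")
    if nproc is None and level is None:
        return 1  # Assumed fallback for illustration.
    return nproc if nproc is not None else level
```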

Usage

Code Reference

Source Location: docs/large_model_inference.md lines 27-36

Configuration Schema (model-config.yaml):

# frontend settings
minWorkers: 1
maxWorkers: 1
parallelType: "pp"     # Options: "pp", "tp", "pptp", "custom"
deviceType: "gpu"
torchrun:
    nproc-per-node: 4  # Number of GPUs / processes per worker

Alternative GPU specification using parallelLevel and deviceIds:

minWorkers: 1
maxWorkers: 1
parallelType: "tp"
deviceType: "gpu"
parallelLevel: 4        # Number of GPUs (mutually exclusive with nproc-per-node)
deviceIds: [2, 3, 4, 5] # Explicit GPU IDs (optional)

Key Parameters

Parameter | Type | Description
parallelType | string | Parallelism strategy: "pp", "tp", "pptp", or "custom"
parallelLevel | int | Number of GPUs allocated to a worker; mutually exclusive with torchrun.nproc-per-node
torchrun.nproc-per-node | int | Number of processes torchrun starts per worker; mutually exclusive with parallelLevel
deviceIds | list[int] | Explicit GPU device IDs to assign to workers
deviceType | string | Device type, typically "gpu" for large model inference
minWorkers | int | Minimum number of workers; typically 1 for large models
maxWorkers | int | Maximum number of workers; typically 1 for large models

I/O Contract

Input: A model-config.yaml file included in the model archive (MAR) via torch-model-archiver --config-file model-config.yaml.

Output: TorchServe reads the configuration at model load time and:

  • For "pp", "tp", or "pptp": launches the worker via torchrun with the specified number of processes, setting environment variables LOCAL_RANK, WORLD_SIZE, RANK, and LOCAL_WORLD_SIZE.
  • For "custom": launches a single worker process with all assigned GPUs visible via CUDA_VISIBLE_DEVICES.
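Inside a custom handler, the torchrun-set variables can be read to bind each process to its GPU. A minimal sketch, assuming the environment-variable names listed above; the helper function itself is hypothetical:

```python
import os

def resolve_device() -> str:
    """Pick this process's CUDA device from torchrun's environment variables.

    Falls back to local rank 0 when LOCAL_RANK is absent (e.g. under
    parallelType "custom", where a single process sees all assigned GPUs).
    """
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    # Each torchrun-spawned process binds to the GPU matching its local rank.
    return f"cuda:{local_rank}"
```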

GPU Assignment Example:

Given 8 GPUs on a node, nproc-per-node=4, and minWorkers=2:

  • Worker 1 receives CUDA_VISIBLE_DEVICES="0,1,2,3"
  • Worker 2 receives CUDA_VISIBLE_DEVICES="4,5,6,7"

Given deviceIds: [2,3,4,5] and nproc-per-node=2:

  • Worker 1 receives CUDA_VISIBLE_DEVICES="2,3"
  • Worker 2 receives CUDA_VISIBLE_DEVICES="4,5"
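Both assignments above can be reproduced with a short sketch of the round-robin slicing. This is an illustration of the described behavior, not TorchServe's internal scheduler:

```python
# Illustrative round-robin GPU assignment: hand each worker a consecutive
# slice of the device pool, as described in the examples above.

def assign_gpus(num_workers: int, gpus_per_worker: int, device_ids=None):
    """Return the CUDA_VISIBLE_DEVICES string for each worker."""
    pool = device_ids if device_ids is not None else list(
        range(num_workers * gpus_per_worker))
    return [
        ",".join(str(g) for g in pool[i * gpus_per_worker:(i + 1) * gpus_per_worker])
        for i in range(num_workers)
    ]
```

For the first example, `assign_gpus(2, 4)` yields the two slices `"0,1,2,3"` and `"4,5,6,7"`; passing `device_ids=[2, 3, 4, 5]` with 2 GPUs per worker yields `"2,3"` and `"4,5"`.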

Usage Examples

Pipeline parallelism with PiPPy on 4 GPUs:

minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "pp"
deviceType: "gpu"
torchrun:
    nproc-per-node: 4
pippy:
    rpc_timeout: 1800
    model_type: "HF"
    chunks: 1
    input_names: ["input_ids"]
handler:
    model_path: "/path/to/model"
    max_length: 50

Tensor parallelism with DeepSpeed on 4 GPUs:

minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 120
parallelType: "tp"
deviceType: "gpu"
torchrun:
    nproc-per-node: 4
deepspeed:
    config: ds-config.json
handler:
    model_path: "/path/to/model"
    max_length: 80

Packaging a model with the config:

torch-model-archiver --model-name my_model \
    --version 1.0 \
    --handler custom_handler.py \
    --extra-files /path/to/checkpoints \
    -r requirements.txt \
    --config-file model-config.yaml \
    --archive-format tgz
