Environment: Pytorch Serve Distributed Inference Environment
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Inference, Infrastructure |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Distributed inference environment with multi-GPU coordination via PyTorch distributed, PiPPy pipeline parallelism, and NCCL backend.
Description
This environment defines the prerequisites for running TorchServe in distributed mode across multiple GPUs or nodes. It encompasses PiPPy (Pipeline Parallelism), PyTorch native tensor parallelism, and DeepSpeed distributed setups. The environment relies on PyTorch's distributed process group (NCCL backend) and a set of environment variables (LOCAL_RANK, WORLD_SIZE, RANK) that TorchServe's worker processes read at startup. PiPPy additionally requires RPC-based device mapping across CUDA devices.
Usage
Use this environment when serving large models that require distributed inference across multiple GPUs. This is the prerequisite for the Large Model Inference workflow steps involving PiPPy, DeepSpeed, or native tensor parallelism.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | Multi-GPU distributed only on Linux |
| Hardware | 2+ NVIDIA GPUs | NVLink or PCIe interconnect; NVLink preferred for tensor parallelism |
| Network | Low-latency interconnect | For multi-node: InfiniBand or high-speed Ethernet |
| VRAM | Model-dependent | Total VRAM across GPUs must fit the model |
Dependencies
System Packages
- NVIDIA GPU driver >= 450
- CUDA Toolkit >= 11.0
- NCCL 2.x (for GPU-to-GPU communication)
Python Packages
- `torch` with CUDA and distributed support
- `torchpippy==0.1.1` (optional, for PiPPy pipeline parallelism; the package is imported as `pippy`)
- `deepspeed` (optional, for DeepSpeed parallelism)
- `torchserve`
Credentials
The following environment variables are required for distributed operation:
- `LOCAL_RANK`: Local rank of this process on the current node. Required by the PiPPy handler (no default); defaults to 0 under DeepSpeed.
- `WORLD_SIZE`: Total number of processes across all nodes. Required by the PiPPy handler (no default).
- `RANK`: Global rank of this process (default: 0).
- `LOCAL_WORLD_SIZE`: Number of processes on this node (default: 0).
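A minimal sketch of how a worker-side handler might collect these variables (pure Python; the helper name `read_dist_env` and its signature are illustrative, not TorchServe API):

```python
import os

def read_dist_env(require_local_rank=True):
    """Collect the distributed-launch variables described above.

    With require_local_rank=True this mirrors the PiPPy handler: a missing
    LOCAL_RANK or WORLD_SIZE raises KeyError. With False it falls back to
    the DeepSpeed-style defaults (LOCAL_RANK=0; WORLD_SIZE=1 is an assumed
    single-process fallback, not documented behavior).
    """
    env = {
        "rank": int(os.environ.get("RANK", 0)),
        "local_world_size": int(os.environ.get("LOCAL_WORLD_SIZE", 0)),
    }
    if require_local_rank:
        # KeyError here reproduces the PiPPy handler's hard requirement
        env["local_rank"] = int(os.environ["LOCAL_RANK"])
        env["world_size"] = int(os.environ["WORLD_SIZE"])
    else:
        env["local_rank"] = int(os.environ.get("LOCAL_RANK", 0))
        env["world_size"] = int(os.environ.get("WORLD_SIZE", 1))
    return env
```

In a real deployment these variables are usually set by the launcher (e.g. `torchrun`) rather than by hand.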
Quick Install
```shell
# Install PyTorch with CUDA and distributed support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For PiPPy pipeline parallelism
pip install torchpippy==0.1.1

# For DeepSpeed
pip install deepspeed

# Install TorchServe and the model archiver
pip install torchserve torch-model-archiver
```
Code Evidence
PiPPy handler reading required environment variables from `ts/torch_handler/distributed/base_pippy_handler.py:19-22`:
```python
self.local_rank = int(os.environ["LOCAL_RANK"])
self.world_size = int(os.environ["WORLD_SIZE"])
n_devs = torch.cuda.device_count()
self.device = self.local_rank % n_devs
```
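The modulo in the last line assigns each local rank to a visible GPU, wrapping around when there are more ranks than devices. A pure-Python illustration (the function name is ours, not TorchServe's):

```python
def assign_device(local_rank: int, n_devs: int) -> int:
    """Reproduce the handler's `local_rank % n_devs` device selection."""
    return local_rank % n_devs

# 8 local ranks over 4 visible GPUs: ranks 0-3 get GPUs 0-3, ranks 4-7 wrap back to 0-3
devices = [assign_device(r, 4) for r in range(8)]
# devices == [0, 1, 2, 3, 0, 1, 2, 3]
```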
PiPPy RPC device mapping from `ts/handler_utils/distributed/pt_pippy.py:29-32`:
```python
n_devs = torch.cuda.device_count()
options = rpc.TensorPipeRpcBackendOptions(
    num_worker_threads=256, _transports=["uv"]
)
for i in range(n_devs):
    options.set_device_map(f"worker{i}", {i: i})
```
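The loop above registers an identity device map for every RPC worker: tensors sent to `worker{i}` on CUDA device `i` stay on device `i`. The resulting structure can be sketched without torch (the worker naming follows the snippet; the helper is illustrative):

```python
def build_device_maps(n_devs: int) -> dict:
    """Mirror the RPC loop: worker{i} maps CUDA device i to device i."""
    return {f"worker{i}": {i: i} for i in range(n_devs)}

# build_device_maps(2) == {"worker0": {0: 0}, "worker1": {1: 1}}
```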
PiPPy availability check from `ts/handler_utils/distributed/pt_pippy.py:10-14`:
```python
pippy_installed = importlib.util.find_spec("pippy") is not None
if pippy_installed:
    from pippy import split_into_equal_size
    from pippy.PipelineStage import PipelineStage
```
NCCL initialization for tensor parallelism from `examples/large_models/tp_llama/llama_handler.py:56-57`:
```python
if not torch.distributed.is_initialized():
    torch.distributed.init_process_group("nccl")
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `KeyError: 'LOCAL_RANK'` | LOCAL_RANK not set (PiPPy requires it) | Set LOCAL_RANK via TorchServe model config `parallelLevel` or manually |
| `NCCL error: unhandled system error` | Multi-GPU communication failure | Verify NCCL installation and GPU visibility with `nvidia-smi` |
| `RuntimeError: No module named 'pippy'` | PiPPy not installed | `pip install torchpippy==0.1.1` |
| Worker processes hang on startup | Process group initialization timeout | Increase startup timeout; check all workers can reach each other |
Compatibility Notes
- PiPPy: Requires `pippy` package. Uses RPC-based device mapping for pipeline stages. Workers communicate via TensorPipe transport.
- Tensor Parallelism: PyTorch native TP (used by tp_llama example) requires NCCL backend and `torch.distributed.init_process_group("nccl")`.
- DeepSpeed: Uses its own distributed initialization. See `Pytorch_Serve_DeepSpeed_Environment` for details.
- TorchServe integration: `parallelLevel` in model-config.yaml controls the number of worker processes per model, which maps to distributed world size.
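Since `parallelLevel` drives the distributed world size, a hedged sketch of a `model-config.yaml` fragment for a 4-GPU setup (field names follow TorchServe's large-model examples; the values are placeholders for your deployment):

```yaml
# Illustrative model-config.yaml fragment for a 4-way distributed model.
minWorkers: 1
maxWorkers: 1
parallelLevel: 4      # worker processes per model; maps to the distributed world size
deviceType: "gpu"
```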