Environment:Pytorch Serve Distributed Training Environment

From Leeroopedia
Knowledge Sources
Domains Distributed_Inference, Infrastructure
Last Updated 2026-02-13 00:00 GMT

Overview

Distributed inference environment with multi-GPU coordination via PyTorch distributed, PiPPy pipeline parallelism, and NCCL backend.

Description

This environment defines the prerequisites for running TorchServe in distributed mode across multiple GPUs or nodes. It encompasses PiPPy (Pipeline Parallelism), PyTorch native tensor parallelism, and DeepSpeed distributed setups. The environment relies on PyTorch's distributed process group (NCCL backend) and a set of environment variables (LOCAL_RANK, WORLD_SIZE, RANK) that TorchServe's worker processes read at startup. PiPPy additionally requires RPC-based device mapping across CUDA devices.

Usage

Use this environment when serving large models that require distributed inference across multiple GPUs. This is the prerequisite for the Large Model Inference workflow steps involving PiPPy, DeepSpeed, or native tensor parallelism.

System Requirements

| Category | Requirement              | Notes                                                                |
|----------|--------------------------|----------------------------------------------------------------------|
| OS       | Linux (Ubuntu 20.04+)    | Multi-GPU distributed mode is supported only on Linux                |
| Hardware | 2+ NVIDIA GPUs           | NVLink or PCIe interconnect; NVLink preferred for tensor parallelism |
| Network  | Low-latency interconnect | For multi-node: InfiniBand or high-speed Ethernet                    |
| VRAM     | Model-dependent          | Total VRAM across GPUs must fit the model                            |
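The VRAM requirement can be estimated with a simple rule of thumb: parameter count × bytes per parameter, plus headroom for activations and KV caches. A minimal sketch (the helper name and the 1.2× overhead factor are assumptions for illustration, not from this page):

```python
def fits_in_vram(n_params: float, bytes_per_param: int,
                 n_gpus: int, vram_per_gpu_gib: float,
                 overhead: float = 1.2) -> bool:
    """Rough check that a model's weights fit in aggregate GPU memory.

    overhead pads for activations, KV cache, and fragmentation; tune per model.
    """
    needed_bytes = n_params * bytes_per_param * overhead
    available_bytes = n_gpus * vram_per_gpu_gib * 1024**3
    return needed_bytes <= available_bytes

# A 7B-parameter model in fp16 (2 bytes/param) on 2x 24 GiB GPUs:
print(fits_in_vram(7e9, 2, 2, 24))   # True: ~16.8 GB needed vs ~51.5 GB available
print(fits_in_vram(70e9, 2, 2, 24))  # False: ~168 GB needed
```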

Dependencies

System Packages

  • NVIDIA GPU driver >= 450
  • CUDA Toolkit >= 11.0
  • NCCL 2.x (for GPU-to-GPU communication)

Python Packages

  • `torch` with CUDA and distributed support
  • `torchpippy` == 0.1.1 (optional; provides the `pippy` module used for pipeline parallelism)
  • `deepspeed` (optional, for DeepSpeed parallelism)
  • `torchserve`

Credentials

The following environment variables are required for distributed operation:

  • `LOCAL_RANK`: Local rank of this process on the current node. Required by the PiPPy handler (no default); DeepSpeed handlers default it to 0.
  • `WORLD_SIZE`: Total number of processes across all nodes. Required by PiPPy handler (no default).
  • `RANK`: Global rank of this process (default: 0).
  • `LOCAL_WORLD_SIZE`: Number of processes per node (default: 0).
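The variables above can be read the way the handlers do: required lookups on the PiPPy path (which raise `KeyError` when unset), defaulted lookups otherwise. A minimal sketch (the function name and `require_local_rank` flag are illustrative, not TorchServe API):

```python
import os

def read_distributed_env(require_local_rank: bool = True) -> dict:
    """Read the distributed environment variables described above."""
    env = {}
    if require_local_rank:
        # PiPPy path: no defaults, raises KeyError if unset.
        env["local_rank"] = int(os.environ["LOCAL_RANK"])
        env["world_size"] = int(os.environ["WORLD_SIZE"])
    else:
        env["local_rank"] = int(os.environ.get("LOCAL_RANK", 0))
        env["world_size"] = int(os.environ.get("WORLD_SIZE", 1))
    # Documented defaults for the remaining variables.
    env["rank"] = int(os.environ.get("RANK", 0))
    env["local_world_size"] = int(os.environ.get("LOCAL_WORLD_SIZE", 0))
    return env
```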

Quick Install

# Install PyTorch with CUDA and distributed support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For PiPPy pipeline parallelism
pip install torchpippy==0.1.1

# For DeepSpeed
pip install deepspeed

# Install TorchServe
pip install torchserve torch-model-archiver

Code Evidence

PiPPy handler reading required environment variables from `ts/torch_handler/distributed/base_pippy_handler.py:19-22`:

self.local_rank = int(os.environ["LOCAL_RANK"])
self.world_size = int(os.environ["WORLD_SIZE"])
n_devs = torch.cuda.device_count()
self.device = self.local_rank % n_devs

PiPPy RPC device mapping from `ts/handler_utils/distributed/pt_pippy.py:29-32`:

n_devs = torch.cuda.device_count()
options = rpc.TensorPipeRpcBackendOptions(
    num_worker_threads=256, _transports=["uv"]
)
for i in range(n_devs):
    options.set_device_map(f"worker{i}", {i: i})
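The loop above gives each RPC worker a one-to-one CUDA device map: worker i owns device i. Reduced to plain Python, the mapping it builds looks like this (the helper name is illustrative):

```python
def build_device_map(n_devs: int) -> dict:
    # Mirrors options.set_device_map(f"worker{i}", {i: i}) for each device:
    # tensors sent from local device i land on device i of worker i.
    return {f"worker{i}": {i: i} for i in range(n_devs)}

print(build_device_map(2))  # {'worker0': {0: 0}, 'worker1': {1: 1}}
```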

PiPPy availability check from `ts/handler_utils/distributed/pt_pippy.py:10-14`:

pippy_installed = importlib.util.find_spec("pippy") is not None
if pippy_installed:
    from pippy import split_into_equal_size
    from pippy.PipelineStage import PipelineStage

NCCL initialization for tensor parallelism from `examples/large_models/tp_llama/llama_handler.py:56-57`:

if not torch.distributed.is_initialized():
    torch.distributed.init_process_group("nccl")

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `KeyError: 'LOCAL_RANK'` | LOCAL_RANK not set (PiPPy requires it) | Set LOCAL_RANK via the TorchServe model config `parallelLevel` or manually |
| `NCCL error: unhandled system error` | Multi-GPU communication failure | Verify the NCCL installation and GPU visibility with `nvidia-smi` |
| `RuntimeError: No module named 'pippy'` | PiPPy not installed | `pip install torchpippy==0.1.1` |
| Worker processes hang on startup | Process group initialization timeout | Increase the startup timeout; check that all workers can reach each other |

Compatibility Notes

  • PiPPy: Requires `pippy` package. Uses RPC-based device mapping for pipeline stages. Workers communicate via TensorPipe transport.
  • Tensor Parallelism: PyTorch native TP (used by tp_llama example) requires NCCL backend and `torch.distributed.init_process_group("nccl")`.
  • DeepSpeed: Uses its own distributed initialization. See `Pytorch_Serve_DeepSpeed_Environment` for details.
  • TorchServe integration: `parallelLevel` in model-config.yaml controls the number of worker processes per model, which maps to distributed world size.
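As a sketch, a model-config.yaml requesting four distributed worker processes might look like the following. `parallelLevel` is the key documented here; the other fields are typical TorchServe frontend settings assumed for illustration and should be checked against your TorchServe version:

```yaml
minWorkers: 1
maxWorkers: 1
responseTimeout: 120
deviceType: "gpu"
parallelLevel: 4   # worker processes per model -> distributed world size
```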
