Environment: Pytorch Serve Distributed Inference Environment
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Inference, Infrastructure |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Distributed inference environment with multi-GPU coordination via PyTorch distributed, PiPPy pipeline parallelism, and NCCL backend.
Description
This environment defines the prerequisites for running TorchServe in distributed mode across multiple GPUs or nodes. It encompasses PiPPy (Pipeline Parallelism), PyTorch native tensor parallelism, and DeepSpeed distributed setups. The environment relies on PyTorch's distributed process group (NCCL backend) and a set of environment variables (LOCAL_RANK, WORLD_SIZE, RANK) that TorchServe's worker processes read at startup. PiPPy additionally requires RPC-based device mapping across CUDA devices.
Usage
Use this environment when serving large models that require distributed inference across multiple GPUs. This is the prerequisite for the Large Model Inference workflow steps involving PiPPy, DeepSpeed, or native tensor parallelism.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | Multi-GPU distributed only on Linux |
| Hardware | 2+ NVIDIA GPUs | NVLink or PCIe interconnect; NVLink preferred for tensor parallelism |
| Network | Low-latency interconnect | For multi-node: InfiniBand or high-speed Ethernet |
| VRAM | Model-dependent | Total VRAM across GPUs must fit the model |
Dependencies
System Packages
- NVIDIA GPU driver >= 450
- CUDA Toolkit >= 11.0
- NCCL 2.x (for GPU-to-GPU communication)
Python Packages
- `torch` with CUDA and distributed support
- `torchpippy==0.1.1` (optional, for PiPPy pipeline parallelism; the package is imported as `pippy`)
- `deepspeed` (optional, for DeepSpeed parallelism)
- `torchserve`
Credentials
The following environment variables are required for distributed operation:
- `LOCAL_RANK`: Local rank of this process on the current node. Required by the PiPPy handler (no default); defaults to 0 under DeepSpeed.
- `WORLD_SIZE`: Total number of processes across all nodes. Required by the PiPPy handler (no default).
- `RANK`: Global rank of this process (default: 0).
- `LOCAL_WORLD_SIZE`: Number of processes on this node (default: 0).
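A minimal sketch of how a worker-side handler might collect these variables (pure Python; the helper name `read_dist_env` and its signature are illustrative, not TorchServe API):

```python
import os

def read_dist_env(require_local_rank=True):
    """Collect the distributed-launch variables described above.

    With require_local_rank=True this mirrors the PiPPy handler: a missing
    LOCAL_RANK or WORLD_SIZE raises KeyError. With False it falls back to
    the DeepSpeed-style defaults (LOCAL_RANK=0; WORLD_SIZE=1 is an assumed
    single-process fallback, not documented behavior).
    """
    env = {
        "rank": int(os.environ.get("RANK", 0)),
        "local_world_size": int(os.environ.get("LOCAL_WORLD_SIZE", 0)),
    }
    if require_local_rank:
        # KeyError here reproduces the PiPPy handler's hard requirement
        env["local_rank"] = int(os.environ["LOCAL_RANK"])
        env["world_size"] = int(os.environ["WORLD_SIZE"])
    else:
        env["local_rank"] = int(os.environ.get("LOCAL_RANK", 0))
        env["world_size"] = int(os.environ.get("WORLD_SIZE", 1))
    return env
```

In a real deployment these variables are usually set by the launcher (e.g. `torchrun`) rather than by hand.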
Quick Install
```shell
# Install PyTorch with CUDA and distributed support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For PiPPy pipeline parallelism
pip install torchpippy==0.1.1

# For DeepSpeed
pip install deepspeed

# Install TorchServe and the model archiver
pip install torchserve torch-model-archiver
```
Code Evidence
PiPPy handler reading required environment variables from `ts/torch_handler/distributed/base_pippy_handler.py:19-22`:
```python
self.local_rank = int(os.environ["LOCAL_RANK"])
self.world_size = int(os.environ["WORLD_SIZE"])
n_devs = torch.cuda.device_count()
self.device = self.local_rank % n_devs
```
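The modulo in the last line assigns each local rank to a visible GPU, wrapping around when there are more ranks than devices. A pure-Python illustration (the function name is ours, not TorchServe's):

```python
def assign_device(local_rank: int, n_devs: int) -> int:
    """Reproduce the handler's `local_rank % n_devs` device selection."""
    return local_rank % n_devs

# 8 local ranks over 4 visible GPUs: ranks 0-3 get GPUs 0-3, ranks 4-7 wrap back to 0-3
devices = [assign_device(r, 4) for r in range(8)]
# devices == [0, 1, 2, 3, 0, 1, 2, 3]
```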
PiPPy RPC device mapping from `ts/handler_utils/distributed/pt_pippy.py:29-32`:
```python
n_devs = torch.cuda.device_count()
options = rpc.TensorPipeRpcBackendOptions(
    num_worker_threads=256, _transports=["uv"]
)
for i in range(n_devs):
    options.set_device_map(f"worker{i}", {i: i})
```
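The loop above registers an identity device map for every RPC worker: tensors sent to `worker{i}` on CUDA device `i` stay on device `i`. The resulting structure can be sketched without torch (the worker naming follows the snippet; the helper is illustrative):

```python
def build_device_maps(n_devs: int) -> dict:
    """Mirror the RPC loop: worker{i} maps CUDA device i to device i."""
    return {f"worker{i}": {i: i} for i in range(n_devs)}

# build_device_maps(2) == {"worker0": {0: 0}, "worker1": {1: 1}}
```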
PiPPy availability check from `ts/handler_utils/distributed/pt_pippy.py:10-14`:
```python
pippy_installed = importlib.util.find_spec("pippy") is not None
if pippy_installed:
    from pippy import split_into_equal_size
    from pippy.PipelineStage import PipelineStage
```
NCCL initialization for tensor parallelism from `examples/large_models/tp_llama/llama_handler.py:56-57`:
```python
if not torch.distributed.is_initialized():
    torch.distributed.init_process_group("nccl")
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `KeyError: 'LOCAL_RANK'` | LOCAL_RANK not set (PiPPy requires it) | Set LOCAL_RANK via TorchServe model config `parallelLevel` or manually |
| `NCCL error: unhandled system error` | Multi-GPU communication failure | Verify NCCL installation and GPU visibility with `nvidia-smi` |
| `RuntimeError: No module named 'pippy'` | PiPPy not installed | `pip install torchpippy==0.1.1` |
| Worker processes hang on startup | Process group initialization timeout | Increase startup timeout; check all workers can reach each other |
Compatibility Notes
- PiPPy: Requires `pippy` package. Uses RPC-based device mapping for pipeline stages. Workers communicate via TensorPipe transport.
- Tensor Parallelism: PyTorch native TP (used by tp_llama example) requires NCCL backend and `torch.distributed.init_process_group("nccl")`.
- DeepSpeed: Uses its own distributed initialization. See `Pytorch_Serve_DeepSpeed_Environment` for details.
- TorchServe integration: `parallelLevel` in model-config.yaml controls the number of worker processes per model, which maps to distributed world size.
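Since `parallelLevel` drives the distributed world size, a hedged sketch of a `model-config.yaml` fragment for a 4-GPU setup (field names follow TorchServe's large-model examples; the values are placeholders for your deployment):

```yaml
# Illustrative model-config.yaml fragment for a 4-way distributed model.
minWorkers: 1
maxWorkers: 1
parallelLevel: 4      # worker processes per model; maps to the distributed world size
deviceType: "gpu"
```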