Environment:Vllm project Vllm Distributed

From Leeroopedia


Knowledge Sources
Domains Distributed_Computing, Pipeline_Parallelism
Last Updated 2026-02-08 00:00 GMT

Overview

Distributed computing environment for vLLM's multi-GPU and multi-node inference, providing the communication backends, process coordination, and parallelism strategies (tensor parallelism, pipeline parallelism) required to serve models that exceed single-GPU memory capacity.

Description

This environment defines the distributed computing infrastructure that lets vLLM partition and serve large language models across multiple GPUs and multiple nodes. vLLM supports two primary parallelism strategies. Tensor parallelism (TP) splits the weights of individual layers across GPUs and uses all-reduce communication to combine each layer's partial outputs. Pipeline parallelism (PP) assigns different layers to different GPUs and passes activations from one stage to the next. The sequence abstraction is central to distributed execution: it tracks which portions of a generation request are being processed on which pipeline stage and manages the handoff of intermediate activations between stages. NCCL (on NVIDIA GPUs) or RCCL (on AMD GPUs) serves as the collective communication backend, providing optimized implementations of all-reduce, all-gather, and broadcast. In multi-node deployments, inter-node traffic runs over high-speed interconnects (InfiniBand, RoCE), while intra-node GPU-to-GPU transfers use NVLink (NVIDIA) or xGMI (AMD).
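The difference between the two strategies can be sketched in a few lines of plain Python. This is a toy illustration, not vLLM code: lists stand in for GPU tensors, a Python `sum` stands in for the NCCL/RCCL all-reduce, and the helper names (`row_parallel_matvec`, `assign_layers_to_stages`) are hypothetical. The even, contiguous layer split is an assumption made for the example; vLLM's actual layer partitioning may differ.

```python
# Toy illustration only: plain Python lists stand in for GPU tensors,
# and a Python sum stands in for the NCCL/RCCL all-reduce.

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def row_parallel_matvec(weight, x, tp_size):
    """Tensor parallelism: shard the input dimension, then all-reduce."""
    in_dim = len(x)
    if in_dim % tp_size != 0:
        raise ValueError("input dimension must divide evenly across ranks")
    shard = in_dim // tp_size
    partials = []
    for rank in range(tp_size):
        lo, hi = rank * shard, (rank + 1) * shard
        w_shard = [row[lo:hi] for row in weight]   # this rank's weight columns
        x_shard = x[lo:hi]                         # this rank's input slice
        partials.append(matvec(w_shard, x_shard))  # full-length partial output
    # Simulated all-reduce (sum): every rank would end up with this result.
    return [sum(vals) for vals in zip(*partials)]


def assign_layers_to_stages(num_layers, pp_size):
    """Pipeline parallelism: give each stage a contiguous block of layers."""
    base, extra = divmod(num_layers, pp_size)
    stages, start = [], 0
    for stage in range(pp_size):
        count = base + (1 if stage < extra else 0)
        stages.append(list(range(start, start + count)))
        start += count
    return stages


weight = [[1, 2, 3, 4], [5, 6, 7, 8]]
x = [1, 0, 2, 1]
# The sharded computation matches the single-device result.
assert row_parallel_matvec(weight, x, tp_size=2) == matvec(weight, x)
print(assign_layers_to_stages(num_layers=10, pp_size=4))
```

In the tensor-parallel case every rank holds part of every layer and must communicate inside each layer. In the pipeline-parallel case each rank holds whole layers and only communicates at stage boundaries.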

Usage

Multi-GPU inference is enabled by passing --tensor-parallel-size N (for TP), --pipeline-parallel-size N (for PP), or both when launching vLLM. When both are set, the deployment uses tensor-parallel-size × pipeline-parallel-size GPUs in total. For multi-node deployments, Ray handles process orchestration; run ray start on each node to form the cluster. The CUDA_VISIBLE_DEVICES and LOCAL_RANK environment variables control which GPU each process is assigned. The VLLM_HOST_IP and VLLM_PORT environment variables configure the endpoints used for distributed communication.
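As a rough sketch of how these settings fit together, the snippet below assembles a launch command and environment for a single node. The helper names (`build_serve_command`, `required_gpus`, `build_env`), the model name, the IP address, and the port are all placeholders chosen for illustration; only the two flags and the environment variable names come from the text above. It builds the command without running it, so it can be inspected on a machine without GPUs.

```python
import os


def required_gpus(tp_size, pp_size):
    """One GPU per (tensor-parallel rank, pipeline stage) pair."""
    return tp_size * pp_size


def build_serve_command(model, tp_size=1, pp_size=1):
    """Assemble a `vllm serve` command line as an argument list."""
    if tp_size < 1 or pp_size < 1:
        raise ValueError("parallel sizes must be >= 1")
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        "--pipeline-parallel-size", str(pp_size),
    ]


def build_env(host_ip, port, visible_devices, base_env=None):
    """Copy an environment and set the variables named in this section."""
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_HOST_IP"] = host_ip
    env["VLLM_PORT"] = str(port)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in visible_devices)
    return env


# Placeholder values: 2-way TP x 2-way PP on one node with four GPUs.
tp, pp = 2, 2
cmd = build_serve_command("my-org/my-model", tp_size=tp, pp_size=pp)
env = build_env("10.0.0.1", 29500, range(required_gpus(tp, pp)), base_env={})
print(" ".join(cmd))
print(env)
```

The resulting `cmd` and `env` could be handed to `subprocess.run(cmd, env=env)` on a machine where vLLM is installed; on a multi-node cluster the Ray setup described above would still be needed first.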

Requirements

| Requirement | Value |
|---|---|
| Communication Backend (NVIDIA) | NCCL >= 2.18 |
| Communication Backend (AMD) | RCCL (ROCm-compatible) |
| MPI (optional) | OpenMPI or MPICH for multi-node launch orchestration |
| Multiple GPUs | 2+ GPUs for tensor parallelism, 2+ GPUs for pipeline parallelism |
| Interconnect (intra-node) | NVLink (NVIDIA) or xGMI (AMD) recommended |
| Interconnect (multi-node) | InfiniBand or RoCE for low-latency inter-node communication |
| Ray (optional) | ray[cgraph] >= 2.48.0 for multi-node orchestration |
| Python | >= 3.10 |
| PyTorch Distributed | torch.distributed with NCCL/RCCL/Gloo backend |
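Some of the version requirements above can be checked from Python before launching. The following is a minimal sketch, not an official vLLM diagnostic: `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helpers, and the NCCL check assumes a PyTorch build where `torch.cuda.nccl.version()` returns a tuple of integers. Any check that cannot be performed is reported as `None` rather than as a failure.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn '2.18.3' or '2.48.0rc1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Compare a version string against a minimum given as a tuple."""
    return parse_version(found) >= tuple(minimum)


def check_environment():
    """Return {requirement: True / False / None (could not be checked)}."""
    results = {"python>=3.10": sys.version_info[:2] >= (3, 10)}

    # Ray is optional; look up the installed package version if present.
    try:
        results["ray>=2.48.0"] = meets_minimum(metadata.version("ray"), (2, 48, 0))
    except metadata.PackageNotFoundError:
        results["ray>=2.48.0"] = None

    # NCCL version as reported by PyTorch (assumes a tuple return value).
    try:
        import torch
        nccl = torch.cuda.nccl.version()
        results["nccl>=2.18"] = tuple(nccl[:2]) >= (2, 18)
    except Exception:
        results["nccl>=2.18"] = None

    return results


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

Hardware requirements such as GPU count and interconnect type are not covered here; they are usually easier to confirm with vendor tools than from Python.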
