
Environment: OpenBMB UltraFeedback vLLM Multi GPU Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, Distributed_Training
Last Updated: 2026-02-08 06:00 GMT

Overview

Linux environment with CUDA GPUs, vLLM, NCCL, and Ray for multi-GPU tensor-parallel LLM inference.

Description

This environment extends the base Python GPU environment with vLLM for high-throughput batched inference. The `main_vllm.py` script uses `vllm.LLM` with `tensor_parallel_size=torch.cuda.device_count()` to automatically shard models across all available GPUs. It requires NCCL for inter-GPU communication and Ray for distributed orchestration. The environment sets `NCCL_IGNORE_DISABLED_P2P=1` to handle systems where peer-to-peer GPU communication is disabled, `TOKENIZERS_PARALLELISM=false` to prevent deadlocks in forked processes, and `RAY_memory_monitor_refresh_ms=0` to disable Ray memory monitoring. The `CUDA_LAUNCH_BLOCKING=1` flag is also set in `run_vllm.sh` for synchronous CUDA kernel execution (useful for debugging).
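As a minimal, GPU-free sketch of the configuration described above (the helper name is an assumption, not from the repository), the variables can be exported before vLLM or Ray is imported:

```python
import os

# Variables described above; they must be in the environment before
# vLLM (and the Ray workers it spawns) are imported.
VLLM_ENV = {
    "NCCL_IGNORE_DISABLED_P2P": "1",       # tolerate disabled GPU peer-to-peer access
    "TOKENIZERS_PARALLELISM": "false",     # avoid tokenizer deadlocks in forked workers
    "RAY_memory_monitor_refresh_ms": "0",  # turn off Ray's memory monitor
}

def apply_vllm_env(env: dict = VLLM_ENV) -> dict:
    """Export the variables, keeping any values already set by the caller."""
    for key, value in env.items():
        os.environ.setdefault(key, value)
    return {key: os.environ[key] for key in env}
```

Using `setdefault` means values exported in the shell (e.g. by `run_vllm.sh`) are not overwritten by the in-script defaults.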

Usage

Use this environment for the vLLM batched completion generation workflow. It is required when running `main_vllm.py` or `run_vllm.sh` for high-throughput inference across multiple GPUs. This is the preferred backend for generating completions at scale due to vLLM's continuous batching and PagedAttention optimizations.
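The generation step itself is a call to `LLM.generate` over a list of prompts; a small, GPU-free sketch of the batching that feeds it (the helper name and default batch size are assumptions, not from the repository):

```python
def chunk_prompts(prompts, batch_size=256):
    """Split a prompt list into fixed-size batches.

    vLLM's continuous batching makes explicit chunking optional, but
    submitting bounded batches keeps per-call memory and latency predictable.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]
```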

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | NCCL and Ray require Linux for multi-GPU |
| Hardware | Multiple NVIDIA GPUs with CUDA support | `tensor_parallel_size` uses all available GPUs via `torch.cuda.device_count()` |
| Hardware | High-speed GPU interconnect | NVLink or PCIe recommended; `NCCL_IGNORE_DISABLED_P2P=1` as fallback |
| Disk | 100GB+ SSD | Model weights plus vLLM swap space (1 GB configured) |

Dependencies

System Packages

  • `cuda-toolkit` (CUDA runtime compatible with vLLM and PyTorch)
  • NCCL library (for multi-GPU communication)
  • `git-lfs` (for downloading model weights)

Python Packages

  • `vllm` (latest, installed with -U flag)
  • `torch` (with CUDA support)
  • `transformers` (latest, installed with -U flag)
  • `tokenizers` (latest, installed with -U flag)
  • `deepspeed` (latest, installed with -U flag)
  • `accelerate` (latest, installed with -U flag)
  • `ray` (installed as vLLM dependency)
  • `datasets`
  • `pandas`
  • `numpy`
  • `tqdm`

Credentials and Environment Variables

The following environment variables are used (the first three are required; `CUDA_LAUNCH_BLOCKING` and `HF_TOKEN` are optional or situational):

  • `NCCL_IGNORE_DISABLED_P2P`: Set to `1` to handle systems with disabled peer-to-peer GPU access.
  • `TOKENIZERS_PARALLELISM`: Set to `false` to prevent tokenizer deadlocks in forked processes.
  • `RAY_memory_monitor_refresh_ms`: Set to `0` to disable Ray memory monitoring (prevents OOM kills on shared systems).
  • `CUDA_LAUNCH_BLOCKING`: Set to `1` for synchronous CUDA execution (debugging; optional for production).
  • `HF_TOKEN`: HuggingFace API token (if downloading gated models).
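A small pre-flight check over the required variables above can fail fast before any GPU work starts (a sketch; `HF_TOKEN` and `CUDA_LAUNCH_BLOCKING` are deliberately excluded as optional):

```python
import os

# The three variables the scripts require unconditionally.
REQUIRED = (
    "NCCL_IGNORE_DISABLED_P2P",
    "TOKENIZERS_PARALLELISM",
    "RAY_memory_monitor_refresh_ms",
)

def missing_env(environ=os.environ):
    """Return the required variables that are not set (empty tuple if all present)."""
    return tuple(name for name in REQUIRED if name not in environ)
```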

Quick Install

# Install dependencies (as specified in run_vllm.sh)
pip install transformers -U
pip install tokenizers -U
pip install deepspeed -U
pip install accelerate -U
pip install vllm -U

# Set required environment variables
export NCCL_IGNORE_DISABLED_P2P=1
export TOKENIZERS_PARALLELISM=false
export RAY_memory_monitor_refresh_ms=0

Code Evidence

Environment variable configuration from `main_vllm.py:20-21`:

os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Shell environment exports from `run_vllm.sh:2,13-14`:

export NCCL_IGNORE_DISABLED_P2P=1
export RAY_memory_monitor_refresh_ms=0
CUDA_LAUNCH_BLOCKING=1 python main_vllm_batch.py --model_type ${1}

vLLM model loading with tensor parallelism from `main_vllm.py:91-92`:

gpu_memory_utilization = 0.95
model = LLM(ckpt, gpu_memory_utilization=gpu_memory_utilization, swap_space=1, tensor_parallel_size=torch.cuda.device_count(), trust_remote_code=True, dtype=dtype)

Dependency installation from `run_vllm.sh:4-8`:

pip install transformers -U
pip install tokenizers -U
pip install deepspeed -U
pip install accelerate -U
pip install vllm -U

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `NCCL error: unhandled system error` | Peer-to-peer GPU communication disabled | Set `NCCL_IGNORE_DISABLED_P2P=1` (already configured in code) |
| `Ray out of memory` | Ray memory monitor kills processes | Set `RAY_memory_monitor_refresh_ms=0` to disable monitoring |
| `Deadlock in tokenizer` | Tokenizer parallelism in forked processes | Set `TOKENIZERS_PARALLELISM=false` (already configured in code) |
| `CUDA out of memory` | Model too large for available GPU memory | Reduce `gpu_memory_utilization` below 0.95 or add more GPUs |
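For the CUDA OOM case, one pragmatic pattern (a sketch, not from the repository) is to retry model construction at decreasing `gpu_memory_utilization` values; `build_model` below stands in for the `LLM(...)` call:

```python
def load_with_backoff(build_model, utilizations=(0.95, 0.90, 0.85)):
    """Call build_model(utilization) at decreasing fractions until one succeeds.

    CUDA OOM typically surfaces as a RuntimeError mentioning "out of memory",
    so only those are retried; any other error is re-raised immediately.
    """
    last_err = None
    for u in utilizations:
        try:
            return build_model(u)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise
            last_err = err
    raise last_err
```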

Compatibility Notes

  • Tensor Parallelism: `tensor_parallel_size` is set to `torch.cuda.device_count()`, so ALL visible GPUs are used. Ensure no other GPU processes are running, or restrict visibility with `CUDA_VISIBLE_DEVICES`.
  • dtype selection: StarChat, MPT-30B-chat, and Falcon-40B-instruct use `bfloat16` explicitly; all other models use `auto` dtype.
  • swap_space: set to 1 (GiB of CPU memory) for swapping KV-cache blocks when GPU memory pressure forces preemption.
  • run_vllm.sh references main_vllm_batch.py: the launch script invokes `main_vllm_batch.py`, but the repository contains `main_vllm.py`; the file was likely renamed.
  • CUDA_LAUNCH_BLOCKING=1: set in `run_vllm.sh` for debugging; it disables asynchronous CUDA kernel launches and may reduce throughput.
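The dtype rule above can be captured in a small helper (the model-name spellings are assumptions inferred from the list, not verified against the script):

```python
# Models the script pins to bfloat16; everything else falls through to "auto".
BF16_MODELS = {"starchat", "mpt-30b-chat", "falcon-40b-instruct"}  # spellings assumed

def pick_dtype(model_type: str) -> str:
    """Return the dtype string passed to vllm.LLM for a given model type."""
    return "bfloat16" if model_type.lower() in BF16_MODELS else "auto"
```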
