
Environment: OpenBMB UltraFeedback vLLM Multi GPU Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, Distributed_Training
Last Updated: 2026-02-08 06:00 GMT

Overview

Linux environment with CUDA GPUs, vLLM, NCCL, and Ray for multi-GPU tensor-parallel LLM inference.

Description

This environment extends the base Python GPU environment with vLLM for high-throughput batched inference. The `main_vllm.py` script uses `vllm.LLM` with `tensor_parallel_size=torch.cuda.device_count()` to automatically shard models across all available GPUs. It requires NCCL for inter-GPU communication and Ray for distributed orchestration. The environment sets `NCCL_IGNORE_DISABLED_P2P=1` to handle systems where peer-to-peer GPU communication is disabled, `TOKENIZERS_PARALLELISM=false` to prevent deadlocks in forked processes, and `RAY_memory_monitor_refresh_ms=0` to disable Ray memory monitoring. The `CUDA_LAUNCH_BLOCKING=1` flag is also set in `run_vllm.sh` for synchronous CUDA kernel execution (useful for debugging).
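As a minimal, GPU-free sketch of the configuration described above (the helper name is an assumption, not from the repository), the variables can be exported before vLLM or Ray is imported:

```python
import os

# Variables described above; they must be in the environment before
# vLLM (and the Ray workers it spawns) are imported.
VLLM_ENV = {
    "NCCL_IGNORE_DISABLED_P2P": "1",       # tolerate disabled GPU peer-to-peer access
    "TOKENIZERS_PARALLELISM": "false",     # avoid tokenizer deadlocks in forked workers
    "RAY_memory_monitor_refresh_ms": "0",  # turn off Ray's memory monitor
}

def apply_vllm_env(env: dict = VLLM_ENV) -> dict:
    """Export the variables, keeping any values already set by the caller."""
    for key, value in env.items():
        os.environ.setdefault(key, value)
    return {key: os.environ[key] for key in env}
```

Using `setdefault` means values exported in the shell (e.g. by `run_vllm.sh`) are not overwritten by the in-script defaults.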

Usage

Use this environment for the vLLM batched completion generation workflow. It is required when running `main_vllm.py` or `run_vllm.sh` for high-throughput inference across multiple GPUs. This is the preferred backend for generating completions at scale due to vLLM's continuous batching and PagedAttention optimizations.
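The generation step itself is a call to `LLM.generate` over a list of prompts; a small, GPU-free sketch of the batching that feeds it (the helper name and default batch size are assumptions, not from the repository):

```python
def chunk_prompts(prompts, batch_size=256):
    """Split a prompt list into fixed-size batches.

    vLLM's continuous batching makes explicit chunking optional, but
    submitting bounded batches keeps per-call memory and latency predictable.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]
```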

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | NCCL and Ray require Linux for multi-GPU |
| Hardware | Multiple NVIDIA GPUs with CUDA support | `tensor_parallel_size` uses all available GPUs via `torch.cuda.device_count()` |
| Hardware | High-speed GPU interconnect | NVLink or PCIe recommended; `NCCL_IGNORE_DISABLED_P2P=1` as fallback |
| Disk | 100GB+ SSD | Model weights plus vLLM swap space (1 GB configured) |

Dependencies

System Packages

  • `cuda-toolkit` (CUDA runtime compatible with vLLM and PyTorch)
  • NCCL library (for multi-GPU communication)
  • `git-lfs` (for downloading model weights)

Python Packages

  • `vllm` (latest, installed with -U flag)
  • `torch` (with CUDA support)
  • `transformers` (latest, installed with -U flag)
  • `tokenizers` (latest, installed with -U flag)
  • `deepspeed` (latest, installed with -U flag)
  • `accelerate` (latest, installed with -U flag)
  • `ray` (installed as vLLM dependency)
  • `datasets`
  • `pandas`
  • `numpy`
  • `tqdm`

Credentials and Environment Variables

The following environment variables are used (the first three are required; `CUDA_LAUNCH_BLOCKING` and `HF_TOKEN` are optional or situational):

  • `NCCL_IGNORE_DISABLED_P2P`: Set to `1` to handle systems with disabled peer-to-peer GPU access.
  • `TOKENIZERS_PARALLELISM`: Set to `false` to prevent tokenizer deadlocks in forked processes.
  • `RAY_memory_monitor_refresh_ms`: Set to `0` to disable Ray memory monitoring (prevents OOM kills on shared systems).
  • `CUDA_LAUNCH_BLOCKING`: Set to `1` for synchronous CUDA execution (debugging; optional for production).
  • `HF_TOKEN`: HuggingFace API token (if downloading gated models).
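A small pre-flight check over the required variables above can fail fast before any GPU work starts (a sketch; `HF_TOKEN` and `CUDA_LAUNCH_BLOCKING` are deliberately excluded as optional):

```python
import os

# The three variables the scripts require unconditionally.
REQUIRED = (
    "NCCL_IGNORE_DISABLED_P2P",
    "TOKENIZERS_PARALLELISM",
    "RAY_memory_monitor_refresh_ms",
)

def missing_env(environ=os.environ):
    """Return the required variables that are not set (empty tuple if all present)."""
    return tuple(name for name in REQUIRED if name not in environ)
```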

Quick Install

# Install dependencies (as specified in run_vllm.sh)
pip install transformers -U
pip install tokenizers -U
pip install deepspeed -U
pip install accelerate -U
pip install vllm -U

# Set required environment variables
export NCCL_IGNORE_DISABLED_P2P=1
export TOKENIZERS_PARALLELISM=false
export RAY_memory_monitor_refresh_ms=0

Code Evidence

Environment variable configuration from `main_vllm.py:20-21`:

os.environ["NCCL_IGNORE_DISABLED_P2P"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

Shell environment exports from `run_vllm.sh:2,13-14`:

export NCCL_IGNORE_DISABLED_P2P=1
export RAY_memory_monitor_refresh_ms=0
CUDA_LAUNCH_BLOCKING=1 python main_vllm_batch.py --model_type ${1}

vLLM model loading with tensor parallelism from `main_vllm.py:91-92`:

gpu_memory_utilization = 0.95
model = LLM(ckpt, gpu_memory_utilization=gpu_memory_utilization, swap_space=1, tensor_parallel_size=torch.cuda.device_count(), trust_remote_code=True, dtype=dtype)

Dependency installation from `run_vllm.sh:4-8`:

pip install transformers -U
pip install tokenizers -U
pip install deepspeed -U
pip install accelerate -U
pip install vllm -U

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `NCCL error: unhandled system error` | Peer-to-peer GPU communication disabled | Set `NCCL_IGNORE_DISABLED_P2P=1` (already configured in code) |
| `Ray out of memory` | Ray memory monitor kills processes | Set `RAY_memory_monitor_refresh_ms=0` to disable monitoring |
| `Deadlock in tokenizer` | Tokenizer parallelism in forked processes | Set `TOKENIZERS_PARALLELISM=false` (already configured in code) |
| `CUDA out of memory` | Model too large for available GPU memory | Reduce `gpu_memory_utilization` below 0.95 or add more GPUs |
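For the CUDA OOM case, one pragmatic pattern (a sketch, not from the repository) is to retry model construction at decreasing `gpu_memory_utilization` values; `build_model` below stands in for the `LLM(...)` call:

```python
def load_with_backoff(build_model, utilizations=(0.95, 0.90, 0.85)):
    """Call build_model(utilization) at decreasing fractions until one succeeds.

    CUDA OOM typically surfaces as a RuntimeError mentioning "out of memory",
    so only those are retried; any other error is re-raised immediately.
    """
    last_err = None
    for u in utilizations:
        try:
            return build_model(u)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise
            last_err = err
    raise last_err
```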

Compatibility Notes

  • Tensor Parallelism: `tensor_parallel_size` is set to `torch.cuda.device_count()`, so ALL visible GPUs are used. Ensure no other GPU processes are running, or restrict visibility with `CUDA_VISIBLE_DEVICES`.
  • dtype selection: StarChat, MPT-30B-chat, and Falcon-40B-instruct use `bfloat16` explicitly; all other models use `auto` dtype.
  • swap_space: set to 1 (GiB of CPU memory) for swapping KV-cache blocks when GPU memory pressure forces preemption.
  • run_vllm.sh references main_vllm_batch.py: the launch script invokes `main_vllm_batch.py`, but the repository contains `main_vllm.py`; the file was likely renamed.
  • CUDA_LAUNCH_BLOCKING=1: set in `run_vllm.sh` for debugging; it disables asynchronous CUDA kernel launches and may reduce throughput.
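The dtype rule above can be captured in a small helper (the model-name spellings are assumptions inferred from the list, not verified against the script):

```python
# Models the script pins to bfloat16; everything else falls through to "auto".
BF16_MODELS = {"starchat", "mpt-30b-chat", "falcon-40b-instruct"}  # spellings assumed

def pick_dtype(model_type: str) -> str:
    """Return the dtype string passed to vllm.LLM for a given model type."""
    return "bfloat16" if model_type.lower() in BF16_MODELS else "auto"
```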
