
Environment:Vllm project Vllm Environment Variables

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Systems Engineering, GPU Computing
Last Updated 2026-02-08 13:00 GMT

Overview

Key environment variables used by vLLM for installation, runtime configuration, performance tuning, and debugging.

Description

vLLM reads a large set of environment variables defined in vllm/envs.py to control behaviour at both installation time and runtime. These variables govern target device selection, distributed execution, API security, kernel selection, logging verbosity, cache paths, and platform-specific tuning (e.g., ROCm). Understanding these variables is essential for deploying vLLM in production, debugging performance problems, and hardening multi-tenant serving environments.

Usage

Set environment variables before launching vLLM. Installation-time variables must be present when running pip install or building from source. Runtime variables must be exported in the shell or container environment before the Python process starts.

# Installation example
VLLM_TARGET_DEVICE=cuda MAX_JOBS=8 pip install vllm

# Runtime example
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_API_KEY="sk-my-secret-key"
export VLLM_LOGGING_LEVEL=DEBUG
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct

System Requirements

N/A -- this page documents environment variables, not a standalone system. See the related Implementation pages for hardware and software prerequisites.

Dependencies

  • Python 3.9+ -- required to run vLLM
  • CUDA Toolkit / ROCm -- the specific toolkit must match VLLM_TARGET_DEVICE
  • NCCL -- used for distributed communication; version constrained by VLLM_NCCL_SO_PATH
  • huggingface_hub -- reads HF_TOKEN for gated model access

Installation-Time Variables

  • VLLM_TARGET_DEVICE -- valid: "cuda", "rocm", "cpu"; default: "cuda". Selects the hardware backend for compilation.
  • VLLM_MAIN_CUDA_VERSION -- valid: CUDA version string; default: "12.9". Main CUDA version used when building wheels.
  • MAX_JOBS -- valid: positive integer; default: system-dependent. Maximum number of parallel compilation jobs.
  • NVCC_THREADS -- valid: positive integer; default: system-dependent. Number of threads used by the NVIDIA CUDA compiler.
  • VLLM_USE_PRECOMPILED -- valid: any truthy value; default: unset. When set, uses precompiled binary extensions instead of building from source.
  • CMAKE_BUILD_TYPE -- valid: "Debug", "Release", "RelWithDebInfo"; default: "Release". CMake build configuration type.

Runtime Variables -- Core

  • CUDA_VISIBLE_DEVICES -- valid: comma-separated GPU indices; default: all GPUs visible. Restricts which GPUs are visible to the process.
  • LOCAL_RANK -- valid: non-negative integer; default: 0. Local process rank for distributed training/serving.
  • VLLM_HOST_IP -- valid: IP address string; default: auto-detected. Node IP address used for distributed communication.
  • VLLM_PORT -- valid: integer port number; default: varies. Communication port for distributed workers; must be a plain integer, not a URI.
  • VLLM_ENABLE_V1_MULTIPROCESSING -- valid: "0" or "1"; default: enabled ("1"). Enables the V1 multiprocessing execution mode.
  • VLLM_WORKER_MULTIPROC_METHOD -- valid: "fork", "spawn"; default: "fork". Python multiprocessing start method for worker processes.
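
Because Kubernetes service discovery can overwrite VLLM_PORT with a URI (see Common Errors below), a launcher script can validate the value before starting the server. A minimal sketch; the parse_vllm_port helper is ours, not part of vLLM:

```python
import os
from typing import Optional

def parse_vllm_port(env=os.environ) -> Optional[int]:
    """Return VLLM_PORT as a plain integer, rejecting URI-shaped values.

    Kubernetes injects `tcp://<ip>:<port>` style values when a Service
    named "vllm-port" exists in the same namespace.
    """
    raw = env.get("VLLM_PORT")
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        raise ValueError(
            f"VLLM_PORT must be a plain integer, got {raw!r}; "
            "a tcp:// value usually means a Kubernetes Service name collision"
        ) from None
```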

Runtime Variables -- API/Security

  • VLLM_API_KEY -- valid: any string; default: unset. API key required by the vLLM OpenAI-compatible server for request authentication.
  • S3_ACCESS_KEY_ID -- valid: AWS-style access key; default: unset. S3 access key for the tensorizer model loader.
  • S3_SECRET_ACCESS_KEY -- valid: AWS-style secret key; default: unset. S3 secret key for the tensorizer model loader.
  • S3_ENDPOINT_URL -- valid: URL string; default: unset. S3-compatible endpoint URL for the tensorizer model loader.
  • HF_TOKEN -- valid: Hugging Face API token; default: unset. Used implicitly by huggingface_hub for gated model access.
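
When VLLM_API_KEY is set on the server, clients must present the key as a bearer token on each request. A hedged sketch of building the request headers from the client's environment (the helper is ours; pair it with any HTTP client):

```python
import os

def auth_headers(env=os.environ) -> dict:
    """Build headers for a vLLM OpenAI-compatible server guarded by VLLM_API_KEY."""
    headers = {"Content-Type": "application/json"}
    api_key = env.get("VLLM_API_KEY")
    if api_key:
        # The server expects the key as a standard bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers
```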

Runtime Variables -- Performance

  • VLLM_USE_DEEP_GEMM -- valid: "0" or "1"; default: enabled ("1"). Enables DeepGemm kernels for improved GEMM performance.
  • VLLM_USE_FLASHINFER_SAMPLER -- valid: "0" or "1"; default: unset. Uses FlashInfer's GPU-based sampler instead of the default sampler.
  • VLLM_FUSED_MOE_CHUNK_SIZE -- valid: positive integer; default: 16384 (16*1024). Chunk size for fused Mixture-of-Experts kernel execution.
  • VLLM_FLASHINFER_MOE_BACKEND -- valid: "throughput", "latency", "masked_gemm"; default: "latency". FlashInfer MoE backend selection; "throughput" favours large batches, "latency" favours small batches.
  • VLLM_DEEP_GEMM_WARMUP -- valid: "skip", "full", "relax"; default: "relax". Controls DeepGemm warmup behaviour; "skip" avoids warmup, "full" runs all plans, "relax" runs a subset.
  • VLLM_ALLOW_LONG_MAX_MODEL_LEN -- valid: any truthy value; default: unset. When set, allows max_model_len to exceed the model configuration's stated maximum context length.
  • VLLM_SKIP_P2P_CHECK -- valid: "0" or "1"; default: enabled ("1"). Skips the peer-to-peer GPU access check; useful to work around NVIDIA driver 535 series bugs.
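
Most of the "0"/"1" flags above follow the parsing convention visible in vllm/envs.py (see Code Evidence below): the literal string "1" enables, anything else disables, and the default applies when the variable is unset. A small standalone illustration; the env_flag helper is ours, not vLLM's:

```python
import os

def env_flag(name: str, default: str, env=os.environ) -> bool:
    """Parse a '0'/'1' environment flag: the literal string "1" enables."""
    return env.get(name, default) == "1"

# Defaults mirror the table: DeepGemm enabled, P2P check skipped.
use_deep_gemm = env_flag("VLLM_USE_DEEP_GEMM", "1")
skip_p2p_check = env_flag("VLLM_SKIP_P2P_CHECK", "1")
```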

Runtime Variables -- ROCm-specific

  • VLLM_ROCM_USE_AITER -- valid: "0" or "1"; default: unset. Master switch to enable AITER (AI Tensor Engine for ROCm) optimised operators.
  • VLLM_ROCM_FP8_PADDING -- valid: "0" or "1"; default: unset. Pads FP8 weight tensors for improved memory alignment on ROCm.
  • VLLM_ROCM_CUSTOM_PAGED_ATTN -- valid: "0" or "1"; default: unset. Enables the custom paged attention kernel optimised for MI3* series accelerators.

Runtime Variables -- Logging/Debug

  • VLLM_LOGGING_LEVEL -- valid: "DEBUG", "INFO", "WARNING", "ERROR"; default: "INFO". Controls the vLLM logger verbosity level.
  • VLLM_CONFIGURE_LOGGING -- valid: "0" or "1"; default: enabled ("1"). When enabled, vLLM configures the root logger on import.
  • VLLM_TRACE_FUNCTION -- valid: "0" or "1"; default: unset. Enables function call tracing for debugging execution flow.
  • VLLM_COMPUTE_NANS_IN_LOGITS -- valid: "0" or "1"; default: unset. Checks for NaN values in logits during generation; useful for debugging numerical issues.
  • VLLM_NO_USAGE_STATS -- valid: "0" or "1"; default: unset. Disables anonymous usage statistics collection.
  • VLLM_DO_NOT_TRACK -- valid: "0" or "1"; default: unset. Alternative flag to disable all tracking and telemetry.

Runtime Variables -- Cache/Storage

  • VLLM_CACHE_ROOT -- valid: directory path; default: ~/.cache/vllm. Root directory for vLLM's on-disk caches (compiled kernels, downloaded assets).
  • VLLM_CONFIG_ROOT -- valid: directory path; default: ~/.config/vllm. Root directory for vLLM configuration files.
  • VLLM_ASSETS_CACHE -- valid: directory path; default: derived from VLLM_CACHE_ROOT. Path for cached assets such as tokenizer files and model metadata.
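
The resolution order above can be sketched in plain Python: an explicit override wins, otherwise the path is derived from the cache root. The "assets" subdirectory name is an assumption for illustration; vLLM's actual layout is defined in vllm/envs.py:

```python
import os

def cache_root(env=os.environ) -> str:
    """Resolve the vLLM cache root, defaulting to ~/.cache/vllm."""
    return os.path.expanduser(env.get("VLLM_CACHE_ROOT", "~/.cache/vllm"))

def assets_cache(env=os.environ) -> str:
    """Assets cache defaults to a subdirectory of the cache root
    unless VLLM_ASSETS_CACHE overrides it explicitly."""
    default = os.path.join(cache_root(env), "assets")  # assumed subdirectory name
    return os.path.expanduser(env.get("VLLM_ASSETS_CACHE", default))
```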

Credentials

This section lists all secrets and credentials consumed by vLLM. Handle these values with care -- never hard-code them in source, commit them to version control, or log them.

  • VLLM_API_KEY -- authenticates incoming requests to the vLLM OpenAI-compatible API server. Rotate periodically; scope one key per deployment or tenant.
  • S3_ACCESS_KEY_ID -- authenticates to an S3-compatible object store for the tensorizer model loader. Use IAM roles in cloud environments where possible; scope to read-only on the model bucket.
  • S3_SECRET_ACCESS_KEY -- secret counterpart to S3_ACCESS_KEY_ID. Always pair with S3_ACCESS_KEY_ID; never expose in logs or error messages.
  • S3_ENDPOINT_URL -- S3-compatible endpoint URL (not itself a secret, but required for non-AWS S3 stores). Ensure HTTPS is used in production.
  • HF_TOKEN -- Hugging Face API token consumed by huggingface_hub for downloading gated models. Generate fine-grained tokens scoped to the required repositories; revoke after use in CI.

Best practices:

  • Inject credentials via Kubernetes Secrets, Docker secrets, or a vault system -- never via plain .env files on shared hosts.
  • Set VLLM_API_KEY in every production deployment to prevent unauthenticated access.
  • Audit HF_TOKEN usage in CI pipelines; prefer short-lived tokens.

Quick Install

N/A -- environment variables are not installed. See Implementation:Vllm_project_Vllm_Pip_Install_Vllm for vLLM installation instructions.

Code Evidence

Target Device Selection

# vllm/envs.py:463
"VLLM_TARGET_DEVICE": lambda: os.getenv("VLLM_TARGET_DEVICE", "cuda").lower(),

The device string is lowercased, so CUDA, Cuda, and cuda are all equivalent.
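
The lowercasing behaviour quoted above can be checked standalone, without vLLM installed, by mirroring the one-line lambda:

```python
import os

def target_device(env=os.environ) -> str:
    # Mirrors the quoted line from vllm/envs.py: lowercase, default "cuda".
    return env.get("VLLM_TARGET_DEVICE", "cuda").lower()
```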

GPU Memory Utilization Default

# From cache config (Pydantic model)
gpu_memory_utilization: float = Field(default=0.9, gt=0, le=1)

This is not an environment variable itself, but the default value (0.9) is frequently overridden via LLM(gpu_memory_utilization=...) and can interact with KV cache sizing driven by available GPU memory.
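
The Field constraints amount to requiring 0 < value <= 1. A plain-Python equivalent of the Pydantic check, useful for validating the value before passing it to LLM(gpu_memory_utilization=...); the helper name is ours:

```python
def validate_gpu_memory_utilization(value: float) -> float:
    """Plain-Python equivalent of Field(default=0.9, gt=0, le=1)."""
    if not (0 < value <= 1):
        raise ValueError(
            f"gpu_memory_utilization must be in (0, 1], got {value}"
        )
    return value
```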

NCCL Bug Workaround

# vllm/envs.py:555-556
# Path to the NCCL library file. It is needed because nccl>=2.19 brought
# by PyTorch contains a bug: https://github.com/NVIDIA/nccl/issues/1234
"VLLM_NCCL_SO_PATH": lambda: os.environ.get("VLLM_NCCL_SO_PATH", None),

If you encounter NCCL-related crashes or hangs in distributed mode, point this variable to a known-good NCCL shared library.

P2P Check Skip

# vllm/envs.py:866-871
# We assume drivers can report p2p status correctly.
# If the program hangs when using custom allreduce,
# potentially caused by a bug in the driver (535 series),
# it might be helpful to set VLLM_SKIP_P2P_CHECK=0
"VLLM_SKIP_P2P_CHECK": lambda: os.getenv("VLLM_SKIP_P2P_CHECK", "1") == "1",

Defaults to True (skip the check). Set to "0" to force the peer-to-peer capability check when troubleshooting custom allreduce hangs on NVIDIA driver 535 series.

Common Errors

  • VLLM_PORT appears to be a URI -- cause: Kubernetes service discovery sets VLLM_PORT to a full URI (e.g., tcp://10.0.0.1:8000) when a Service named "vllm-port" exists. Resolution: rename the Kubernetes Service to avoid the naming collision, or explicitly set VLLM_PORT to a plain integer in the pod spec.
  • ValueError: Invalid value for VLLM_TARGET_DEVICE -- cause: an unsupported device string was provided. Resolution: use one of the supported values: "cuda", "rocm", or "cpu".
  • Program hangs with custom allreduce -- cause: the NVIDIA driver 535 series has a bug in reporting peer-to-peer GPU access capabilities. Resolution: set VLLM_SKIP_P2P_CHECK=0 to force the P2P check instead of assuming it works, or upgrade the NVIDIA driver past the 535 series.

Compatibility Notes

  • CUDA vs. ROCm: ROCm-specific variables (VLLM_ROCM_*) have no effect when VLLM_TARGET_DEVICE is not "rocm". Conversely, CUDA-specific tuning (e.g., NVCC_THREADS) is ignored on ROCm builds.
  • Kubernetes: Be aware that Kubernetes Service naming conventions can inject environment variables that collide with vLLM variables (notably VLLM_PORT). Always set vLLM variables explicitly in the container spec.
  • Multiprocessing method: The "fork" default for VLLM_WORKER_MULTIPROC_METHOD may cause issues with certain CUDA driver versions or when using third-party libraries that are not fork-safe. Switch to "spawn" if you observe deadlocks at startup.
  • DeepGemm: VLLM_USE_DEEP_GEMM requires compatible hardware (NVIDIA Hopper or newer). It is silently ignored on older architectures.

Related Pages

Required By
