Environment:Vllm_project_Vllm_Environment_Variables
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Systems Engineering, GPU Computing |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Key environment variables used by vLLM for installation, runtime configuration, performance tuning, and debugging.
Description
vLLM reads a large set of environment variables defined in vllm/envs.py to control behaviour at both installation time and runtime. These variables govern target device selection, distributed execution, API security, kernel selection, logging verbosity, cache paths, and platform-specific tuning (e.g., ROCm). Understanding these variables is essential for deploying vLLM in production, debugging performance problems, and hardening multi-tenant serving environments.
Usage
Set environment variables before launching vLLM. Installation-time variables must be present when running pip install or building from source. Runtime variables must be exported in the shell or container environment before the Python process starts.
# Installation example
VLLM_TARGET_DEVICE=cuda MAX_JOBS=8 pip install vllm
# Runtime example
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_API_KEY="sk-my-secret-key"
export VLLM_LOGGING_LEVEL=DEBUG
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct
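Before launching, it can be worth validating the environment against the most common mistakes documented on this page. The following is an illustrative sketch, not part of vLLM; the helper name check_vllm_env is hypothetical:

```python
import os

def check_vllm_env() -> list:
    """Return warnings for common vLLM env-var mistakes (illustrative only)."""
    warnings = []
    port = os.getenv("VLLM_PORT")
    if port is not None and not port.isdigit():
        # Kubernetes service discovery can inject a URI such as
        # tcp://10.0.0.1:8000 here; vLLM expects a plain integer.
        warnings.append("VLLM_PORT must be a plain integer, got %r" % port)
    if not os.getenv("VLLM_API_KEY"):
        warnings.append("VLLM_API_KEY is unset; the API server will accept "
                        "unauthenticated requests")
    return warnings

os.environ["VLLM_PORT"] = "tcp://10.0.0.1:8000"  # simulate the k8s collision
for warning in check_vllm_env():
    print(warning)
```

Run such a check in the container entrypoint so misconfiguration fails fast rather than surfacing as a cryptic startup error.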
System Requirements
N/A -- this page documents environment variables, not a standalone system. See the related Implementation pages for hardware and software prerequisites.
Dependencies
- Python 3.9+ -- required to run vLLM
- CUDA Toolkit / ROCm -- the specific toolkit must match VLLM_TARGET_DEVICE
- NCCL -- used for distributed communication; a specific library build can be selected via VLLM_NCCL_SO_PATH
- huggingface_hub -- reads HF_TOKEN for gated model access
Installation-Time Variables
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_TARGET_DEVICE | "cuda", "rocm", "cpu" | "cuda" | Selects the hardware backend for compilation |
| VLLM_MAIN_CUDA_VERSION | CUDA version string | "12.9" | Main CUDA version used when building wheels |
| MAX_JOBS | Positive integer | (system-dependent) | Maximum number of parallel compilation jobs |
| NVCC_THREADS | Positive integer | (system-dependent) | Number of threads used by the NVIDIA CUDA compiler |
| VLLM_USE_PRECOMPILED | Any truthy value | (unset) | When set, uses precompiled binary extensions instead of building from source |
| CMAKE_BUILD_TYPE | "Debug", "Release", "RelWithDebInfo" | "Release" | CMake build configuration type |
Runtime Variables -- Core
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| CUDA_VISIBLE_DEVICES | Comma-separated GPU indices | (all GPUs) | Restricts which GPUs are visible to the process |
| LOCAL_RANK | Non-negative integer | 0 | Local process rank for distributed training/serving |
| VLLM_HOST_IP | IP address string | (auto-detected) | Node IP address used for distributed communication |
| VLLM_PORT | Integer port number | (varies) | Communication port for distributed workers; must be a plain integer, not a URI |
| VLLM_ENABLE_V1_MULTIPROCESSING | "0" or "1" | True | Enables the V1 multiprocessing execution mode |
| VLLM_WORKER_MULTIPROC_METHOD | "fork", "spawn" | "fork" | Python multiprocessing start method for worker processes |
Runtime Variables -- API/Security
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_API_KEY | Any string | (unset) | API key required by the vLLM OpenAI-compatible server for request authentication |
| S3_ACCESS_KEY_ID | AWS-style access key | (unset) | S3 access key for the tensorizer model loader |
| S3_SECRET_ACCESS_KEY | AWS-style secret key | (unset) | S3 secret key for the tensorizer model loader |
| S3_ENDPOINT_URL | URL string | (unset) | S3-compatible endpoint URL for the tensorizer model loader |
| HF_TOKEN | HuggingFace API token | (unset) | Used implicitly by huggingface_hub for gated model access |
Runtime Variables -- Performance
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_USE_DEEP_GEMM | "0" or "1" | True | Enables DeepGemm kernels for improved GEMM performance |
| VLLM_USE_FLASHINFER_SAMPLER | "0" or "1" | (unset) | Uses FlashInfer's GPU-based sampler instead of the default sampler |
| VLLM_FUSED_MOE_CHUNK_SIZE | Positive integer | 16384 (16*1024) | Chunk size for fused Mixture-of-Experts kernel execution |
| VLLM_FLASHINFER_MOE_BACKEND | "throughput", "latency", "masked_gemm" | "latency" | FlashInfer MoE backend selection; "throughput" favours large batches, "latency" favours small batches |
| VLLM_DEEP_GEMM_WARMUP | "skip", "full", "relax" | "relax" | Controls DeepGemm warmup behaviour; "skip" avoids warmup, "full" runs all plans, "relax" runs a subset |
| VLLM_ALLOW_LONG_MAX_MODEL_LEN | Any truthy value | (unset) | When set, allows max_model_len to exceed the model configuration's stated maximum context length |
| VLLM_SKIP_P2P_CHECK | "0" or "1" | True | Skips the peer-to-peer GPU access check; useful to work around NVIDIA driver 535 series bugs |
Runtime Variables -- ROCm-specific
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_ROCM_USE_AITER | "0" or "1" | (unset) | Master switch to enable AITER (AMD Inference TEnsoR) optimised operators on ROCm |
| VLLM_ROCM_FP8_PADDING | "0" or "1" | (unset) | Pads FP8 weight tensors for improved memory alignment on ROCm |
| VLLM_ROCM_CUSTOM_PAGED_ATTN | "0" or "1" | (unset) | Enables the custom paged attention kernel optimised for MI3* series accelerators |
Runtime Variables -- Logging/Debug
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_LOGGING_LEVEL | "DEBUG", "INFO", "WARNING", "ERROR" | "INFO" | Controls the vLLM logger verbosity level |
| VLLM_CONFIGURE_LOGGING | "0" or "1" | True | When True, vLLM configures the root logger on import |
| VLLM_TRACE_FUNCTION | "0" or "1" | (unset) | Enables function call tracing for debugging execution flow |
| VLLM_COMPUTE_NANS_IN_LOGITS | "0" or "1" | (unset) | Checks for NaN values in logits during generation; useful for debugging numerical issues |
| VLLM_NO_USAGE_STATS | "0" or "1" | (unset) | Disables anonymous usage statistics collection |
| VLLM_DO_NOT_TRACK | "0" or "1" | (unset) | Alternative flag to disable all tracking and telemetry |
Runtime Variables -- Cache/Storage
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_CACHE_ROOT | Directory path | ~/.cache/vllm | Root directory for vLLM's on-disk caches (compiled kernels, downloaded assets) |
| VLLM_CONFIG_ROOT | Directory path | ~/.config/vllm | Root directory for vLLM configuration files |
| VLLM_ASSETS_CACHE | Directory path | (derived from VLLM_CACHE_ROOT) | Path for cached assets such as tokenizer files and model metadata |
Credentials
This section lists all secrets and credentials consumed by vLLM. Handle these values with care -- never hard-code them in source, commit them to version control, or log them.
| Variable | Purpose | Rotation/Scope Notes |
|---|---|---|
| VLLM_API_KEY | Authenticates incoming requests to the vLLM OpenAI-compatible API server | Rotate periodically; scope one key per deployment or tenant |
| S3_ACCESS_KEY_ID | Authenticates to an S3-compatible object store for the tensorizer model loader | Use IAM roles in cloud environments where possible; scope to read-only on the model bucket |
| S3_SECRET_ACCESS_KEY | Secret counterpart to S3_ACCESS_KEY_ID | Always pair with S3_ACCESS_KEY_ID; never expose in logs or error messages |
| S3_ENDPOINT_URL | S3-compatible endpoint URL (not itself a secret, but required for non-AWS S3 stores) | Ensure HTTPS is used in production |
| HF_TOKEN | HuggingFace API token consumed by huggingface_hub for downloading gated models | Generate fine-grained tokens scoped to the required repositories; revoke after use in CI |
Best practices:
- Inject credentials via Kubernetes Secrets, Docker secrets, or a vault system -- never via plain .env files on shared hosts.
- Set VLLM_API_KEY in every production deployment to prevent unauthenticated access.
- Audit HF_TOKEN usage in CI pipelines; prefer short-lived tokens.
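One way to reduce the risk of leaking these values is to redact known secret variables before logging the environment. A minimal sketch; the helper and the SECRET_VARS list are illustrative, not part of vLLM:

```python
import os

# Secret-bearing variables from the table above (illustrative list).
SECRET_VARS = ("VLLM_API_KEY", "S3_ACCESS_KEY_ID",
               "S3_SECRET_ACCESS_KEY", "HF_TOKEN")

def redacted_env() -> dict:
    """Copy of os.environ that is safe to log: secret values are masked."""
    out = {}
    for key, value in os.environ.items():
        if key in SECRET_VARS and value:
            out[key] = value[:4] + "****"  # keep a short prefix for debugging
        else:
            out[key] = value
    return out

os.environ["VLLM_API_KEY"] = "sk-my-secret-key"
print(redacted_env()["VLLM_API_KEY"])  # sk-m****
```

Dump redacted_env() rather than os.environ in startup diagnostics so crash reports never contain full credentials.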
Quick Install
N/A -- environment variables are not installed. See Implementation:Vllm_project_Vllm_Pip_Install_Vllm for vLLM installation instructions.
Code Evidence
Target Device Selection
# vllm/envs.py:463
"VLLM_TARGET_DEVICE": lambda: os.getenv("VLLM_TARGET_DEVICE", "cuda").lower(),
The device string is lowercased, so CUDA, Cuda, and cuda are all equivalent.
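The effect of that accessor can be reproduced standalone; this snippet mirrors the lowercasing behaviour of the quoted lambda:

```python
import os

# Mirrors the envs.py accessor quoted above: the value is lowercased on read,
# so any capitalisation of the device name is accepted.
os.environ["VLLM_TARGET_DEVICE"] = "CUDA"
device = os.getenv("VLLM_TARGET_DEVICE", "cuda").lower()
print(device)  # cuda
```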
GPU Memory Utilization Default
# From cache config (Pydantic model)
gpu_memory_utilization: float = Field(default=0.9, gt=0, le=1)
This is not an environment variable itself, but the default value (0.9) is frequently overridden via LLM(gpu_memory_utilization=...) and can interact with KV cache sizing driven by available GPU memory.
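As a back-of-envelope illustration (all numbers here are hypothetical), the utilization cap bounds the memory pool out of which the KV cache is carved after weights are loaded; the real accounting also reserves space for activations and kernel workspace:

```python
total_gpu_mem_gib = 80.0       # hypothetical 80 GiB accelerator
weights_gib = 16.0             # hypothetical model weight footprint
gpu_memory_utilization = 0.9   # vLLM default

# vLLM will not allocate beyond this fraction of total device memory,
# so the KV cache budget is roughly what remains after the weights.
usable_gib = total_gpu_mem_gib * gpu_memory_utilization
kv_cache_budget_gib = usable_gib - weights_gib
print(kv_cache_budget_gib)  # 56.0
```

Lowering gpu_memory_utilization shrinks this budget directly, which in turn reduces the number of concurrent sequences the scheduler can serve.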
NCCL Bug Workaround
# vllm/envs.py:555-556
# Path to the NCCL library file. It is needed because nccl>=2.19 brought
# by PyTorch contains a bug: https://github.com/NVIDIA/nccl/issues/1234
"VLLM_NCCL_SO_PATH": lambda: os.environ.get("VLLM_NCCL_SO_PATH", None),
If you encounter NCCL-related crashes or hangs in distributed mode, point this variable to a known-good NCCL shared library.
P2P Check Skip
# vllm/envs.py:866-871
# We assume drivers can report p2p status correctly.
# If the program hangs when using custom allreduce,
# potentially caused by a bug in the driver (535 series),
# it might be helpful to set VLLM_SKIP_P2P_CHECK=0
"VLLM_SKIP_P2P_CHECK": lambda: os.getenv("VLLM_SKIP_P2P_CHECK", "1") == "1",
Defaults to True (skip the check). Set to "0" to force the peer-to-peer capability check when troubleshooting custom allreduce hangs on NVIDIA driver 535 series.
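Note the comparison in the quoted lambda: only the exact string "1" enables the skip, so any other value, including "true", disables it. A standalone reproduction of that parsing:

```python
def skip_p2p_check(environ: dict) -> bool:
    # Same comparison as the envs.py lambda quoted above: the skip is
    # enabled only when the variable is unset (default "1") or exactly "1".
    return environ.get("VLLM_SKIP_P2P_CHECK", "1") == "1"

print(skip_p2p_check({}))                               # True (default: skip)
print(skip_p2p_check({"VLLM_SKIP_P2P_CHECK": "0"}))     # False (run the check)
print(skip_p2p_check({"VLLM_SKIP_P2P_CHECK": "true"}))  # False -- pitfall!
```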
Common Errors
| Error / Symptom | Cause | Resolution |
|---|---|---|
| VLLM_PORT appears to be a URI | Kubernetes service discovery automatically sets VLLM_PORT to a full URI (e.g., tcp://10.0.0.1:8000) when a Service named "vllm" exists | Rename the Kubernetes Service to avoid the naming collision, or explicitly set VLLM_PORT to a plain integer in the pod spec |
| ValueError: Invalid value for VLLM_TARGET_DEVICE | An unsupported device string was provided | Use one of the supported values: "cuda", "rocm", or "cpu" |
| Program hangs with custom allreduce | NVIDIA driver 535 series has a bug in reporting peer-to-peer GPU access capabilities | Set VLLM_SKIP_P2P_CHECK=0 to force the P2P check instead of assuming it works; alternatively, upgrade the NVIDIA driver past the 535 series |
Compatibility Notes
- CUDA vs. ROCm: ROCm-specific variables (VLLM_ROCM_*) have no effect when VLLM_TARGET_DEVICE is not "rocm". Conversely, CUDA-specific tuning (e.g., NVCC_THREADS) is ignored on ROCm builds.
- Kubernetes: Be aware that Kubernetes Service naming conventions can inject environment variables that collide with vLLM variables (notably VLLM_PORT). Always set vLLM variables explicitly in the container spec.
- Multiprocessing method: The "fork" default for VLLM_WORKER_MULTIPROC_METHOD may cause issues with certain CUDA driver versions or when using third-party libraries that are not fork-safe. Switch to "spawn" if you observe deadlocks at startup.
- DeepGemm: VLLM_USE_DEEP_GEMM requires compatible hardware (NVIDIA Hopper or newer). It is silently ignored on older architectures.