Environment:Vllm_project_Vllm_Environment_Variables
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Systems Engineering, GPU Computing |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Key environment variables used by vLLM for installation, runtime configuration, performance tuning, and debugging.
Description
vLLM reads a large set of environment variables defined in vllm/envs.py to control behaviour at both installation time and runtime. These variables govern target device selection, distributed execution, API security, kernel selection, logging verbosity, cache paths, and platform-specific tuning (e.g., ROCm). Understanding these variables is essential for deploying vLLM in production, debugging performance problems, and hardening multi-tenant serving environments.
Usage
Set environment variables before launching vLLM. Installation-time variables must be present when running pip install or building from source. Runtime variables must be exported in the shell or container environment before the Python process starts.
# Installation example
VLLM_TARGET_DEVICE=cuda MAX_JOBS=8 pip install vllm
# Runtime example
export CUDA_VISIBLE_DEVICES=0,1
export VLLM_API_KEY="sk-my-secret-key"
export VLLM_LOGGING_LEVEL=DEBUG
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct
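Before launching, it can be worth validating the environment against the most common mistakes documented on this page. The following is an illustrative sketch, not part of vLLM; the helper name check_vllm_env is hypothetical:

```python
import os

def check_vllm_env() -> list:
    """Return warnings for common vLLM env-var mistakes (illustrative only)."""
    warnings = []
    port = os.getenv("VLLM_PORT")
    if port is not None and not port.isdigit():
        # Kubernetes service discovery can inject a URI such as
        # tcp://10.0.0.1:8000 here; vLLM expects a plain integer.
        warnings.append("VLLM_PORT must be a plain integer, got %r" % port)
    if not os.getenv("VLLM_API_KEY"):
        warnings.append("VLLM_API_KEY is unset; the API server will accept "
                        "unauthenticated requests")
    return warnings

os.environ["VLLM_PORT"] = "tcp://10.0.0.1:8000"  # simulate the k8s collision
for warning in check_vllm_env():
    print(warning)
```

Run such a check in the container entrypoint so misconfiguration fails fast rather than surfacing as a cryptic startup error.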
System Requirements
N/A -- this page documents environment variables, not a standalone system. See the related Implementation pages for hardware and software prerequisites.
Dependencies
- Python 3.9+ -- required to run vLLM
- CUDA Toolkit / ROCm -- the specific toolkit must match VLLM_TARGET_DEVICE
- NCCL -- used for distributed communication; a specific library build can be selected via VLLM_NCCL_SO_PATH
- huggingface_hub -- reads HF_TOKEN for gated model access
Installation-Time Variables
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_TARGET_DEVICE | "cuda", "rocm", "cpu" | "cuda" | Selects the hardware backend for compilation |
| VLLM_MAIN_CUDA_VERSION | CUDA version string | "12.9" | Main CUDA version used when building wheels |
| MAX_JOBS | Positive integer | (system-dependent) | Maximum number of parallel compilation jobs |
| NVCC_THREADS | Positive integer | (system-dependent) | Number of threads used by the NVIDIA CUDA compiler |
| VLLM_USE_PRECOMPILED | Any truthy value | (unset) | When set, uses precompiled binary extensions instead of building from source |
| CMAKE_BUILD_TYPE | "Debug", "Release", "RelWithDebInfo" | "Release" | CMake build configuration type |
Runtime Variables -- Core
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| CUDA_VISIBLE_DEVICES | Comma-separated GPU indices | (all GPUs) | Restricts which GPUs are visible to the process |
| LOCAL_RANK | Non-negative integer | 0 | Local process rank for distributed training/serving |
| VLLM_HOST_IP | IP address string | (auto-detected) | Node IP address used for distributed communication |
| VLLM_PORT | Integer port number | (varies) | Communication port for distributed workers; must be a plain integer, not a URI |
| VLLM_ENABLE_V1_MULTIPROCESSING | "0" or "1" | True | Enables the V1 multiprocessing execution mode |
| VLLM_WORKER_MULTIPROC_METHOD | "fork", "spawn" | "fork" | Python multiprocessing start method for worker processes |
Runtime Variables -- API/Security
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_API_KEY | Any string | (unset) | API key required by the vLLM OpenAI-compatible server for request authentication |
| S3_ACCESS_KEY_ID | AWS-style access key | (unset) | S3 access key for the tensorizer model loader |
| S3_SECRET_ACCESS_KEY | AWS-style secret key | (unset) | S3 secret key for the tensorizer model loader |
| S3_ENDPOINT_URL | URL string | (unset) | S3-compatible endpoint URL for the tensorizer model loader |
| HF_TOKEN | HuggingFace API token | (unset) | Used implicitly by huggingface_hub for gated model access |
Runtime Variables -- Performance
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_USE_DEEP_GEMM | "0" or "1" | True | Enables DeepGemm kernels for improved GEMM performance |
| VLLM_USE_FLASHINFER_SAMPLER | "0" or "1" | (unset) | Uses FlashInfer's GPU-based sampler instead of the default sampler |
| VLLM_FUSED_MOE_CHUNK_SIZE | Positive integer | 16384 (16*1024) | Chunk size for fused Mixture-of-Experts kernel execution |
| VLLM_FLASHINFER_MOE_BACKEND | "throughput", "latency", "masked_gemm" | "latency" | FlashInfer MoE backend selection; "throughput" favours large batches, "latency" favours small batches |
| VLLM_DEEP_GEMM_WARMUP | "skip", "full", "relax" | "relax" | Controls DeepGemm warmup behaviour; "skip" avoids warmup, "full" runs all plans, "relax" runs a subset |
| VLLM_ALLOW_LONG_MAX_MODEL_LEN | Any truthy value | (unset) | When set, allows max_model_len to exceed the model configuration's stated maximum context length |
| VLLM_SKIP_P2P_CHECK | "0" or "1" | True | Skips the peer-to-peer GPU access check; useful to work around NVIDIA driver 535 series bugs |
Runtime Variables -- ROCm-specific
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_ROCM_USE_AITER | "0" or "1" | (unset) | Master switch to enable AITER (AMD Inference TEnsoR) optimised operators on ROCm |
| VLLM_ROCM_FP8_PADDING | "0" or "1" | (unset) | Pads FP8 weight tensors for improved memory alignment on ROCm |
| VLLM_ROCM_CUSTOM_PAGED_ATTN | "0" or "1" | (unset) | Enables the custom paged attention kernel optimised for MI3* series accelerators |
Runtime Variables -- Logging/Debug
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_LOGGING_LEVEL | "DEBUG", "INFO", "WARNING", "ERROR" | "INFO" | Controls the vLLM logger verbosity level |
| VLLM_CONFIGURE_LOGGING | "0" or "1" | True | When True, vLLM configures the root logger on import |
| VLLM_TRACE_FUNCTION | "0" or "1" | (unset) | Enables function call tracing for debugging execution flow |
| VLLM_COMPUTE_NANS_IN_LOGITS | "0" or "1" | (unset) | Checks for NaN values in logits during generation; useful for debugging numerical issues |
| VLLM_NO_USAGE_STATS | "0" or "1" | (unset) | Disables anonymous usage statistics collection |
| VLLM_DO_NOT_TRACK | "0" or "1" | (unset) | Alternative flag to disable all tracking and telemetry |
Runtime Variables -- Cache/Storage
| Variable | Valid Values | Default | Description |
|---|---|---|---|
| VLLM_CACHE_ROOT | Directory path | ~/.cache/vllm | Root directory for vLLM's on-disk caches (compiled kernels, downloaded assets) |
| VLLM_CONFIG_ROOT | Directory path | ~/.config/vllm | Root directory for vLLM configuration files |
| VLLM_ASSETS_CACHE | Directory path | (derived from VLLM_CACHE_ROOT) | Path for cached assets such as tokenizer files and model metadata |
Credentials
This section lists all secrets and credentials consumed by vLLM. Handle these values with care -- never hard-code them in source, commit them to version control, or log them.
| Variable | Purpose | Rotation/Scope Notes |
|---|---|---|
| VLLM_API_KEY | Authenticates incoming requests to the vLLM OpenAI-compatible API server | Rotate periodically; scope one key per deployment or tenant |
| S3_ACCESS_KEY_ID | Authenticates to an S3-compatible object store for the tensorizer model loader | Use IAM roles in cloud environments where possible; scope to read-only on the model bucket |
| S3_SECRET_ACCESS_KEY | Secret counterpart to S3_ACCESS_KEY_ID | Always pair with S3_ACCESS_KEY_ID; never expose in logs or error messages |
| S3_ENDPOINT_URL | S3-compatible endpoint URL (not itself a secret, but required for non-AWS S3 stores) | Ensure HTTPS is used in production |
| HF_TOKEN | HuggingFace API token consumed by huggingface_hub for downloading gated models | Generate fine-grained tokens scoped to the required repositories; revoke after use in CI |
Best practices:
- Inject credentials via Kubernetes Secrets, Docker secrets, or a vault system -- never via plain .env files on shared hosts.
- Set VLLM_API_KEY in every production deployment to prevent unauthenticated access.
- Audit HF_TOKEN usage in CI pipelines; prefer short-lived tokens.
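One way to reduce the risk of leaking these values is to redact known secret variables before logging the environment. A minimal sketch; the helper and the SECRET_VARS list are illustrative, not part of vLLM:

```python
import os

# Secret-bearing variables from the table above (illustrative list).
SECRET_VARS = ("VLLM_API_KEY", "S3_ACCESS_KEY_ID",
               "S3_SECRET_ACCESS_KEY", "HF_TOKEN")

def redacted_env() -> dict:
    """Copy of os.environ that is safe to log: secret values are masked."""
    out = {}
    for key, value in os.environ.items():
        if key in SECRET_VARS and value:
            out[key] = value[:4] + "****"  # keep a short prefix for debugging
        else:
            out[key] = value
    return out

os.environ["VLLM_API_KEY"] = "sk-my-secret-key"
print(redacted_env()["VLLM_API_KEY"])  # sk-m****
```

Dump redacted_env() rather than os.environ in startup diagnostics so crash reports never contain full credentials.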
Quick Install
N/A -- environment variables are not installed. See Implementation:Vllm_project_Vllm_Pip_Install_Vllm for vLLM installation instructions.
Code Evidence
Target Device Selection
# vllm/envs.py:463
"VLLM_TARGET_DEVICE": lambda: os.getenv("VLLM_TARGET_DEVICE", "cuda").lower(),
The device string is lowercased, so CUDA, Cuda, and cuda are all equivalent.
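The effect of that accessor can be reproduced standalone; this snippet mirrors the lowercasing behaviour of the quoted lambda:

```python
import os

# Mirrors the envs.py accessor quoted above: the value is lowercased on read,
# so any capitalisation of the device name is accepted.
os.environ["VLLM_TARGET_DEVICE"] = "CUDA"
device = os.getenv("VLLM_TARGET_DEVICE", "cuda").lower()
print(device)  # cuda
```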
GPU Memory Utilization Default
# From cache config (Pydantic model)
gpu_memory_utilization: float = Field(default=0.9, gt=0, le=1)
This is not an environment variable itself, but the default value (0.9) is frequently overridden via LLM(gpu_memory_utilization=...) and can interact with KV cache sizing driven by available GPU memory.
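As a back-of-envelope illustration (all numbers here are hypothetical), the utilization cap bounds the memory pool out of which the KV cache is carved after weights are loaded; the real accounting also reserves space for activations and kernel workspace:

```python
total_gpu_mem_gib = 80.0       # hypothetical 80 GiB accelerator
weights_gib = 16.0             # hypothetical model weight footprint
gpu_memory_utilization = 0.9   # vLLM default

# vLLM will not allocate beyond this fraction of total device memory,
# so the KV cache budget is roughly what remains after the weights.
usable_gib = total_gpu_mem_gib * gpu_memory_utilization
kv_cache_budget_gib = usable_gib - weights_gib
print(kv_cache_budget_gib)  # 56.0
```

Lowering gpu_memory_utilization shrinks this budget directly, which in turn reduces the number of concurrent sequences the scheduler can serve.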
NCCL Bug Workaround
# vllm/envs.py:555-556
# Path to the NCCL library file. It is needed because nccl>=2.19 brought
# by PyTorch contains a bug: https://github.com/NVIDIA/nccl/issues/1234
"VLLM_NCCL_SO_PATH": lambda: os.environ.get("VLLM_NCCL_SO_PATH", None),
If you encounter NCCL-related crashes or hangs in distributed mode, point this variable to a known-good NCCL shared library.
P2P Check Skip
# vllm/envs.py:866-871
# We assume drivers can report p2p status correctly.
# If the program hangs when using custom allreduce,
# potentially caused by a bug in the driver (535 series),
# it might be helpful to set VLLM_SKIP_P2P_CHECK=0
"VLLM_SKIP_P2P_CHECK": lambda: os.getenv("VLLM_SKIP_P2P_CHECK", "1") == "1",
Defaults to True (skip the check). Set to "0" to force the peer-to-peer capability check when troubleshooting custom allreduce hangs on NVIDIA driver 535 series.
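Note the comparison in the quoted lambda: only the exact string "1" enables the skip, so any other value, including "true", disables it. A standalone reproduction of that parsing:

```python
def skip_p2p_check(environ: dict) -> bool:
    # Same comparison as the envs.py lambda quoted above: the skip is
    # enabled only when the variable is unset (default "1") or exactly "1".
    return environ.get("VLLM_SKIP_P2P_CHECK", "1") == "1"

print(skip_p2p_check({}))                               # True (default: skip)
print(skip_p2p_check({"VLLM_SKIP_P2P_CHECK": "0"}))     # False (run the check)
print(skip_p2p_check({"VLLM_SKIP_P2P_CHECK": "true"}))  # False -- pitfall!
```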
Common Errors
| Error / Symptom | Cause | Resolution |
|---|---|---|
| VLLM_PORT appears to be a URI | Kubernetes service discovery automatically sets VLLM_PORT to a full URI (e.g., tcp://10.0.0.1:8000) when a Service named "vllm" exists | Rename the Kubernetes Service to avoid the naming collision, or explicitly set VLLM_PORT to a plain integer in the pod spec |
| ValueError: Invalid value for VLLM_TARGET_DEVICE | An unsupported device string was provided | Use one of the supported values: "cuda", "rocm", or "cpu" |
| Program hangs with custom allreduce | NVIDIA driver 535 series has a bug in reporting peer-to-peer GPU access capabilities | Set VLLM_SKIP_P2P_CHECK=0 to force the P2P check instead of assuming it works; alternatively, upgrade the NVIDIA driver past the 535 series |
Compatibility Notes
- CUDA vs. ROCm: ROCm-specific variables (VLLM_ROCM_*) have no effect when VLLM_TARGET_DEVICE is not "rocm". Conversely, CUDA-specific tuning (e.g., NVCC_THREADS) is ignored on ROCm builds.
- Kubernetes: Be aware that Kubernetes Service naming conventions can inject environment variables that collide with vLLM variables (notably VLLM_PORT). Always set vLLM variables explicitly in the container spec.
- Multiprocessing method: The "fork" default for VLLM_WORKER_MULTIPROC_METHOD may cause issues with certain CUDA driver versions or when using third-party libraries that are not fork-safe. Switch to "spawn" if you observe deadlocks at startup.
- DeepGemm: VLLM_USE_DEEP_GEMM requires compatible hardware (NVIDIA Hopper or newer). It is silently ignored on older architectures.