# Heuristic: vLLM GPU Memory Utilization Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Inference, Memory Management |
| Last Updated | 2026-02-08 00:00 GMT |
## Overview
Tune `--gpu-memory-utilization` (default 0.9) to control how much GPU memory vLLM pre-allocates for the KV cache, balancing throughput against OOM risk.
## Description
vLLM pre-allocates a fixed fraction of GPU memory for its PagedAttention KV cache during engine initialization. The `gpu_memory_utilization` parameter (default 0.9) controls this fraction. More allocated memory means more KV cache blocks, so more concurrent requests can be served, yielding higher throughput. Setting it too high, however, risks OOM errors, especially when sharing a GPU with other processes or running multiple vLLM instances. Complementary controls include `cpu_offload_gb` (which virtually extends GPU memory by offloading to CPU), `swap_space` (CPU swap per GPU, default 4 GiB), and `cache_dtype` (an FP8 KV cache reduces per-block memory).
## Usage
Apply this heuristic when launching a vLLM server or creating an LLM engine instance. The setting is per-instance and does not account for other vLLM instances on the same GPU. Adjust based on deployment scenario: single-instance serving, multi-instance GPU sharing, memory-constrained environments, or maximum-throughput configurations.
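As a concrete illustration of the multi-instance scenario, the flag can be passed at server launch. This is a sketch, not a verified deployment: the model name and port are placeholders.

```
# Launch one of two vLLM servers sharing a single GPU, giving this
# instance ~45% of GPU memory (model name and port are placeholders).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.45 \
    --port 8000
```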
## The Insight (Rule of Thumb)
- Action: Set `--gpu-memory-utilization` based on deployment scenario (default is 0.9).
- Single-instance serving: Keep at 0.9 (the default). This reserves 10% headroom for non-KV-cache GPU operations.
- Maximum throughput: Increase to 0.95 if no other GPU consumers exist and the model fits comfortably. More cache blocks = more concurrent requests.
- Multi-instance (GPU sharing): Lower to 0.45-0.5 per instance when running two vLLM instances on the same GPU.
- OOM recovery: Reduce to 0.7-0.8, or use `--cpu-offload-gb` to virtually extend GPU memory (e.g., setting `--cpu-offload-gb 10` on a 24 GB GPU makes it behave like a 34 GB GPU).
- Memory-efficient KV cache: Use `--kv-cache-dtype fp8_e4m3` (requires CUDA 11.8+) to roughly halve KV cache memory per block, allowing more cache blocks at the same utilization level.
- Trade-off: Higher utilization = more KV cache slots = higher throughput, but increased risk of OOM. Lower utilization = more stable but lower concurrency.
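The scenario-to-setting mapping above can be condensed into a small lookup helper. This is a hypothetical function (not part of vLLM); the values simply restate the rule of thumb:

```python
def recommend_gpu_memory_utilization(scenario: str) -> float:
    """Map a deployment scenario to a starting value for
    --gpu-memory-utilization, per the rule of thumb above.
    Hypothetical helper; tune the result for your workload."""
    recommendations = {
        "single_instance": 0.90,  # vLLM default: 10% headroom
        "max_throughput": 0.95,   # no other GPU consumers
        "two_instances": 0.45,    # partition the GPU budget
        "oom_recovery": 0.75,     # midpoint of the 0.7-0.8 range
    }
    return recommendations[scenario]
```

The returned value is a starting point, not a guarantee: actual headroom depends on model size, CUDA context overhead, and other GPU consumers.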
## Reasoning
vLLM uses PagedAttention, which manages KV cache memory in fixed-size blocks (similar to virtual memory pages). At initialization, vLLM profiles the model's memory footprint, then allocates all remaining GPU memory up to the `gpu_memory_utilization` fraction for KV cache blocks. This is a one-time, upfront allocation: vLLM does not dynamically grow or shrink the cache pool at runtime.
The consequence is straightforward: more memory allocated to KV cache = more blocks available = more sequences can be processed concurrently = higher throughput. The default of 0.9 is a well-tested balance that works for most single-instance deployments, leaving 10% headroom for CUDA context overhead, temporary buffers, and other GPU consumers.
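The memory-to-blocks relationship can be sketched with back-of-the-envelope arithmetic. All numbers here are assumptions for illustration (GPU size, model footprint, and per-block size are made up); vLLM profiles the real footprint at initialization:

```python
def kv_cache_blocks(total_gpu_gib: float, utilization: float,
                    model_footprint_gib: float, block_gib: float) -> int:
    """Rough estimate of KV cache blocks that fit after the model's
    footprint is subtracted from the utilization budget.
    Illustrative only; vLLM measures the real footprint at startup."""
    budget = total_gpu_gib * utilization - model_footprint_gib
    return max(0, round(budget / block_gib))

# Assumed numbers: 80 GiB GPU, 15 GiB model footprint, 0.01 GiB/block.
blocks_default = kv_cache_blocks(80, 0.90, 15, 0.01)
blocks_high = kv_cache_blocks(80, 0.95, 15, 0.01)
# Halving per-block memory (e.g., an fp8 KV cache) doubles the block
# count at the same utilization level.
blocks_fp8 = kv_cache_blocks(80, 0.90, 15, 0.005)
```

Note how raising utilization from 0.90 to 0.95 adds blocks linearly, while halving the per-block size doubles the count outright.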
The per-instance nature of this setting (documented in `cache.py:53-56`) is a critical detail: vLLM does not coordinate memory usage across instances. If two vLLM instances each request 0.9 of GPU memory, the second will OOM. Operators must manually partition the memory budget.
The `cpu_offload_gb` parameter provides an elegant escape hatch for memory-constrained scenarios. Rather than reducing the effective KV cache size (which hurts throughput), it moves some data to CPU memory, maintaining the logical cache capacity at the cost of CPU-GPU transfer latency.
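The "virtual GPU" framing from the `cache.py` docstring is simple additive arithmetic; a minimal sketch (the helper name is hypothetical):

```python
def effective_gpu_gib(physical_gib: float, cpu_offload_gb: float) -> float:
    """Effective memory the engine can treat as available when
    cpu_offload_gb moves part of the working set to CPU RAM,
    mirroring the docstring's 24 GB + 10 GB -> 34 GB example.
    Hypothetical helper for illustration."""
    return physical_gib + cpu_offload_gb
```

The extra capacity is not free: data offloaded to CPU must cross the PCIe bus on access, so latency-sensitive deployments should prefer a smaller model or lower concurrency over heavy offloading.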
## Code Evidence
Default `gpu_memory_utilization` from `vllm/config/cache.py:49`:

```python
gpu_memory_utilization: float = Field(default=0.9, gt=0, le=1)
"""The fraction of GPU memory to be used for the model executor, which can
range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory
utilization."""
```
Per-instance scope from `vllm/config/cache.py:53-56`:

> This is a per-instance limit, and only applies to the current vLLM instance. It does not matter if you have another vLLM instance running on the same GPU.
CPU offloading as virtual GPU extension from `vllm/config/cache.py:95-100`:

```python
cpu_offload_gb: float = Field(default=0, ge=0)
"""The space in GiB to offload to CPU, per GPU. Default is 0, which means
no offloading. Intuitively, this argument can be seen as a virtual way to
increase the GPU memory size. For example, if you have one 24 GB GPU and
set this to 10, virtually you can think of it as a 34 GB GPU."""
```
Swap space default from `vllm/config/cache.py:57-58`:

```python
swap_space: float = Field(default=4, ge=0)
"""Size of the CPU swap space per GPU (in GiB)."""
```
FP8 KV cache option from `vllm/config/cache.py:59-62`:

```python
cache_dtype: CacheDType = "auto"
"""Data type for kv cache storage. If "auto", will use model data type.
CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2."""
```