Heuristic: KServe vLLM GPU Memory Utilization
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Optimization |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
GPU memory utilization tuning for vLLM depends on the deployment pattern: 0.95 for a basic (unified) deployment, 0.97 for prefill pools, and 0.99 for decode pools in disaggregated serving.
Description
The `--gpu-memory-utilization` parameter in vLLM controls the fraction of GPU VRAM that vLLM may use for model weights and the KV cache. Higher values maximize throughput by allowing more concurrent requests, but they reduce headroom and risk OOM errors during memory spikes. The optimal value depends on the deployment pattern: basic (single pool), prefill-only, or decode-only in disaggregated serving.
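A minimal sketch of how the utilization fraction splits VRAM. The specific numbers (an 80 GiB GPU, ~14 GiB of FP16 weights for a 7B-class model, a 1 GiB activation reserve) are illustrative assumptions, not values from the KServe configs:

```python
# Sketch: how --gpu-memory-utilization bounds the VRAM vLLM will touch,
# and roughly what is left for the KV cache after weights are loaded.
# All sizes below are illustrative assumptions.

def kv_cache_budget_gib(total_vram_gib: float, utilization: float,
                        weights_gib: float, activation_reserve_gib: float = 1.0) -> float:
    """Approximate VRAM left for the KV cache: the fraction vLLM may use,
    minus model weights and a small reserve for transient activations."""
    usable = total_vram_gib * utilization
    return usable - weights_gib - activation_reserve_gib

# Assumed: 7B model in FP16 ~= 14 GiB of weights on an 80 GiB GPU.
for util in (0.95, 0.97, 0.99):
    budget = kv_cache_budget_gib(80, util, weights_gib=14)
    print(f"utilization={util}: ~{budget:.1f} GiB for KV cache")
```

Each extra 0.01 of utilization on an 80 GiB card is ~0.8 GiB that moves from OOM headroom into KV cache capacity, which is why the decode pool, with its predictable memory pattern, can claim it.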
Usage
Use this heuristic when configuring vLLM instances for LLMInferenceService deployments. Apply different utilization values depending on whether the instance handles prefill, decode, or both operations.
The Insight (Rule of Thumb)
- Action: Set `--gpu-memory-utilization` based on deployment role.
- Value:
- Basic (unified): 0.95 (conservative, handles both prefill and decode)
- Prefill pool: 0.97 (slightly lower than the decode pool, leaving headroom for temporary activation memory during parallel attention over the prompt)
- Decode pool: 0.99 (memory-intensive, sequential token generation needs maximum KV cache)
- Trade-off: Higher utilization increases throughput but reduces headroom for memory spikes. Prefill pools need more compute headroom than decode pools.
- Additional: Always pair with `--enforce-eager` to skip CUDA graph capture (freeing the memory graphs would reserve, at some latency cost) and `--max-model-len` to cap the sequence length the KV cache must accommodate.
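The rule of thumb above can be sketched as a small helper that assembles the flag list per role. The role names and the `max_model_len` defaults are illustrative; only the three flags themselves come from the source configs:

```python
# Sketch: assembling vLLM CLI args per deployment role, following the
# heuristic's values. Role names are our own labels, not vLLM concepts.

ROLE_UTILIZATION = {
    "basic": 0.95,    # unified pool: conservative, serves both phases
    "prefill": 0.97,  # headroom for activation spikes during prompt processing
    "decode": 0.99,   # stable memory pattern: maximize KV cache
}

def vllm_args(role: str, max_model_len: int = 8192) -> list[str]:
    """Return the memory-related vLLM flags for a given deployment role."""
    util = ROLE_UTILIZATION[role]
    return [
        "--gpu-memory-utilization", str(util),
        "--max-model-len", str(max_model_len),
        "--enforce-eager",
    ]

print(" ".join(vllm_args("decode", max_model_len=4096)))
# -> --gpu-memory-utilization 0.99 --max-model-len 4096 --enforce-eager
```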
Reasoning
Prefill operations process entire input prompts in parallel, which is compute-bound and creates temporary activation memory spikes. Decode operations generate tokens sequentially, which is memory-bound (reading KV cache) with stable memory patterns. Therefore:
- Decode pools can safely use 0.99 because memory usage is predictable and stable during sequential generation.
- Prefill pools should use 0.97 to leave headroom for the burst of compute-related temporary allocations during parallel prompt processing.
- Basic pools use 0.95 because they must handle both patterns unpredictably.
Evidence from KServe sample configurations:
```shell
# Basic configuration (from llm-inference-service-qwen2-7b-gpu.yaml)
--gpu-memory-utilization 0.95 --max-model-len 8192 --enforce-eager

# Decode pool (from deepseek-r1 PD config)
--gpu-memory-utilization 0.99 --max-model-len 4096 --enforce-eager

# Prefill pool (from deepseek-r1 PD config)
--gpu-memory-utilization 0.97 --max-model-len 4096 --enforce-eager
```