
Heuristic: vLLM KV Cache Block Size Selection (vLLM project)

From Leeroopedia



Knowledge Sources
Domains Optimization, Inference, Memory Management
Last Updated 2026-02-08 00:00 GMT

Overview

Leave the KV cache block size at the default (16 for CUDA) unless you are using an MLA attention backend, which auto-forces the block size its kernels require. The choice of block size affects memory efficiency, internal fragmentation, and compatibility with specialized attention kernels.

Description

vLLM manages KV cache memory using a block-based allocator, where each block holds a fixed number of token slots. The block size determines the granularity of memory allocation: larger blocks reduce per-block metadata overhead but waste more memory when sequences do not fill a complete block. The valid block sizes are 1, 8, 16, 32, 64, 128, and 256, but CUDA devices support only block sizes up to 32 for standard attention backends. MLA (Multi-head Latent Attention) backends, used by DeepSeek-V2/V3 models, override the block size automatically to satisfy their kernel alignment requirements.
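To make the fragmentation trade-off concrete, here is a small illustrative sketch (not vLLM code) that counts blocks allocated and token slots wasted for a batch of sequence lengths at different block sizes:

```python
import math

def kv_block_stats(seq_lens, block_size):
    """Count blocks allocated and token slots lost to internal
    fragmentation for a batch of sequence lengths."""
    blocks = sum(math.ceil(n / block_size) for n in seq_lens)
    used = sum(seq_lens)
    wasted = blocks * block_size - used
    return blocks, wasted

# Short, variable-length sequences waste more slots at larger block sizes.
seq_lens = [5, 17, 100, 3]
for bs in (8, 16, 32, 128):
    blocks, wasted = kv_block_stats(seq_lens, bs)
    print(f"block_size={bs:3d}: {blocks} blocks, {wasted} wasted slots")
```

With these lengths, block size 16 wastes 51 slots across 11 blocks, while block size 128 wastes 387 slots across 4 blocks: fewer blocks to track, but far more stranded memory.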

Additionally, vLLM supports FP8 KV cache quantization (fp8_e4m3 or fp8_e5m2) on CUDA 11.8+ and ROCm GPUs to halve KV cache memory consumption. Prefix caching is enabled by default and benefits from consistent block sizes across requests with shared prefixes.
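The memory impact of FP8 quantization follows directly from the per-token KV footprint. A back-of-the-envelope sketch, assuming a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128; these numbers are illustrative, not taken from the source):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # Factor of 2 for the separate K and V tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

fp16_bytes = kv_bytes_per_token(32, 32, 128, 2)  # fp16/bf16: 2 bytes/element
fp8_bytes = kv_bytes_per_token(32, 32, 128, 1)   # fp8_e4m3:  1 byte/element
print(fp16_bytes, fp8_bytes, fp16_bytes / fp8_bytes)
```

Halving the element width halves the KV cache footprint, which directly doubles the number of cacheable tokens at a fixed memory budget.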

Usage

Apply this heuristic when configuring vLLM for serving or offline inference. It is relevant for:

  • Tuning memory efficiency for workloads with highly variable sequence lengths
  • Deploying MLA-based models (DeepSeek-V2, DeepSeek-V3) where block size is auto-forced
  • Deciding whether to enable FP8 KV cache quantization for memory-constrained deployments
  • Understanding why your configured block size was silently overridden by an MLA backend

The Insight (Rule of Thumb)

  • Action: Leave block_size at the default (16 for CUDA) unless you have a specific reason to change it.
  • MLA models (DeepSeek-V2/V3): The block size is auto-forced by the attention backend -- FlashMLA and FlashMLASparse force 64, CUTLASS_MLA forces 128, FlashInferMLA forces 64.
  • Larger block sizes (32, 64, 128): Less per-block metadata overhead, but more wasted space (internal fragmentation) for short or variable-length sequences.
  • Smaller block sizes (1, 8): More flexible allocation, better for variable-length sequences, but higher metadata overhead and more block table entries.
  • FP8 KV cache: Use --kv-cache-dtype fp8_e4m3 to halve KV cache memory on GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
  • Prefix caching: Enabled by default (enable_prefix_caching=True). Benefits from block-aligned prefix boundaries for cache hit efficiency.
  • CUDA constraint: Standard CUDA attention kernels only support block sizes up to 32. Sizes 64+ are only used by MLA backends.
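Putting the rules above together, a hedged example launch command (flag names are from vLLM's CLI as the author understands it; the model name is illustrative, and flags should be verified against your installed version):

```shell
# Default block size (16 on CUDA); FP8 KV cache for Ada/Hopper GPUs.
# For MLA models (DeepSeek-V2/V3), --block-size is overridden by the backend.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --block-size 16 \
    --kv-cache-dtype fp8_e4m3 \
    --enable-prefix-caching
```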

Reasoning

The default block size of 16 for CUDA represents a well-tested balance between allocation granularity and overhead. The vLLM engine sets this default early during platform initialization before any model-specific logic runs.

Default block size initialization from vllm/platforms/cuda.py:163-164:

cache_config = vllm_config.cache_config
if cache_config and cache_config.block_size is None:
    cache_config.block_size = 16

Valid block sizes defined in vllm/config.py (cache.py):22:

BlockSize = Literal[1, 8, 16, 32, 64, 128, 256]

CUDA block size constraint from vllm/config.py (cache.py):43-44:

"""Size of a contiguous cache block in number of tokens. On CUDA devices,
only block sizes up to 32 are supported."""

MLA backends override the block size to meet kernel alignment requirements. This override happens silently with a log message, so users who set a custom block size may find it ignored.

MLA backend block size forcing from vllm/platforms/cuda.py:225-233:

if (
    use_flashmla
    and is_flashmla_dense_supported()[0]
    and cache_config.block_size % 64 != 0
):
    cache_config.block_size = 64
    logger.info("Forcing kv cache block size to 64 for FlashMLA backend.")

if use_cutlass_mla and cache_config.block_size % 128 != 0:
    cache_config.block_size = 128
    logger.info("Forcing kv cache block size to 128 for CUTLASS_MLA backend.")

FP8 KV cache support from vllm/config.py (cache.py):59-65:

cache_dtype: CacheDType = "auto"
"""CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. ROCm (AMD GPU) supports
fp8 (=fp8_e4m3)."""

Prefix caching default from vllm/config.py (cache.py):76-77:

enable_prefix_caching: bool = True
"""Whether to enable prefix caching."""
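Block alignment matters for prefix caching because reuse happens at block granularity: only complete blocks of a shared prefix can be served from cache. A simplified model (this sketch abstracts away vLLM's actual block hashing):

```python
def cacheable_prefix_tokens(shared_prefix_len, block_size):
    """Only full blocks of a shared prefix are reusable by the prefix
    cache; the trailing partial block must be recomputed."""
    return (shared_prefix_len // block_size) * block_size

# A 1000-token shared system prompt:
print(cacheable_prefix_tokens(1000, 16))   # 992 tokens reusable
print(cacheable_prefix_tokens(1000, 128))  # 896 tokens reusable
```

Smaller blocks capture more of the shared prefix, another reason the default of 16 is a reasonable middle ground.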

Known issue -- Hybrid KV cache latency regression from vllm/envs.py:1380-1385:

# Currently using the Hybrid KV cache manager with chunked local attention
# in the Llama4 models causes a latency regression. For this reason, we disable it by default.
# TODO(lucas): Remove this flag once latency regression is resolved.

The hybrid KV cache manager with chunked local attention (used by Llama-4 models) is therefore disabled by default due to a known latency regression, which is one more reason to prefer conservative defaults when configuring KV cache parameters.
