Heuristic:InternLM Lmdeploy KV Cache Memory Tuning

From Leeroopedia




Knowledge Sources
Domains Optimization, Memory_Management
Last Updated 2026-02-07 15:00 GMT

Overview

A memory-management technique that uses `cache_max_entry_count` to control the percentage of free GPU memory allocated to the KV cache. The default is 0.8 (80% of free memory).

Description

The `cache_max_entry_count` parameter controls how much GPU memory LMDeploy reserves for the KV cache. Since v0.2.1, this value represents a percentage of free GPU memory (memory remaining after model weights are loaded), not of total GPU memory. The KV cache is allocated in fixed-size blocks, each consuming `cache_block_seq_len * num_layers * kv_head_num * head_dim * 2 * sizeof(dtype)` bytes. The number of blocks directly determines both how many sequences can be served concurrently and how long each sequence can be.
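As an illustration, the per-block cost can be computed directly from that formula. The dimensions below assume a LLaMA-2-7B-style model (32 layers, 32 KV heads, head_dim 128, FP16) and the default `cache_block_seq_len` of 64:

```python
def kv_block_bytes(cache_block_seq_len: int, num_layers: int,
                   kv_head_num: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes consumed by one KV cache block; the factor of 2 covers K and V."""
    return cache_block_seq_len * num_layers * kv_head_num * head_dim * 2 * dtype_bytes

# LLaMA-2-7B-style dimensions, FP16 (2 bytes per element)
print(kv_block_bytes(64, 32, 32, 128) / 2**20)  # 32.0 (MiB per block)
```

Models with grouped-query attention have fewer KV heads than attention heads, so `kv_head_num` (not the attention-head count) is what enters the formula.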

Usage

Use this heuristic when:

  • You encounter CUDA out of memory errors during inference.
  • You need to maximize concurrent requests on your GPU.
  • You are running alongside other GPU workloads and need to limit LMDeploy memory usage.
  • You want to fine-tune the memory/throughput trade-off for your specific deployment.

The Insight (Rule of Thumb)

  • Action: Adjust `cache_max_entry_count` in `TurbomindEngineConfig` or `PytorchEngineConfig`.
  • Default Value: `0.8` (80% of free GPU memory for KV cache).
  • OOM Fix: Reduce to `0.2`-`0.5` to leave more memory for model weights and activations.
  • Max Throughput: Keep at `0.8` or higher if no OOM occurs.
  • Absolute Control: Set to an integer > 0 to specify exact number of KV blocks (TurboMind only).
  • Trade-off: Lower values reduce concurrent capacity but prevent OOM. Higher values maximize throughput but risk memory pressure.

Version Warning: In lmdeploy v0.2.0 and earlier, this parameter meant a percentage of total GPU memory. From v0.2.1 onward, it means a percentage of free GPU memory. This is a critical distinction when upgrading.
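A minimal sketch of setting the parameter through LMDeploy's pipeline API (the model name is illustrative; running this requires lmdeploy installed and a GPU):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Reserve 50% of free GPU memory for the KV cache instead of the default 80%,
# leaving more headroom for activations or co-located workloads.
engine_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)
print(pipe(["How does KV cache sizing affect throughput?"]))
```

The PyTorch backend accepts the same knob via `PytorchEngineConfig(cache_max_entry_count=...)`.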

Reasoning

GPU memory is shared between model weights, activations, and the KV cache. For a 7B parameter model in FP16, weights consume ~14GB. On an 80GB A100, this leaves ~66GB free. At `cache_max_entry_count=0.8`, LMDeploy allocates 52.8GB for KV cache. Each KV block for LLaMA-2-7B with `cache_block_seq_len=64` consumes:

`64 * 32 * 32 * 128 * 2 * 2 bytes = 32MB per block`

This yields ~1650 blocks, supporting significant concurrent sequence capacity. Reducing to `0.2` yields ~412 blocks, which may be insufficient for high-concurrency serving but prevents OOM when running larger models or on smaller GPUs.
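The block-count arithmetic above can be reproduced in a few lines (decimal MB, matching the figures in the text):

```python
free_mb = 66_000   # ~80 GB A100 minus ~14 GB of FP16 weights for a 7B model
block_mb = 32      # per-block KV cost for LLaMA-2-7B at cache_block_seq_len=64

for ratio in (0.8, 0.2):
    blocks = int(free_mb * ratio // block_mb)
    print(f"cache_max_entry_count={ratio}: ~{blocks} blocks")
```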

Empirical guidance from the LMDeploy docs: start with the default (0.8). If OOM occurs, reduce to 0.2 and gradually increase until serving is stable.
