

Heuristic:LMCache Chunk Size And Default Config

From Leeroopedia





Knowledge Sources
Domains Configuration, Optimization
Last Updated 2026-02-09 00:00 GMT

Overview

LMCache's default configuration values, including the 256-token chunk size, 5GB CPU buffer, LRU eviction policy, and 10-second blocking timeout, balance performance with resource usage across deployment scenarios.

Description

LMCache ships with carefully chosen defaults that work well for most deployments. The 256-token chunk size matches the `CACHEGEN_GPU_MAX_TOKENS_PER_CHUNK` limit and strikes a good balance between cache granularity and overhead. The 5GB local CPU buffer (`max_local_cpu_size=5.0`) provides sufficient space for common workloads while leaving room for the serving engine. LRU is the default eviction policy. The 10-second blocking timeout (`blocking_timeout_secs=10`) prevents indefinite waits, and the local CPU backend is enabled by default (`local_cpu=True`). The 64MB pin chunk size and 1GB commit batches in lazy memory allocation balance pinning overhead against memory expansion efficiency.
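As a quick reference, the documented defaults can be collected into a plain mapping. This is an illustrative sketch, not the library's actual config object; the key names mirror the definitions quoted under Code Evidence.

```python
# Illustrative summary of the documented defaults; key names mirror
# lmcache/v1/config.py, but this dict is not the library's own structure.
LMCACHE_DEFAULTS = {
    "chunk_size": 256,             # tokens per KV-cache chunk
    "local_cpu": True,             # local CPU backend enabled
    "max_local_cpu_size": 5.0,     # CPU buffer size in GB
    "blocking_timeout_secs": 10,   # timeout for blocking backend operations
}
```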

Usage

Refer to these defaults when deploying LMCache without custom tuning or when setting baseline configurations. Adjust `chunk_size` for different sequence length distributions (smaller for short sequences, larger for long context). Increase `max_local_cpu_size` when available system memory is abundant and cache hit rates are important. The `blocking_timeout_secs` may need increasing for high-latency remote backends.
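One way to picture baseline-plus-override tuning is a resolution order of "environment variable if set, otherwise the default." The sketch below models that order; the `LMCACHE_`-prefixed variable names and the `resolve` helper are assumptions for illustration, not the library's confirmed API.

```python
import os

def resolve(name: str, default, converter):
    # Hypothetical lookup order: LMCACHE_<NAME> env var if set, else default.
    raw = os.environ.get(f"LMCACHE_{name.upper()}")
    return converter(raw) if raw is not None else default

# Baseline deployment: no override, fall back to the documented default.
baseline = resolve("chunk_size", 256, int)   # 256

# Long-context tuning: override via an environment variable.
os.environ["LMCACHE_CHUNK_SIZE"] = "512"
tuned = resolve("chunk_size", 256, int)      # 512
```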

The Insight (Rule of Thumb)

  • Chunk size: 256 tokens (default). Matches GPU processing limit for CacheGen compression.
  • CPU buffer: 5.0 GB (default). Provides room for ~20K chunks of 256 tokens at fp16.
  • Eviction policy: LRU (default). Best for request patterns with temporal locality.
  • Blocking timeout: 10 seconds (default). Prevents indefinite blocking on slow backends.
  • Memory alignment: 4096 bytes (page boundary). Standard OS page alignment for optimal VM performance.
  • Pin chunk size: 64 MB. Balances pinning overhead vs. memory expansion granularity.
  • Commit batch size: 1 GB. Reduces system call overhead for lazy memory allocation.
  • CacheBlend min tokens: 256 tokens. Minimum segment size for blending operations.
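The LRU default in the list above can be illustrated with a minimal capacity-bounded cache (a toy sketch, not LMCache's actual evictor): a hit moves the chunk to the most-recently-used end, and inserting past capacity evicts the least-recently-used chunk.

```python
from collections import OrderedDict

class LRUChunkCache:
    """Toy LRU cache keyed by chunk hash; values stand in for KV tensors."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._store: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._store:
            return None
        self._store.move_to_end(key)         # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

cache = LRUChunkCache(capacity=2)
cache.put("chunk_a", "kv_a")
cache.put("chunk_b", "kv_b")
cache.get("chunk_a")           # touch A, so B becomes least recently used
cache.put("chunk_c", "kv_c")   # exceeds capacity: evicts chunk_b
```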

Reasoning

The 256-token chunk size is a sweet spot: small enough to enable fine-grained prefix matching (improving cache hit rates), large enough to amortize per-chunk metadata and transfer overhead. It also aligns with the CacheGen GPU processing limit. The 5GB CPU buffer represents ~10% of a typical 64GB server's memory, leaving room for the serving engine and OS. LRU eviction is chosen because LLM serving workloads typically exhibit temporal locality (recent prompts are more likely to be reused). The 4KB memory alignment ensures tensors start on page boundaries for efficient DMA and RDMA operations.
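The page-alignment point can be made concrete with a round-up helper (an illustrative sketch; `align_up` is not an LMCache function):

```python
ALIGN_BYTES = 4096  # standard OS page size

def align_up(nbytes: int, align: int = ALIGN_BYTES) -> int:
    """Round an allocation size up to the next multiple of `align`."""
    return (nbytes + align - 1) // align * align

rounded = align_up(10_000)  # 12288: rounds up to three full 4 KiB pages
```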

Code Evidence

Default configuration definitions from `lmcache/v1/config.py:62-70`:

_CONFIG_DEFINITIONS: dict[str, dict[str, Any]] = {
    "chunk_size": {"type": int, "default": 256, "env_converter": int},
    "local_cpu": {"type": bool, "default": True, "env_converter": _to_bool},
    "max_local_cpu_size": {"type": float, "default": 5.0, "env_converter": float},
}

Memory alignment constant from `lmcache/v1/memory_management.py`:

ALIGN_BYTES = 4096

Lazy memory allocation constants from `lmcache/v1/lazy_memory_allocator.py:74-75`:

PIN_CHUNK_SIZE = 1 << 26  # 64 MB pin chunk
COMMIT_SIZE = 1 << 30     # Do a commit every 1 GB
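A quick arithmetic check of these constants (illustrative only): `1 << 26` is 64 MiB and `1 << 30` is 1 GiB, so one commit batch spans 16 pin chunks.

```python
PIN_CHUNK_SIZE = 1 << 26   # 64 MiB per pinned chunk
COMMIT_SIZE = 1 << 30      # commit once per 1 GiB

pins_per_commit = COMMIT_SIZE // PIN_CHUNK_SIZE  # 16 pin chunks per commit
```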

CacheGen GPU chunk limit from `lmcache/storage_backend/serde/cachegen_basics.py:18`:

CACHEGEN_GPU_MAX_TOKENS_PER_CHUNK = 256
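Because the default `chunk_size` equals this GPU limit, every full cache chunk can be compressed in a single CacheGen pass. A minimal sketch of splitting a token sequence into 256-token chunks (illustrative, not LMCache's actual chunking path):

```python
CHUNK_SIZE = 256  # matches CACHEGEN_GPU_MAX_TOKENS_PER_CHUNK

def split_into_chunks(token_ids: list[int]) -> list[list[int]]:
    # Fixed-size chunking; the final chunk may be shorter than CHUNK_SIZE.
    return [token_ids[i:i + CHUNK_SIZE]
            for i in range(0, len(token_ids), CHUNK_SIZE)]

chunks = split_into_chunks(list(range(1000)))
# 1000 tokens -> 4 chunks: three of 256 tokens plus one of 232
```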
