Heuristic: LMCache CacheGen Quantization Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Compression, LLMs |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A model-family-tuned quantization strategy for CacheGen KV-cache compression: early layers get more quantization bins (higher precision), later layers are quantized more aggressively, and the V-cache is quantized more aggressively than the K-cache.
Description
CacheGen compresses KV cache using arithmetic coding with layer-dependent quantization bins. The quantization strategy is tuned per model family: for 7B/8B models (Mistral, Llama-3.1, Qwen), K-cache uses 32 bins for layers 0-10 and 16 bins for layers 10+, while V-cache uses 32 bins only for layers 0-2 and 16 bins for the rest. This asymmetric strategy reflects the empirical finding that early layers carry more critical information and that K-cache is more sensitive to quantization noise than V-cache. For small models (< 10 layers), all layers use 32 bins (no aggressive quantization). GPU processing is limited to 256 tokens per chunk (`CACHEGEN_GPU_MAX_TOKENS_PER_CHUNK`).
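The layer-range-to-bins mapping described above can be sketched as a small lookup. `QuantizationSpec` and its field names follow the code evidence in this note; `bins_for_layer` is a hypothetical helper added here for illustration, assuming half-open `[start_layer, end_layer)` ranges.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QuantizationSpec:
    start_layer: int  # inclusive
    end_layer: int    # exclusive
    bins: int

def bins_for_layer(specs: List[QuantizationSpec], layer: int) -> int:
    """Hypothetical helper: resolve the bin count for one layer."""
    for spec in specs:
        if spec.start_layer <= layer < spec.end_layer:
            return spec.bins
    raise ValueError(f"layer {layer} not covered by any spec")

# 7B-family K-cache schedule from the source: 32 bins for layers 0-10,
# 16 bins for layers 10-32.
kspecs = [
    QuantizationSpec(start_layer=0, end_layer=10, bins=32),
    QuantizationSpec(start_layer=10, end_layer=32, bins=16),
]
```

With this schedule, `bins_for_layer(kspecs, 9)` resolves to 32 and `bins_for_layer(kspecs, 10)` to 16, matching the layer-10 switch point.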
Usage
This heuristic applies when enabling CacheGen compression for KV cache storage. The quantization config is automatically selected based on model name via `CacheGenConfig.from_model_name()`. If your model is not in the recognized family lists, the system will attempt to auto-detect via `AutoConfig.from_pretrained()` and apply the default 10-layer threshold strategy. Custom tuning (marked as TODO in the code) may improve quality for untested model families, particularly the 9B GLM family, which the code notes "needs tuning for better quality."
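The dispatch just described can be sketched as follows. `select_thresholds` is a hypothetical stand-in for `CacheGenConfig.from_model_name()`: it returns only the layer indices where bins drop from 32 to 16 and omits the `AutoConfig.from_pretrained()` auto-detection path.

```python
FAMILY_7B = ["mistralai/Mistral-7B-Instruct-v0.2",
             "lmsys/longchat-7b-16k", "Qwen/Qwen-7B"]
FAMILY_8B = ["meta-llama/Llama-3.1-8B-Instruct"]

def select_thresholds(model_name: str, num_layers: int) -> dict:
    """Hypothetical sketch: layer indices where bins switch 32 -> 16."""
    if model_name in FAMILY_7B + FAMILY_8B:
        return {"k_switch": 10, "v_switch": 2}
    if num_layers < 10:
        # Small models: 32 bins for every layer, i.e. no switch point.
        return {"k_switch": num_layers, "v_switch": num_layers}
    # Fallback: the default 10-layer threshold strategy.
    return {"k_switch": 10, "v_switch": 2}
```

A recognized 7B model and an unrecognized deep model both land on the layer-10 K-cache threshold; only sub-10-layer models escape aggressive quantization entirely.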
The Insight (Rule of Thumb)
- Action: CacheGen automatically selects per-layer quantization bins based on model family.
- Value: K-cache: 32 bins (layers 0-10), 16 bins (layers 10+). V-cache: 32 bins (layers 0-2), 16 bins (layers 2+).
- Trade-off: More aggressive quantization in later layers and V-cache reduces storage by ~50% but introduces quantization noise. Quality impact varies by model.
- Chunk limit: GPU processing capped at 256 tokens per chunk.
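The bin schedule above implies concrete average bit widths under a simple fixed-length-coding assumption (`log2(bins)` bits per value; CacheGen's arithmetic coder will do better in practice, so these are upper bounds):

```python
import math

def avg_bits(specs, nlayers):
    # log2(bins) bits per value under fixed-length coding; each spec is
    # (start_layer, end_layer, bins) with a half-open layer range.
    total = sum(math.log2(bins) * (end - start) for start, end, bins in specs)
    return total / nlayers

# 7B-family schedule from this note (32 layers).
k_bits = avg_bits([(0, 10, 32), (10, 32, 16)], 32)  # (10*5 + 22*4) / 32
v_bits = avg_bits([(0, 2, 32), (2, 32, 16)], 32)    # (2*5 + 30*4) / 32
```

This works out to about 4.31 bits per K value and 4.06 bits per V value, versus a flat 5 bits if every layer kept 32 bins.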
Reasoning
Early transformer layers (0-10 for K, 0-2 for V) capture more fundamental features and are more sensitive to quantization loss. Later layers produce more redundant representations that tolerate coarser quantization. V-cache is more aggressive because value projections contain less positional information than key projections. The layer-10 threshold for K-cache and layer-2 threshold for V-cache were empirically determined for 7B-class models. The GLM-4-9b-chat model (40 layers) uses the same thresholds but is explicitly marked as needing further quality tuning.
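The sensitivity trade-off can be seen with a toy uniform quantizer, a deliberately simplified stand-in for CacheGen's actual scheme: halving the bin count roughly doubles the round-trip error, which is tolerable only where representations are redundant.

```python
import random

def quantize_roundtrip(x, bins):
    """Uniformly quantize values into `bins` levels, then dequantize."""
    lo, hi = min(x), max(x)
    scale = (hi - lo) / (bins - 1)
    return [round((v - lo) / scale) * scale + lo for v in x]

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(4096)]
err32 = sum(abs(q - v) for q, v in zip(quantize_roundtrip(x, 32), x)) / len(x)
err16 = sum(abs(q - v) for q, v in zip(quantize_roundtrip(x, 16), x)) / len(x)
# 16 bins produces a noticeably larger mean absolute error than 32 bins.
```

Dropping from 32 to 16 bins in later layers and in the V-cache accepts exactly this kind of error where the heuristic says the model can absorb it.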
Code Evidence
Model-family quantization from `lmcache/storage_backend/serde/cachegen_basics.py:42-86`:
```python
family_7b = ["mistralai/Mistral-7B-Instruct-v0.2", "lmsys/longchat-7b-16k", "Qwen/Qwen-7B"]
family_8b = ["meta-llama/Llama-3.1-8B-Instruct"]
family_9b = ["THUDM/glm-4-9b-chat"]

if model_name in family_7b:
    return CacheGenConfig(
        nlayers=32,
        kspecs=[
            QuantizationSpec(start_layer=0, end_layer=10, bins=32),
            QuantizationSpec(start_layer=10, end_layer=32, bins=16),
        ],
        vspecs=[
            QuantizationSpec(start_layer=0, end_layer=2, bins=32),
            QuantizationSpec(start_layer=2, end_layer=32, bins=16),
        ],
    )
```
GPU chunk size limit from `lmcache/storage_backend/serde/cachegen_basics.py:18`:
```python
CACHEGEN_GPU_MAX_TOKENS_PER_CHUNK = 256
```
Quality tuning TODO from `lmcache/storage_backend/serde/cachegen_basics.py:74`:
```python
# TODO(Jiayi): needs tuning for better quality
elif model_name in family_9b:
```