Heuristic: InternLM LMDeploy KV Quantization Trade-offs
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Quantization |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Trade-off analysis of LMDeploy's KV cache quantization policies (INT4 vs. INT8 vs. FP16), which offer 2x-4x cache capacity gains with measurable accuracy and throughput impacts.
Description
KV cache quantization compresses the key-value pairs stored during autoregressive generation, allowing more sequences to fit in GPU memory. LMDeploy supports three quantization policies: `quant_policy=0` (FP16, no quantization), `quant_policy=8` (INT8), and `quant_policy=4` (INT4). Quantization is performed online, at runtime during inference, so no offline calibration step is required.
Usage
Use this heuristic when:
- You need to increase concurrent request capacity without adding GPUs.
- You are VRAM-constrained and need more KV cache blocks.
- You want to understand the accuracy vs throughput trade-off before deploying.
- You are choosing between INT4 and INT8 for your production workload.
The Insight (Rule of Thumb)
- For production with accuracy sensitivity: use `quant_policy=8` (INT8).
  - Cache capacity: 2x more KV blocks than FP16.
  - Accuracy: preserved (negligible loss).
  - Throughput: ~30% higher RPS (requests per second) from handling more concurrent sequences.
- For maximum throughput with acceptable accuracy loss: use `quant_policy=4` (INT4).
  - Cache capacity: 4x more KV blocks than FP16.
  - Accuracy: slight loss (acceptable for most use cases).
  - Throughput: ~40% higher RPS.
- Platform constraint: KV quantization only works on CUDA and Ascend devices. Other platforms must use `quant_policy=0`.
- Block size constraint: Cambricon (camb) devices require `block_size=16` (auto-enforced).
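The selection rules above can be captured in a small helper. This is an illustrative sketch; the function name and arguments are not part of LMDeploy:

```python
def pick_quant_policy(device_type: str, accuracy_sensitive: bool) -> int:
    """Return a quant_policy per the rule of thumb above (illustrative helper).

    0 = FP16 (no quantization), 8 = INT8, 4 = INT4.
    """
    if device_type not in ('cuda', 'ascend'):
        return 0  # KV quantization only works on CUDA and Ascend devices
    # INT8 preserves accuracy at 2x capacity; INT4 trades slight accuracy
    # loss for 4x capacity and the highest RPS.
    return 8 if accuracy_sensitive else 4
```

For example, `pick_quant_policy('cuda', accuracy_sensitive=True)` yields `8`, while an unsupported platform always falls back to `0`.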
Reasoning
The KV cache is the primary memory bottleneck in LLM serving. For a 7B model with 32 layers, 32 KV heads, and 128-dim head size, each token in the cache requires:
`32 layers * 32 heads * 128 dims * 2 (K and V) * 2 bytes (FP16) = 524,288 bytes (~0.5 MB) per token`
With INT8 quantization, this halves to ~0.25MB per token. With INT4, it quarters to ~0.125MB per token. On an A100-80GB, this can mean the difference between supporting 200 concurrent 2K-length sequences vs 800 concurrent sequences.
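The arithmetic generalizes to any model shape. A quick back-of-envelope calculator (pure Python; the function names are illustrative, and the concurrency estimate assumes the entire budget goes to KV cache, ignoring weights and activations):

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bits: int = 16) -> int:
    """KV cache bytes per token: one key and one value per (layer, head, dim)."""
    return num_layers * num_kv_heads * head_dim * 2 * bits // 8

# 7B example from the text: 32 layers, 32 KV heads, head dim 128
fp16 = kv_bytes_per_token(32, 32, 128, bits=16)  # 524,288 bytes (~0.5 MB)
int8 = kv_bytes_per_token(32, 32, 128, bits=8)   # 262,144 bytes (~0.25 MB)
int4 = kv_bytes_per_token(32, 32, 128, bits=4)   # 131,072 bytes (~0.125 MB)

def max_sequences(kv_budget_bytes: int, seq_len: int,
                  bytes_per_token: int) -> int:
    """Rough ceiling on concurrent sequences for a fixed-length context."""
    return kv_budget_bytes // (seq_len * bytes_per_token)
```

Because INT4 shrinks the per-token footprint 4x, `max_sequences` with the same budget and sequence length returns 4x the FP16 count.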
Code evidence from `lmdeploy/messages.py:435-441`:
```python
if self.quant_policy > 0 and self.device_type not in ['cuda', 'ascend']:
    assert False, \
        'kv cache quantization only works for CUDA and ASCEND.'
if self.device_type == 'camb' and self.block_size != 16:
    self.block_size = 16
    logger.warning('Currently, camb device requires block size to be 16, '
                   'setting block size to 16')
```
Configuration example:
```python
from lmdeploy import pipeline, TurbomindEngineConfig

# INT8 KV quantization (recommended for production)
backend_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)

# INT4 KV quantization (maximum throughput)
backend_config = TurbomindEngineConfig(quant_policy=4)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
```