
Heuristic:InternLM Lmdeploy KV Quantization Tradeoffs

From Leeroopedia



Knowledge Sources
Domains: Optimization, Quantization
Last Updated: 2026-02-07 15:00 GMT

Overview

Trade-off analysis for KV cache quantization policies (INT4 vs INT8 vs FP16), offering 2x-4x cache capacity gains with measured accuracy and throughput impacts.

Description

KV cache quantization compresses the key-value pairs stored during autoregressive generation, allowing more sequences to fit in GPU memory. LMDeploy supports three quantization policies: `quant_policy=0` (FP16, no quantization), `quant_policy=8` (INT8), and `quant_policy=4` (INT4). This is online quantization: it happens at runtime during inference and requires no offline calibration step.

Usage

Use this heuristic when:

  • You need to increase concurrent request capacity without adding GPUs.
  • You are VRAM-constrained and need more KV cache blocks.
  • You want to understand the accuracy vs throughput trade-off before deploying.
  • You are choosing between INT4 and INT8 for your production workload.

The Insight (Rule of Thumb)

  • For production with accuracy sensitivity: Use `quant_policy=8` (INT8).
    • Cache capacity: 2x more KV blocks vs FP16.
    • Accuracy: Preserved (negligible loss).
    • Throughput: ~30% RPS (requests per second) improvement from handling more concurrent sequences.
  • For maximum throughput with acceptable accuracy loss: Use `quant_policy=4` (INT4).
    • Cache capacity: 4x more KV blocks vs FP16.
    • Accuracy: Slight loss (acceptable for most use cases).
    • Throughput: ~40% RPS improvement.
  • Platform constraint: KV quantization only works on CUDA and Ascend devices. Other platforms must use `quant_policy=0`.
  • Block size constraint: Cambricon (camb) devices require `block_size=16` (auto-enforced).

Reasoning

The KV cache is the primary memory bottleneck in LLM serving. For a 7B model with 32 layers, 32 KV heads, and 128-dim head size, each token in the cache requires:

`32 layers * 32 KV heads * 128 dims * 2 (K and V) * 2 bytes (FP16) = 524,288 bytes (~0.5MB) per token`

With INT8 quantization, this halves to ~0.25MB per token. With INT4, it quarters to ~0.125MB per token. On an A100-80GB, this can mean the difference between supporting 200 concurrent 2K-length sequences vs 800 concurrent sequences.
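The arithmetic above can be checked with a few lines of Python. This is a sketch using the model shape from the paragraph (32 layers, 32 KV heads, 128-dim heads); the function name is illustrative:

```python
def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, bytes_per_elem=2.0):
    # Keys and values: two tensors of shape (layers, kv_heads, head_dim) per token.
    return layers * kv_heads * head_dim * 2 * bytes_per_elem

fp16_bytes = kv_bytes_per_token(bytes_per_elem=2.0)  # FP16: 2 bytes/element
int8_bytes = kv_bytes_per_token(bytes_per_elem=1.0)  # INT8: 1 byte/element
int4_bytes = kv_bytes_per_token(bytes_per_elem=0.5)  # INT4: 2 elements packed per byte
```

This yields 524,288 bytes (~0.5MB) per token at FP16, halving to ~0.25MB at INT8 and quartering to ~0.125MB at INT4, matching the figures above.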

Code evidence from `lmdeploy/messages.py:435-441`:

if self.quant_policy > 0 and self.device_type not in ['cuda', 'ascend']:
    assert False, \
           'kv cache quantization only works for CUDA and ASCEND.'
if self.device_type == 'camb' and self.block_size != 16:
    self.block_size = 16
    logger.warning('Currently, camb device requires block size to be 16, '
                   'setting block size to 16')

Configuration example:

from lmdeploy import pipeline, TurbomindEngineConfig

# INT8 KV quantization (recommended for production)
backend_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)

# INT4 KV quantization (maximum throughput)
backend_config = TurbomindEngineConfig(quant_policy=4)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)
