Implementation:InternLM Lmdeploy AttentionQuantization
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Comprehensive KV cache quantization library providing type conversion kernels, min/max statistics, and dequantization for uint8, uint4, fp8_e4m3, and fp4_e2m1 formats across half, bfloat16, and float compute types.
Description
This header implements the quantization and dequantization infrastructure for KV cache compression. Key components include:
- Type converters:
cvt_f16x4_u8,cvt_bf16x4_u8,cvt_f32x4_u8for uint8-to-float conversion using bit manipulation and magic number techniques;cvt_bf16x8_u4/cvt_f16x8_u4for uint4 usinglop3.b32inline PTX;cvt_bf16x8_e2m1/cvt_f16x8_e2m1for fp4;cvt_bf16x4_e4m3/cvt_f16x4_e4m3for fp8. - Quantization:
warp_statscomputes per-token min/max via warp-level shuffle reductions,quantizeapplies scale/zero-point quantization, androunduses PTXcvt.rnifor banker's rounding. - ConvertKvCache: Template struct family with specializations for all (storage, compute) type pairs. Each provides a
convertstatic method for raw dequantization and anoperator()that applies scale and zero-point. The identity specialization (ConvertKvCache<T,T>) is a pass-through. - StoreQuantParam: Helper for writing quantization parameters with optional fused dequant preprocessing for uint4+fp16.
Usage
Used throughout the attention kernel implementations (Impl_81616, Impl_Simt) for on-the-fly KV cache dequantization, and by AttentionUniversal::Prologue for quantizing new K/V entries into the cache.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/quantization.h
- Lines: 1-888
Signature
namespace turbomind {
// Type conversion helpers
inline __device__ Array<half, 4> cvt_f16x4_u8(const Array<uint8_t, 4>& src);
inline __device__ Array<nv_bfloat16, 4> cvt_bf16x4_u8(const Array<uint8_t, 4>& src);
inline __device__ Array<float, 4> cvt_f32x4_u8(const Array<uint8_t, 4>& src);
template<bool norm>
inline __device__ Array<nv_bfloat16, 8> cvt_bf16x8_u4(const Array<uint4_t, 8>& src);
// Rounding
template<class T> inline __device__ T round(float x);
template<class T> inline __device__ T round(half x);
// Warp-level statistics
template<int WarpThreadC, class T, int C>
__device__ void warp_minmax(Array<T,2>& stats, const Array<T,C>& x);
template<int WarpThreadC, class P, class T, class B, int N, int C, int S>
__device__ void warp_stats(Array<P,2> (¶m)[S], const Array<T,N> (&x)[S][C], B n_bits);
// KV cache converter (generic, identity, u8, u4, fp8, fp4 specializations)
template<typename Ti, typename To, typename = void>
struct ConvertKvCache {
template<int N>
__device__ static auto convert(const Array<Ti, N>& vi);
template<int N>
__device__ auto operator()(const Array<Ti, N>& vi) const;
};
template<class Q, class T>
inline __device__ void StoreQuantParam(T* dst, Array<T, 2> src);
} // namespace turbomind
Import
#include "src/turbomind/kernels/attention/quantization.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| src | Array<Tkv, N> | Yes | Quantized KV cache data |
| scale | T | Yes (for quant types) | Per-token quantization scale |
| zero | T | Yes (for quant types) | Per-token quantization zero point |
| n_bits | int | Yes (for quantize) | Quantization bit width (4 or 8) |
Outputs
| Name | Type | Description |
|---|---|---|
| convert() result | Array<To, N> | Dequantized values (without scale/zero) |
| operator() result | Array<To, N> | Fully dequantized values (with scale/zero) |
| warp_stats output | Array<P,2>[S] | Per-token (scale, zero) parameters |
Usage Examples
// Dequantize INT8 KV data
ConvertKvCache<uint8_t, half> converter(scale, zero);
Array<half, 8> dequantized = converter(quantized_data);
// Compute quantization parameters
Array<half, 2> params[1];
warp_stats<8>(params, kv_data, std::integral_constant<int, 8>{});