Implementation:InternLM Lmdeploy AttentionQuantization

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Comprehensive KV cache quantization library providing type conversion kernels, min/max statistics, and dequantization for uint8, uint4, fp8_e4m3, and fp4_e2m1 formats across half, bfloat16, and float compute types.

Description

This header implements the quantization and dequantization infrastructure for KV cache compression. Key components include:

Type converters: cvt_f16x4_u8, cvt_bf16x4_u8, cvt_f32x4_u8 for uint8-to-float conversion using bit manipulation and magic number techniques; cvt_bf16x8_u4/cvt_f16x8_u4 for uint4 using lop3.b32 inline PTX; cvt_bf16x8_e2m1/cvt_f16x8_e2m1 for fp4; cvt_bf16x4_e4m3/cvt_f16x4_e4m3 for fp8.
Quantization: warp_stats computes per-token min/max via warp-level shuffle reductions, quantize applies scale/zero-point quantization, and round uses PTX cvt.rni for banker's rounding.
ConvertKvCache: Template struct family with specializations for all (storage, compute) type pairs. Each provides a convert static method for raw dequantization and an operator() that applies scale and zero-point. The identity specialization (ConvertKvCache<T,T>) is a pass-through.
StoreQuantParam: Helper for writing quantization parameters with optional fused dequant preprocessing for uint4+fp16.

Usage

Used throughout the attention kernel implementations (Impl_81616, Impl_Simt) for on-the-fly KV cache dequantization, and by AttentionUniversal::Prologue for quantizing new K/V entries into the cache.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/quantization.h
Lines: 1-888

Signature

namespace turbomind {

// Type conversion helpers
inline __device__ Array<half, 4> cvt_f16x4_u8(const Array<uint8_t, 4>& src);
inline __device__ Array<nv_bfloat16, 4> cvt_bf16x4_u8(const Array<uint8_t, 4>& src);
inline __device__ Array<float, 4> cvt_f32x4_u8(const Array<uint8_t, 4>& src);

template<bool norm>
inline __device__ Array<nv_bfloat16, 8> cvt_bf16x8_u4(const Array<uint4_t, 8>& src);

// Rounding
template<class T> inline __device__ T round(float x);
template<class T> inline __device__ T round(half x);

// Warp-level statistics
template<int WarpThreadC, class T, int C>
__device__ void warp_minmax(Array<T,2>& stats, const Array<T,C>& x);

template<int WarpThreadC, class P, class T, class B, int N, int C, int S>
__device__ void warp_stats(Array<P,2> (&param)[S], const Array<T,N> (&x)[S][C], B n_bits);

// KV cache converter (generic, identity, u8, u4, fp8, fp4 specializations)
template<typename Ti, typename To, typename = void>
struct ConvertKvCache {
    template<int N>
    __device__ static auto convert(const Array<Ti, N>& vi);
    template<int N>
    __device__ auto operator()(const Array<Ti, N>& vi) const;
};

template<class Q, class T>
inline __device__ void StoreQuantParam(T* dst, Array<T, 2> src);

} // namespace turbomind

Import

#include "src/turbomind/kernels/attention/quantization.h"

I/O Contract

Inputs

Name	Type	Required	Description
src	Array<Tkv, N>	Yes	Quantized KV cache data
scale	T	Yes (for quant types)	Per-token quantization scale
zero	T	Yes (for quant types)	Per-token quantization zero point
n_bits	int	Yes (for quantize)	Quantization bit width (4 or 8)

Outputs

Name	Type	Description
convert() result	Array<To, N>	Dequantized values (without scale/zero)
operator() result	Array<To, N>	Fully dequantized values (with scale/zero)
warp_stats output	Array<P,2>[S]	Per-token (scale, zero) parameters

Usage Examples

// Dequantize INT8 KV data
ConvertKvCache<uint8_t, half> converter(scale, zero);
Array<half, 8> dequantized = converter(quantized_data);

// Compute quantization parameters
Array<half, 2> params[1];
warp_stats<8>(params, kv_data, std::integral_constant<int, 8>{});

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment