Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy AttentionQuantization

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, Attention
Last Updated 2026-02-07 15:00 GMT

Overview

Comprehensive KV cache quantization library providing type conversion kernels, min/max statistics, and dequantization for uint8, uint4, fp8_e4m3, and fp4_e2m1 formats across half, bfloat16, and float compute types.

Description

This header implements the quantization and dequantization infrastructure for KV cache compression. Key components include:

  • Type converters: cvt_f16x4_u8, cvt_bf16x4_u8, cvt_f32x4_u8 for uint8-to-float conversion using bit manipulation and magic number techniques; cvt_bf16x8_u4/cvt_f16x8_u4 for uint4 using lop3.b32 inline PTX; cvt_bf16x8_e2m1/cvt_f16x8_e2m1 for fp4; cvt_bf16x4_e4m3/cvt_f16x4_e4m3 for fp8.
  • Quantization: warp_stats computes per-token min/max via warp-level shuffle reductions, quantize applies scale/zero-point quantization, and round uses PTX cvt.rni for banker's rounding.
  • ConvertKvCache: Template struct family with specializations for all (storage, compute) type pairs. Each provides a convert static method for raw dequantization and an operator() that applies scale and zero-point. The identity specialization (ConvertKvCache<T,T>) is a pass-through.
  • StoreQuantParam: Helper for writing quantization parameters with optional fused dequant preprocessing for uint4+fp16.

Usage

Used throughout the attention kernel implementations (Impl_81616, Impl_Simt) for on-the-fly KV cache dequantization, and by AttentionUniversal::Prologue for quantizing new K/V entries into the cache.

Code Reference

Source Location

Signature

namespace turbomind {

// Type conversion helpers
inline __device__ Array<half, 4> cvt_f16x4_u8(const Array<uint8_t, 4>& src);
inline __device__ Array<nv_bfloat16, 4> cvt_bf16x4_u8(const Array<uint8_t, 4>& src);
inline __device__ Array<float, 4> cvt_f32x4_u8(const Array<uint8_t, 4>& src);

template<bool norm>
inline __device__ Array<nv_bfloat16, 8> cvt_bf16x8_u4(const Array<uint4_t, 8>& src);

// Rounding
template<class T> inline __device__ T round(float x);
template<class T> inline __device__ T round(half x);

// Warp-level statistics
template<int WarpThreadC, class T, int C>
__device__ void warp_minmax(Array<T,2>& stats, const Array<T,C>& x);

template<int WarpThreadC, class P, class T, class B, int N, int C, int S>
__device__ void warp_stats(Array<P,2> (&param)[S], const Array<T,N> (&x)[S][C], B n_bits);

// KV cache converter (generic, identity, u8, u4, fp8, fp4 specializations)
template<typename Ti, typename To, typename = void>
struct ConvertKvCache {
    template<int N>
    __device__ static auto convert(const Array<Ti, N>& vi);
    template<int N>
    __device__ auto operator()(const Array<Ti, N>& vi) const;
};

template<class Q, class T>
inline __device__ void StoreQuantParam(T* dst, Array<T, 2> src);

} // namespace turbomind

Import

#include "src/turbomind/kernels/attention/quantization.h"

I/O Contract

Inputs

Name Type Required Description
src Array<Tkv, N> Yes Quantized KV cache data
scale T Yes (for quant types) Per-token quantization scale
zero T Yes (for quant types) Per-token quantization zero point
n_bits int Yes (for quantize) Quantization bit width (4 or 8)

Outputs

Name Type Description
convert() result Array<To, N> Dequantized values (without scale/zero)
operator() result Array<To, N> Fully dequantized values (with scale/zero)
warp_stats output Array<P,2>[S] Per-token (scale, zero) parameters

Usage Examples

// Dequantize INT8 KV data
ConvertKvCache<uint8_t, half> converter(scale, zero);
Array<half, 8> dequantized = converter(quantized_data);

// Compute quantization parameters
Array<half, 2> params[1];
warp_stats<8>(params, kv_data, std::integral_constant<int, 8>{});

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment