Implementation:InternLM Lmdeploy DecodingConfig

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Compile-time configuration traits that select MMA instruction types, quantized KV support, GQA grouping, and pipeline stages for decoding attention kernels across SM70, SM75, and SM80 GPU architectures.

Description

DecodingConfig<Arch, T, Tkv, Qh, HeadDim> provides kernel type assembly for the decoding phase, where CTA_Q=1 and multiple query heads (Qh) may be processed per CTA. The selection logic is:

SM80, Qh <= 2, same-type KV: Uses MMA_SIMT with 3-stage pipeline.
SM80, Qh > 2, same-type KV: Uses MMA_81616 (tensor core transposed layout) with 3-stage pipeline. Qh is rounded up to the nearest multiple of 8.
SM80, uint8_t KV: Uses MMA_81616 with 5-stage pipeline when HeadDim != 192. With HeadDim = 192, uint8_t KV falls back to MMA_SIMT.
SM80, uint4_t KV: Uses MMA_81616 with 5-stage pipeline.
SM75: Uses MMA_81616 with 2-stage pipeline for all configurations.
SM70: Uses MMA_SIMT with 2-stage pipeline and kH capped at 3 (Qh >= 4 is not beneficial on Volta).

All configurations use block-paged KV cache and DecodingCtaMap.
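
The selection rules above can be restated as a small constexpr function. This is only a sketch for reading the rules at a glance: the enums, the Choice struct, select_decoding, round_up_qh, and kNotStated are hypothetical names that do not exist in lmdeploy, and the real selection happens through template specialization rather than a function call. It assumes C++14 or later.

```cpp
// Hypothetical enums for illustration; these are not lmdeploy types.
enum class Arch { Sm70, Sm75, Sm80 };
enum class Kv { SameAsT, U8, U4 };
enum class Mma { SIMT, MMA_81616 };

// Placeholder for cases where this page does not state a stage count.
constexpr int kNotStated = 0;

struct Choice {
    Mma mma;     // MMA instruction family
    int stages;  // software pipeline depth
};

// Mirrors the rule list above, one branch per rule.
constexpr Choice select_decoding(Arch arch, Kv kv, int qh, int head_dim)
{
    if (arch == Arch::Sm70) {
        return {Mma::SIMT, 2};
    }
    if (arch == Arch::Sm75) {
        return {Mma::MMA_81616, 2};
    }
    // Remaining case: Arch::Sm80.
    if (kv == Kv::SameAsT) {
        return qh <= 2 ? Choice{Mma::SIMT, 3} : Choice{Mma::MMA_81616, 3};
    }
    if (kv == Kv::U8 && head_dim == 192) {
        return {Mma::SIMT, kNotStated};  // fallback path; stage count not given here
    }
    return {Mma::MMA_81616, 5};  // uint8_t (HeadDim != 192) or uint4_t KV
}

// Rounding applied to Qh on the SM80 tensor core path with same-type KV.
constexpr int round_up_qh(int qh)
{
    return (qh + 7) / 8 * 8;
}
```

For example, select_decoding(Arch::Sm80, Kv::U8, 4, 128) yields MMA_81616 with 5 stages, and round_up_qh(5) yields 8.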

Usage

Used by the decoding dispatch layer to obtain the correct kernel type. The Decoding<Arch, T, Tkv, Qh, HeadDim> alias directly yields the Kernel type.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/decoding_config.h
Lines: 1-89

Signature

namespace turbomind::attention {

template<class Arch, class T, class Tkv, int Qh, int HeadDim, class SFINAE = void>
struct DecodingConfig {
    static_assert(sizeof(T) == 0, "config not found");
};

template<class Arch, class T, class Tkv, int Qh, int HeadDim>
using Decoding = typename DecodingConfig<Arch, T, Tkv, Qh, HeadDim>::Kernel;

// SM80: Qh <= 2, Tkv == T (unquantized KV) -> SIMT, 3-stage
template<class T, int Qh, int HeadDim>
struct DecodingConfig<arch::Sm80, T, T, Qh, HeadDim, enable_if_t<!(Qh > 2)>> { ... };

// SM80: Qh > 2, Tkv == T (unquantized KV) -> MMA_81616, 3-stage
template<class T, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm80, T, T, Qh_, HeadDim, enable_if_t<(Qh_ > 2)>> { ... };

// SM80: uint8_t KV, HeadDim != 192 -> MMA_81616, 5-stage
template<class T, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm80, T, uint8_t, Qh_, HeadDim, enable_if_t<(HeadDim != 192)>> { ... };

// SM80: uint4_t KV -> MMA_81616, 5-stage
template<class T, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm80, T, uint4_t, Qh_, HeadDim> { ... };

// SM75: all -> MMA_81616, 2-stage
template<class T, class Tkv, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm75, T, Tkv, Qh_, HeadDim> { ... };

// SM70: all -> SIMT, 2-stage (kH capped at 3)
template<class T, class Tkv, int Qh, int HeadDim>
struct DecodingConfig<arch::Sm70, T, Tkv, Qh, HeadDim> { ... };

} // namespace turbomind::attention
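
The bodies above are elided, so the following is a reduced, self-contained sketch of the SFINAE pattern the signatures rely on, not the actual lmdeploy code. The sketch namespace, SimtTag, TensorCoreTag, and kStages are hypothetical stand-ins, and only the two same-type SM80 specializations are modeled. It assumes C++14 for std::enable_if_t.

```cpp
#include <type_traits>

namespace sketch {

namespace arch {
struct Sm80 {};  // stand-in architecture tag
}  // namespace arch

struct SimtTag {};        // stand-in for the SIMT implementation
struct TensorCoreTag {};  // stand-in for the MMA_81616 implementation

// Primary template: any combination without a matching specialization
// fails to compile with a readable message.
template<class Arch, class T, class Tkv, int Qh, int HeadDim, class SFINAE = void>
struct DecodingConfig {
    static_assert(sizeof(T) == 0, "config not found");
};

// Same-type KV, Qh <= 2: chosen when the enable_if condition holds.
template<class T, int Qh, int HeadDim>
struct DecodingConfig<arch::Sm80, T, T, Qh, HeadDim, std::enable_if_t<!(Qh > 2)>> {
    using Attention = SimtTag;
    static constexpr int kStages = 3;
};

// Same-type KV, Qh > 2: the requested Qh_ is rounded up to a multiple of 8.
template<class T, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm80, T, T, Qh_, HeadDim, std::enable_if_t<(Qh_ > 2)>> {
    using Attention = TensorCoreTag;
    static constexpr int kStages = 3;
    static constexpr int Qh = (Qh_ + 7) / 8 * 8;
};

}  // namespace sketch
```

Because the two enable_if conditions are mutually exclusive, exactly one specialization is viable for any Qh, so the choice is never ambiguous. In this sketch, sketch::DecodingConfig<sketch::arch::Sm80, float, float, 5, 128>::Qh evaluates to 8.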

Import

#include "src/turbomind/kernels/attention/decoding_config.h"

I/O Contract

Inputs

Name	Type	Required	Description
Arch	typename	Yes	GPU architecture (arch::Sm70, arch::Sm75, arch::Sm80)
T	typename	Yes	Compute type (half, bfloat16)
Tkv	typename	Yes	KV cache storage type (same as T for an unquantized cache, or uint8_t / uint4_t for a quantized cache)
Qh	int	Yes	Number of query heads per KV head group (GQA ratio)
HeadDim	int	Yes	Attention head dimension

Outputs

Name	Type	Description
Kernel	typename	Fully assembled decoding kernel type for invokeDecoding
Attention	typename	Selected Impl specialization
CacheIter	typename	Block iterator factory type

Usage Examples

// INT8 quantized decoding on SM80 with GQA ratio 4
using Kernel = Decoding<arch::Sm80, half, uint8_t, 4, 128>;
invokeDecoding<Kernel>(params);

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
