Implementation:InternLM Lmdeploy DecodingConfig
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Compile-time configuration traits that select MMA instruction types, quantized KV support, GQA grouping, and pipeline stages for decoding attention kernels across SM70, SM75, and SM80 GPU architectures.
Description
DecodingConfig<Arch, T, Tkv, Qh, HeadDim> provides kernel type assembly for the decoding phase, where CTA_Q=1 and multiple query heads (Qh) may be processed per CTA. The selection logic is:
- SM80, Qh <= 2, same-type KV: Uses MMA_SIMT with 3-stage pipeline.
- SM80, Qh > 2, same-type KV: Uses MMA_81616 (tensor core transposed layout) with 3-stage pipeline. Qh is rounded up to the nearest multiple of 8.
- SM80, uint8_t KV: Uses MMA_81616 with 5-stage pipeline (HeadDim != 192). HeadDim=192 with uint8 falls back to MMA_SIMT.
- SM80, uint4_t KV: Uses MMA_81616 with 5-stage pipeline.
- SM75: Uses MMA_81616 with 2-stage pipeline for all configurations.
- SM70: Uses MMA_SIMT with kH capped at 3 (Qh >= 4 is not beneficial on Volta).
All configurations use block-paged KV cache and DecodingCtaMap.
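The selection rules above can be restated as a small decision function. This is a minimal sketch for illustration only: the enums, the Choice struct, and select_decoding are hypothetical names that do not exist in TurboMind, where the choice is made through template specialization rather than a function. The SM70 pipeline depth is taken from the comment in the Signature section, the depth of the HeadDim=192 uint8 fallback is not stated on this page and is marked as such, and the SM70 kH cap is not modeled.

```cpp
// Hypothetical stand-ins for illustration; none of these names exist in TurboMind.
enum class Arch { Sm70, Sm75, Sm80 };
enum class KvType { SameAsT, UInt8, UInt4 };
enum class Mma { Simt, M81616 };

constexpr int kNotStated = 0;  // marks a pipeline depth this page does not specify

struct Choice {
    Mma mma;
    int stages;
};

// Restates the documented selection rules as plain control flow.
constexpr Choice select_decoding(Arch arch, KvType kv, int qh, int head_dim)
{
    if (arch == Arch::Sm70) {
        return {Mma::Simt, 2};  // SIMT for every configuration
    }
    if (arch == Arch::Sm75) {
        return {Mma::M81616, 2};  // tensor cores for every configuration
    }
    // Remaining case: Arch::Sm80.
    if (kv == KvType::UInt4) {
        return {Mma::M81616, 5};
    }
    if (kv == KvType::UInt8) {
        // HeadDim == 192 falls back to SIMT; its stage count is not documented here.
        return head_dim != 192 ? Choice{Mma::M81616, 5} : Choice{Mma::Simt, kNotStated};
    }
    // Same-type KV: the query-head count decides between SIMT and tensor cores.
    return qh <= 2 ? Choice{Mma::Simt, 3} : Choice{Mma::M81616, 3};
}

// Compile-time spot checks against the rules listed above.
static_assert(select_decoding(Arch::Sm80, KvType::SameAsT, 2, 128).mma == Mma::Simt, "Qh <= 2 uses SIMT");
static_assert(select_decoding(Arch::Sm80, KvType::UInt8, 4, 128).stages == 5, "uint8 KV uses 5 stages");
```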
Usage
Used by the decoding dispatch layer to obtain the correct kernel type. The Decoding<Arch, T, Tkv, Qh, HeadDim> alias directly yields the Kernel type.
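This page does not show the dispatch layer itself. The following is a hedged sketch of one common way a runtime GQA ratio can be mapped onto compile-time Qh instantiations. FakeKernel, invoke_fake, dispatch_by_group_size, and the supported set {1, 2, 4, 8} are all hypothetical and chosen only for illustration; the functions are constexpr solely so the mapping can be checked at compile time.

```cpp
#include <stdexcept>

namespace dispatch_sketch {

// Hypothetical stand-in for Decoding<Arch, T, Tkv, Qh, HeadDim>; it records only Qh.
template<int Qh>
struct FakeKernel {
    static constexpr int kQh = Qh;
};

// Hypothetical stand-in for invokeDecoding<Kernel>(params); returns the Qh it was built for.
template<class Kernel>
constexpr int invoke_fake()
{
    return Kernel::kQh;
}

// Maps a runtime group size onto one of a fixed set of template instantiations.
// The supported set below is an assumption, not the set TurboMind instantiates.
constexpr int dispatch_by_group_size(int group_size)
{
    switch (group_size) {
        case 1: return invoke_fake<FakeKernel<1>>();
        case 2: return invoke_fake<FakeKernel<2>>();
        case 4: return invoke_fake<FakeKernel<4>>();
        case 8: return invoke_fake<FakeKernel<8>>();
        default: throw std::invalid_argument("no kernel instantiated for this group size");
    }
}

}  // namespace dispatch_sketch
```

Each case instantiates a distinct kernel type, so the set of supported ratios directly controls how many kernels are compiled.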
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/decoding_config.h
- Lines: 1-89
Signature
namespace turbomind::attention {
template<class Arch, class T, class Tkv, int Qh, int HeadDim, class SFINAE = void>
struct DecodingConfig {
static_assert(sizeof(T) == 0, "config not found");
};
template<class Arch, class T, class Tkv, int Qh, int HeadDim>
using Decoding = typename DecodingConfig<Arch, T, Tkv, Qh, HeadDim>::Kernel;
// SM80: Qh <= 2, same-type KV (Tkv == T) -> SIMT, 3-stage
template<class T, int Qh, int HeadDim>
struct DecodingConfig<arch::Sm80, T, T, Qh, HeadDim, enable_if_t<!(Qh > 2)>> { ... };
// SM80: Qh > 2, same-type KV (Tkv == T) -> MMA_81616, 3-stage
template<class T, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm80, T, T, Qh_, HeadDim, enable_if_t<(Qh_ > 2)>> { ... };
// SM80: uint8_t KV, HeadDim != 192 -> MMA_81616, 5-stage
template<class T, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm80, T, uint8_t, Qh_, HeadDim, enable_if_t<(HeadDim != 192)>> { ... };
// SM80: uint4_t KV -> MMA_81616, 5-stage
template<class T, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm80, T, uint4_t, Qh_, HeadDim> { ... };
// SM75: all -> MMA_81616, 2-stage
template<class T, class Tkv, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm75, T, Tkv, Qh_, HeadDim> { ... };
// SM70: all -> SIMT, 2-stage (kH capped at 3)
template<class T, class Tkv, int Qh, int HeadDim>
struct DecodingConfig<arch::Sm70, T, Tkv, Qh, HeadDim> { ... };
} // namespace turbomind::attention
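The primary template and the two same-type SM80 specializations rely on two standard techniques: a static_assert on a dependent expression that fires only if no specialization matches, and a trailing SFINAE parameter that makes the Qh ranges mutually exclusive. A minimal, self-contained sketch of that pattern follows. Config, Sm80, SimtTag, TensorCoreTag, kStages, and kQh are hypothetical stand-ins, and float replaces half so the sketch compiles without CUDA headers.

```cpp
#include <type_traits>

namespace sketch {

struct Sm80 {};           // hypothetical stand-in for arch::Sm80
struct SimtTag {};        // hypothetical stand-in for MMA_SIMT
struct TensorCoreTag {};  // hypothetical stand-in for MMA_81616

// Primary template: sizeof(T) == 0 depends on T, so the assertion is evaluated
// only when this template is actually instantiated, i.e. when nothing else matches.
template<class Arch, class T, class Tkv, int Qh, int HeadDim, class SFINAE = void>
struct Config {
    static_assert(sizeof(T) == 0, "config not found");
};

// Matches only when Tkv == T and Qh <= 2.
template<class T, int Qh, int HeadDim>
struct Config<Sm80, T, T, Qh, HeadDim, std::enable_if_t<!(Qh > 2)>> {
    using Mma = SimtTag;
    static constexpr int kStages = 3;
};

// Matches only when Tkv == T and Qh > 2; Qh is padded to a multiple of 8.
template<class T, int Qh_, int HeadDim>
struct Config<Sm80, T, T, Qh_, HeadDim, std::enable_if_t<(Qh_ > 2)>> {
    using Mma = TensorCoreTag;
    static constexpr int kQh     = (Qh_ + 7) / 8 * 8;
    static constexpr int kStages = 3;
};

}  // namespace sketch
```

Because the two enable_if_t conditions are complements, exactly one specialization is viable for any Qh, so partial ordering never has to break a tie.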
Import
#include "src/turbomind/kernels/attention/decoding_config.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| Arch | typename | Yes | GPU architecture (arch::Sm70, arch::Sm75, arch::Sm80) |
| T | typename | Yes | Compute type (half, bfloat16) |
| Tkv | typename | Yes | KV cache storage type (same as T for an unquantized cache, or uint8_t / uint4_t for a quantized cache) |
| Qh | int | Yes | Number of query heads per KV head group (GQA ratio) |
| HeadDim | int | Yes | Attention head dimension |
Outputs
| Name | Type | Description |
|---|---|---|
| Kernel | typename | Fully assembled decoding kernel type for invokeDecoding |
| Attention | typename | Selected Impl specialization |
| CacheIter | typename | Block iterator factory type |
Usage Examples
// INT8 quantized decoding on SM80 with GQA ratio 4
using Kernel = Decoding<arch::Sm80, half, uint8_t, 4, 128>;
invokeDecoding<Kernel>(params);