
Implementation:InternLM Lmdeploy DecodingConfig

From Leeroopedia


Knowledge Sources
Domains: GPU_Kernels, Attention
Last Updated: 2026-02-07 15:00 GMT

Overview

Compile-time configuration traits that select MMA instruction types, quantized KV support, GQA grouping, and pipeline stages for decoding attention kernels across SM70, SM75, and SM80 GPU architectures.

Description

DecodingConfig<Arch, T, Tkv, Qh, HeadDim> assembles the kernel type for the decoding phase, where CTA_Q = 1 and a single CTA may process multiple query heads (Qh). The selection logic is:

  • SM80, Qh <= 2, same-type KV (Tkv == T): Uses MMA_SIMT with a 3-stage pipeline.
  • SM80, Qh > 2, same-type KV (Tkv == T): Uses MMA_81616 (tensor core, transposed layout) with a 3-stage pipeline. Qh is rounded up to the nearest multiple of 8.
  • SM80, uint8_t KV, HeadDim != 192: Uses MMA_81616 with a 5-stage pipeline. With HeadDim = 192, uint8_t KV falls back to MMA_SIMT.
  • SM80, uint4_t KV: Uses MMA_81616 with a 5-stage pipeline.
  • SM75: Uses MMA_81616 with a 2-stage pipeline for all configurations.
  • SM70: Uses MMA_SIMT with a 2-stage pipeline and kH capped at 3 (Qh >= 4 is not beneficial on Volta).

All configurations use block-paged KV cache and DecodingCtaMap.
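
The selection rules above can be restated as a small compile-time function. This is only an illustrative sketch, not the library's code: the namespace rules_sketch, the enums, Choice, round_up_8, and select_decoding are all hypothetical names introduced here. A stage count of -1 marks a case whose pipeline depth the description does not state.

```cpp
namespace rules_sketch {

// Hypothetical stand-ins for the architecture tags, KV storage kinds, and MMA tags.
enum class Arch { Sm70, Sm75, Sm80 };
enum class Kv { SameAsT, U8, U4 };
enum class Mma { Simt, M81616 };

struct Choice {
    Mma mma;
    int stages;  // -1: pipeline depth not stated in the description
};

// Round Qh up to the nearest multiple of 8 (the SM80 MMA_81616 paths).
constexpr int round_up_8(int qh)
{
    return (qh + 7) / 8 * 8;
}

// Mirrors the bullet list: architecture first, then KV type, Qh, and HeadDim.
constexpr Choice select_decoding(Arch arch, Kv kv, int qh, int head_dim)
{
    if (arch == Arch::Sm70) {
        return {Mma::Simt, 2};
    }
    if (arch == Arch::Sm75) {
        return {Mma::M81616, 2};
    }
    // Remaining case: Sm80.
    if (kv == Kv::SameAsT) {
        return qh <= 2 ? Choice{Mma::Simt, 3} : Choice{Mma::M81616, 3};
    }
    if (kv == Kv::U8 && head_dim == 192) {
        return {Mma::Simt, -1};  // fallback path; stage count not stated
    }
    return {Mma::M81616, 5};  // uint8_t (HeadDim != 192) or uint4_t KV
}

}  // namespace rules_sketch
```

Because everything is constexpr, the table can be checked with static_assert, which is also how the real traits resolve: entirely at compile time.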

Usage

Used by the decoding dispatch layer to obtain the correct kernel type. The Decoding<Arch, T, Tkv, Qh, HeadDim> alias directly yields the Kernel type.
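
A dispatch layer typically has to turn a runtime GQA ratio into a compile-time template argument before it can name a kernel type. The following is a minimal sketch of that pattern under stated assumptions, not the actual lmdeploy dispatcher: FakeKernel stands in for Decoding<...>, launch stands in for a kernel launcher, and the supported Qh values and the HeadDim of 128 are chosen purely for illustration.

```cpp
namespace dispatch_sketch {

// Hypothetical stand-in for Decoding<Arch, T, Tkv, Qh, HeadDim>.
template<int Qh, int HeadDim>
struct FakeKernel {
    static constexpr int kQh      = Qh;
    static constexpr int kHeadDim = HeadDim;
};

// Hypothetical launcher. A real one would launch a CUDA kernel; this one
// only encodes which instantiation was chosen so the result can be checked.
template<class Kernel>
constexpr int launch()
{
    return Kernel::kQh * 1000 + Kernel::kHeadDim;
}

// Maps runtime values to a compile-time instantiation.
// Returns -1 when no instantiation exists for the request.
constexpr int dispatch(int qh, int head_dim)
{
    if (head_dim != 128) {
        return -1;  // only one head dimension is instantiated in this sketch
    }
    switch (qh) {
        case 1: return launch<FakeKernel<1, 128>>();
        case 2: return launch<FakeKernel<2, 128>>();
        case 4: return launch<FakeKernel<4, 128>>();
        case 8: return launch<FakeKernel<8, 128>>();
        default: return -1;
    }
}

}  // namespace dispatch_sketch
```

Each case instantiates one kernel type, so the set of supported ratios directly controls compile time and binary size.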

Code Reference

Source Location

Signature

namespace turbomind::attention {

template<class Arch, class T, class Tkv, int Qh, int HeadDim, class SFINAE = void>
struct DecodingConfig {
    static_assert(sizeof(T) == 0, "config not found");
};

template<class Arch, class T, class Tkv, int Qh, int HeadDim>
using Decoding = typename DecodingConfig<Arch, T, Tkv, Qh, HeadDim>::Kernel;

// SM80: Qh <= 2, same-type KV (Tkv == T) -> SIMT, 3-stage
template<class T, int Qh, int HeadDim>
struct DecodingConfig<arch::Sm80, T, T, Qh, HeadDim, enable_if_t<!(Qh > 2)>> { ... };

// SM80: Qh > 2, same-type KV (Tkv == T) -> MMA_81616, 3-stage
template<class T, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm80, T, T, Qh_, HeadDim, enable_if_t<(Qh_ > 2)>> { ... };

// SM80: uint8_t KV, HeadDim != 192 -> MMA_81616, 5-stage
template<class T, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm80, T, uint8_t, Qh_, HeadDim, enable_if_t<(HeadDim != 192)>> { ... };

// SM80: uint4_t KV -> MMA_81616, 5-stage
template<class T, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm80, T, uint4_t, Qh_, HeadDim> { ... };

// SM75: all -> MMA_81616, 2-stage
template<class T, class Tkv, int Qh_, int HeadDim>
struct DecodingConfig<arch::Sm75, T, Tkv, Qh_, HeadDim> { ... };

// SM70: all -> SIMT, 2-stage (kH capped at 3)
template<class T, class Tkv, int Qh, int HeadDim>
struct DecodingConfig<arch::Sm70, T, Tkv, Qh, HeadDim> { ... };

} // namespace turbomind::attention
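
The signatures above rely on partial specialization plus an enable_if guard in a trailing defaulted parameter. The self-contained toy below shows how that mechanism picks a specialization and rounds Qh. It is a sketch under assumptions: the namespace sfinae_sketch, Config, Sm80, Simt, TensorCore, kQh, and kStages are hypothetical stand-ins, and the real specializations assemble full kernel types rather than these simple members.

```cpp
#include <type_traits>

namespace sfinae_sketch {

struct Sm80 {};        // hypothetical stand-in for arch::Sm80
struct Simt {};        // hypothetical stand-in for MMA_SIMT
struct TensorCore {};  // hypothetical stand-in for MMA_81616

// Primary template: only instantiated when no specialization matches.
// The condition depends on T, so the assert fires at instantiation time only.
template<class Arch, class T, class Tkv, int Qh, class SFINAE = void>
struct Config {
    static_assert(sizeof(T) == 0, "config not found");
};

// Same-type KV (both type arguments are T) and Qh <= 2.
template<class T, int Qh>
struct Config<Sm80, T, T, Qh, std::enable_if_t<!(Qh > 2)>> {
    using Mma = Simt;
    static constexpr int kQh     = Qh;
    static constexpr int kStages = 3;
};

// Same-type KV and Qh > 2: Qh is rounded up to a multiple of 8.
template<class T, int Qh_>
struct Config<Sm80, T, T, Qh_, std::enable_if_t<(Qh_ > 2)>> {
    using Mma = TensorCore;
    static constexpr int kQh     = (Qh_ + 7) / 8 * 8;
    static constexpr int kStages = 3;
};

}  // namespace sfinae_sketch
```

The two guards are mutually exclusive, so exactly one specialization is viable for any Qh and no ambiguity arises. A combination with no matching specialization falls through to the primary template and fails with the "config not found" message.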

Import

#include "src/turbomind/kernels/attention/decoding_config.h"

I/O Contract

Inputs

Name     Type      Required  Description
Arch     typename  Yes       GPU architecture (arch::Sm70, arch::Sm75, arch::Sm80)
T        typename  Yes       Compute type (half, bfloat16)
Tkv      typename  Yes       KV cache storage type (same as T, uint8_t, or uint4_t)
Qh       int       Yes       Number of query heads per KV head group (GQA ratio)
HeadDim  int       Yes       Attention head dimension

Outputs

Name       Type      Description
Kernel     typename  Fully assembled decoding kernel type for invokeDecoding
Attention  typename  Selected Impl specialization
CacheIter  typename  Block iterator factory type

Usage Examples

// INT8 quantized decoding on SM80 with GQA ratio 4
using Kernel = Decoding<arch::Sm80, half, uint8_t, 4, 128>;
invokeDecoding<Kernel>(params);
