Implementation:InternLM Lmdeploy AttentionConfig
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Compile-time configuration traits that select tile sizes, MMA instruction types, mainloop strategies, and cache iterators for prefill attention kernels across SM70, SM75, and SM80 GPU architectures.
Description
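AttentionConfig<Arch, T, HeadDim, CacheType> is a template struct whose specializations define the complete kernel type for prefill attention on each GPU architecture. The primary template contains only a static_assert that fails whenever it is instantiated, so an unsupported parameter combination is rejected at compile time with the message "config not found".
Each specialization selects:
- the CTA and warp tile sizes (CTA_Q, CTA_S, WARP_Q, WARP_S);
- the MMA instruction specialization (MMA_16816 for SM80, MMA_1688 for SM75, MMA_884 for SM70);
- the number of pipeline stages (in the signature below, Sm80_CpAsync<2> for the SM80 linear cache and Sm80_CpAsync<3> for the SM80 block cache);
- the cache iterator factory (linear or block);
- the mainloop type.
A common base struct, Base_64x64_16x64, provides the default 64x64 CTA tile with 16x64 warp tiles. The HeadDim=64 specialization for SM80 uses a larger CTA_S of 128 to improve occupancy (this specialization is not shown in the signature excerpt below). The CacheType enum (kLinear, kBlock) selects between contiguous and paged KV cache access.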
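The trait pattern described above can be sketched in isolation. The following is a minimal, self-contained illustration, not the TurboMind source: the `sketch` namespace, the `Config` name, the empty tag types, and the `kStages` member are all hypothetical stand-ins, and `float` replaces `half` so that no CUDA headers are needed. The stage counts simply mirror the `Sm80_CpAsync<2>` and `Sm80_CpAsync<3>` arguments from the signature.

```cpp
// Hypothetical sketch of a compile-time config trait; not the real TurboMind types.
namespace sketch {

namespace arch {
struct Sm70 {};
struct Sm80 {};
}  // namespace arch

enum class CacheType { kLinear, kBlock };

// Shared default tile shape: 64x64 CTA tile, 16x64 warp tile.
struct Base_64x64_16x64 {
    static constexpr int CTA_Q  = 64;
    static constexpr int CTA_S  = 64;
    static constexpr int WARP_Q = 16;
    static constexpr int WARP_S = 64;
};

// Primary template: the condition depends on T, so it is only checked when
// this template is actually instantiated, and it is false for every complete type.
template<class Arch, class T, int HeadDim, CacheType cache_type>
struct Config {
    static_assert(sizeof(T) == 0, "config not found");
};

// SM80, contiguous cache: 2 pipeline stages (assumed to mirror Sm80_CpAsync<2>).
template<class T, int HeadDim>
struct Config<arch::Sm80, T, HeadDim, CacheType::kLinear> : Base_64x64_16x64 {
    static constexpr int kStages = 2;
};

// SM80, paged cache: 3 pipeline stages (assumed to mirror Sm80_CpAsync<3>).
template<class T, int HeadDim>
struct Config<arch::Sm80, T, HeadDim, CacheType::kBlock> : Base_64x64_16x64 {
    static constexpr int kStages = 3;
};

// SM70: one specialization covers both cache types.
template<class T, int HeadDim, CacheType Ctype>
struct Config<arch::Sm70, T, HeadDim, Ctype> : Base_64x64_16x64 {};

// Selecting a config is purely a compile-time lookup.
using BlockCfg = Config<arch::Sm80, float, 128, CacheType::kBlock>;
static_assert(BlockCfg::CTA_Q == 64 && BlockCfg::kStages == 3, "unexpected config");

}  // namespace sketch
```

Because the tile constants live in a non-dependent base class, each specialization can refer to CTA_Q and CTA_S by their unqualified names, which is what the signature does when it passes them to Impl.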
Usage
Used by the attention dispatch layer to obtain the correct kernel type for a given combination of GPU architecture, data type, head dimension, and cache type. The resulting Kernel type alias is passed to invokeAttention as its template argument.
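A dispatch layer of this kind has to turn a runtime compute capability into a compile-time architecture tag. The sketch below shows one way this might be done; it is an assumption for illustration, not the actual TurboMind dispatcher. `dispatch_arch`, `KernelFor`, `GetKernelId`, and `kernel_id_for` are hypothetical names, and the choice to map every device at or above SM80 to the Sm80 tag is likewise assumed.

```cpp
// Hypothetical sketch: map a runtime SM version to a compile-time arch tag.
namespace sketch {

namespace arch {
struct Sm70 {};
struct Sm75 {};
struct Sm80 {};
}  // namespace arch

// Stand-in for "the kernel type selected for this architecture".
template<class Arch>
struct KernelFor;
template<>
struct KernelFor<arch::Sm70> { static constexpr int kId = 70; };
template<>
struct KernelFor<arch::Sm75> { static constexpr int kId = 75; };
template<>
struct KernelFor<arch::Sm80> { static constexpr int kId = 80; };

// Calls f with the tag matching sm (major * 10 + minor); returns -1 if unsupported.
template<class F>
constexpr int dispatch_arch(int sm, F f)
{
    if (sm >= 80) return f(arch::Sm80{});  // assumption: newer devices reuse the Sm80 config
    if (sm >= 75) return f(arch::Sm75{});
    if (sm >= 70) return f(arch::Sm70{});
    return -1;
}

// Functor that looks up the per-architecture trait from the tag it receives.
struct GetKernelId {
    template<class Arch>
    constexpr int operator()(Arch) const
    {
        return KernelFor<Arch>::kId;
    }
};

constexpr int kernel_id_for(int sm)
{
    return dispatch_arch(sm, GetKernelId{});
}

}  // namespace sketch
```

In a real dispatcher the functor body would launch the kernel selected by the config rather than return an identifier.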
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/attention_config.h
- Lines: 1-82
Signature
namespace turbomind::attention {

enum class CacheType { kLinear, kBlock };

template<class Arch, class T, int HeadDim, CacheType cache_type>
struct AttentionConfig {
    static_assert(sizeof(T) == 0, "config not found");
};

// SM80 linear cache (generic HeadDim)
template<class T, int HeadDim>
struct AttentionConfig<arch::Sm80, T, HeadDim, CacheType::kLinear> : Base_64x64_16x64 {
    using Attention = Impl<MMA_16816, T, T, 1, CTA_Q, CTA_S, 1, WARP_Q, WARP_S, HeadDim, 2>;
    using CacheIter = LinearIteratorFactory<T, CTA_S, HeadDim>;
    using Kernel = AttentionUniversal<arch::Sm80, Mainloop<Sm80_CpAsync<2>, Attention>, CacheIter, AttentionCtaMap>;
};

// SM80 block cache
template<class T, int HeadDim>
struct AttentionConfig<arch::Sm80, T, HeadDim, CacheType::kBlock> : Base_64x64_16x64 {
    using Attention = Impl<MMA_16816, T, T, 1, CTA_Q, CTA_S, 1, WARP_Q, WARP_S, HeadDim, 3>;
    using CacheIter = GetBlockIterFactory<T, T, CTA_S, HeadDim>;
    using Kernel = AttentionUniversal<arch::Sm80, Mainloop<Sm80_CpAsync<3>, Attention>, CacheIter, AttentionCtaMap>;
};

// SM75 (Turing)
template<class T, int HeadDim, CacheType Ctype>
struct AttentionConfig<arch::Sm75, T, HeadDim, Ctype> : Base_64x64_16x64 {
    using Attention = Impl<MMA_1688, T, T, 1, CTA_Q, CTA_S, 1, WARP_Q, WARP_S, HeadDim, 2>;
    using CacheIter = GetCacheIterFactory<Ctype, T, CTA_S, HeadDim>;
    using Kernel = AttentionUniversal<arch::Sm75, Mainloop<arch::Sm70, Attention>, CacheIter, AttentionCtaMap>;
};

// SM70 (Volta)
template<class T, int HeadDim, CacheType Ctype>
struct AttentionConfig<arch::Sm70, T, HeadDim, Ctype> : Base_64x64_16x64 {
    using Attention = Impl<MMA_884, T, T, 1, CTA_Q, CTA_S, 1, WARP_Q, WARP_S, HeadDim, 2>;
    using CacheIter = GetCacheIterFactory<Ctype, T, CTA_S, HeadDim>;
    using Kernel = AttentionUniversal<arch::Sm70, Mainloop<arch::Sm70, Attention>, CacheIter, AttentionCtaMap>;
};

}  // namespace turbomind::attention
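The SM75 and SM70 specializations take the cache type as a template parameter and obtain the iterator factory through GetCacheIterFactory, whose definition is not part of the excerpt. One plausible way to write such a selector is an alias template over `std::conditional_t`; the sketch below assumes that shape. The factory structs are empty hypothetical placeholders (with an illustrative `kPaged` flag), and `BlockIteratorFactory` stands in for whatever GetBlockIterFactory produces.

```cpp
// Hypothetical sketch: pick an iterator factory type from a CacheType value.
#include <type_traits>

namespace sketch {

enum class CacheType { kLinear, kBlock };

// Placeholder for a factory that walks a contiguous KV cache.
template<class T, int CTA_S, int HeadDim>
struct LinearIteratorFactory {
    static constexpr bool kPaged = false;
};

// Placeholder for a factory that walks a paged (block) KV cache.
template<class T, class Tkv, int CTA_S, int HeadDim>
struct BlockIteratorFactory {
    static constexpr bool kPaged = true;
};

// Compile-time switch: kLinear -> linear factory, otherwise -> block factory.
template<CacheType type, class T, int CTA_S, int HeadDim>
using GetCacheIterFactory = std::conditional_t<type == CacheType::kLinear,
                                               LinearIteratorFactory<T, CTA_S, HeadDim>,
                                               BlockIteratorFactory<T, T, CTA_S, HeadDim>>;

}  // namespace sketch
```

With a selector like this, a single partial specialization can serve both cache layouts, which matches how the SM75 and SM70 configs are written.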
Import
#include "src/turbomind/kernels/attention/attention_config.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| Arch | typename | Yes | GPU architecture tag (arch::Sm70, arch::Sm75, arch::Sm80) |
| T | typename | Yes | Data type (half, bfloat16) |
| HeadDim | int | Yes | Attention head dimension (64, 128, 192, 256) |
| cache_type | CacheType | Yes | KV cache type (kLinear or kBlock) |
Outputs
| Name | Type | Description |
|---|---|---|
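| Kernel | typename | Fully assembled kernel type, passed to invokeAttention as its template argument |
| Attention | typename | Impl specialization that fixes the MMA instruction, tile sizes, and head dimension |
| CacheIter | typename | Cache iterator factory type (linear or block) |
| CTA_Q | int | CTA tile size along the query dimension (inherited from Base_64x64_16x64) |
| CTA_S | int | CTA tile size along the key/value sequence dimension (inherited from Base_64x64_16x64) |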
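To illustrate how the exported tile sizes might be consumed, the sketch below computes how many CTA_Q-sized tiles cover a query and how many CTA_S-sized steps cover a key/value sequence. This is generic tiling arithmetic under stated assumptions, not the mapping implemented by AttentionCtaMap; `Tiles`, `ceil_div`, `query_tiles`, and `kv_steps` are hypothetical names.

```cpp
// Hypothetical sketch: tile-count arithmetic from a config's CTA_Q / CTA_S.
namespace sketch {

// Stand-in for a config that exposes the default 64x64 CTA tile.
struct Tiles {
    static constexpr int CTA_Q = 64;
    static constexpr int CTA_S = 64;
};

// Integer division rounded up; assumes a >= 0 and b > 0.
constexpr int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

// Number of CTA_Q-sized tiles needed to cover q_len query tokens.
template<class Config>
constexpr int query_tiles(int q_len)
{
    return ceil_div(q_len, Config::CTA_Q);
}

// Number of CTA_S-sized steps needed to cover kv_len cached tokens.
template<class Config>
constexpr int kv_steps(int kv_len)
{
    return ceil_div(kv_len, Config::CTA_S);
}

}  // namespace sketch
```

For example, with the default 64x64 tile a 100-token query needs two query tiles, the second of which is only partially filled.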
Usage Examples
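// Get kernel type for SM80 block-cache prefill with HeadDim=128.
// Assumes a scope in which arch::Sm80, CacheType, and AttentionConfig resolve
// (for example, inside namespace turbomind::attention).
using Config = AttentionConfig<arch::Sm80, half, 128, CacheType::kBlock>;
// params is the attention parameter object prepared by the caller (not shown).
invokeAttention<Config::Kernel>(params);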