Implementation:InternLM Lmdeploy AttentionConfig

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Compile-time configuration traits that select tile sizes, MMA instruction types, mainloop strategies, and cache iterators for prefill attention kernels across SM70, SM75, and SM80 GPU architectures.

Description

AttentionConfig<Arch, T, HeadDim, CacheType> is a template struct whose specializations define the complete kernel type for prefill attention on each GPU architecture. Each specialization selects the CTA and warp tile sizes (CTA_Q, CTA_S, WARP_Q, WARP_S), the MMA instruction specialization (MMA_16816 for SM80, MMA_1688 for SM75, MMA_884 for SM70), the number of pipeline stages (2 for the SM80 linear-cache path, 3 for the SM80 block-cache path), the cache iterator factory (linear or block), and the mainloop type (Sm80_CpAsync on SM80; SM75 and SM70 both use the arch::Sm70 mainloop). A common base struct, Base_64x64_16x64, supplies the default 64x64 CTA tile with 16x64 warp tiles. On SM80, the HeadDim=64 specialization uses a larger CTA_S of 128 to improve occupancy. The CacheType enum (kLinear, kBlock) selects between contiguous and paged KV cache access. The unspecialized primary template contains a static_assert that fails with "config not found", so an unsupported combination is rejected at compile time.
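
The trait pattern described above can be sketched in plain C++ without any CUDA dependency. The following is a minimal illustration, not the header itself: the namespace sketch, the empty tag structs, the Mma alias, and the kStages member are hypothetical stand-ins, float replaces the half-precision types, and the stage counts simply mirror the trailing 2 and 3 in the signature.

```cpp
#include <type_traits>

namespace sketch {

// Hypothetical stand-ins for the real architecture and MMA tags.
struct Sm70 {};
struct Sm80 {};
struct MMA_884 {};
struct MMA_16816 {};

enum class CacheType { kLinear, kBlock };

// Shared tile shape: 64x64 CTA tile, 16x64 warp tile.
struct Base_64x64_16x64 {
    static constexpr int CTA_Q  = 64;
    static constexpr int CTA_S  = 64;
    static constexpr int WARP_Q = 16;
    static constexpr int WARP_S = 64;
};

// Primary template: instantiating it means no specialization matched.
template<class Arch, class T, int HeadDim, CacheType cache_type>
struct AttentionConfig {
    static_assert(sizeof(T) == 0, "config not found");
};

// SM80, contiguous KV cache: two pipeline stages (assumed name kStages).
template<class T, int HeadDim>
struct AttentionConfig<Sm80, T, HeadDim, CacheType::kLinear> : Base_64x64_16x64 {
    using Mma = MMA_16816;
    static constexpr int kStages = 2;
};

// SM80, paged KV cache: three pipeline stages.
template<class T, int HeadDim>
struct AttentionConfig<Sm80, T, HeadDim, CacheType::kBlock> : Base_64x64_16x64 {
    using Mma = MMA_16816;
    static constexpr int kStages = 3;
};

// SM70: one specialization covers both cache types.
template<class T, int HeadDim, CacheType Ctype>
struct AttentionConfig<Sm70, T, HeadDim, Ctype> : Base_64x64_16x64 {
    using Mma = MMA_884;
    static constexpr int kStages = 2;
};

// Everything is resolved at compile time.
using Cfg = AttentionConfig<Sm80, float, 128, CacheType::kBlock>;
static_assert(Cfg::CTA_Q == 64 && Cfg::WARP_Q == 16, "tile sizes come from the base");
static_assert(std::is_same<Cfg::Mma, MMA_16816>::value, "SM80 selects MMA_16816");

}  // namespace sketch
```

Because the tile sizes live in the base struct, a specialization only has to state what differs for its architecture or cache type.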

Usage

The attention dispatch layer instantiates AttentionConfig to obtain the kernel type for the target GPU architecture, then passes the resulting Kernel type alias to invokeAttention.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/attention_config.h
Lines: 1-82

Signature

namespace turbomind::attention {

enum class CacheType { kLinear, kBlock };

template<class Arch, class T, int HeadDim, CacheType cache_type>
struct AttentionConfig {
    static_assert(sizeof(T) == 0, "config not found");
};

// SM80 linear cache (generic HeadDim)
template<class T, int HeadDim>
struct AttentionConfig<arch::Sm80, T, HeadDim, CacheType::kLinear> : Base_64x64_16x64 {
    using Attention = Impl<MMA_16816, T, T, 1, CTA_Q, CTA_S, 1, WARP_Q, WARP_S, HeadDim, 2>;
    using CacheIter = LinearIteratorFactory<T, CTA_S, HeadDim>;
    using Kernel = AttentionUniversal<arch::Sm80, Mainloop<Sm80_CpAsync<2>, Attention>, CacheIter, AttentionCtaMap>;
};

// SM80 block cache
template<class T, int HeadDim>
struct AttentionConfig<arch::Sm80, T, HeadDim, CacheType::kBlock> : Base_64x64_16x64 {
    using Attention = Impl<MMA_16816, T, T, 1, CTA_Q, CTA_S, 1, WARP_Q, WARP_S, HeadDim, 3>;
    using CacheIter = GetBlockIterFactory<T, T, CTA_S, HeadDim>;
    using Kernel = AttentionUniversal<arch::Sm80, Mainloop<Sm80_CpAsync<3>, Attention>, CacheIter, AttentionCtaMap>;
};

// SM75 (Turing); uses MMA_1688 but shares the arch::Sm70 mainloop
template<class T, int HeadDim, CacheType Ctype>
struct AttentionConfig<arch::Sm75, T, HeadDim, Ctype> : Base_64x64_16x64 {
    using Attention = Impl<MMA_1688, T, T, 1, CTA_Q, CTA_S, 1, WARP_Q, WARP_S, HeadDim, 2>;
    using CacheIter = GetCacheIterFactory<Ctype, T, CTA_S, HeadDim>;
    using Kernel = AttentionUniversal<arch::Sm75, Mainloop<arch::Sm70, Attention>, CacheIter, AttentionCtaMap>;
};

// SM70 (Volta)
template<class T, int HeadDim, CacheType Ctype>
struct AttentionConfig<arch::Sm70, T, HeadDim, Ctype> : Base_64x64_16x64 {
    using Attention = Impl<MMA_884, T, T, 1, CTA_Q, CTA_S, 1, WARP_Q, WARP_S, HeadDim, 2>;
    using CacheIter = GetCacheIterFactory<Ctype, T, CTA_S, HeadDim>;
    using Kernel = AttentionUniversal<arch::Sm70, Mainloop<arch::Sm70, Attention>, CacheIter, AttentionCtaMap>;
};

} // namespace turbomind::attention
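
The primary template in the signature relies on a common idiom: the condition sizeof(T) == 0 depends on a template parameter, so the compiler evaluates it only when the primary template is actually instantiated. A reduced sketch of that behaviour follows; Config, UnsupportedArch, and kTileQ are hypothetical names chosen for illustration.

```cpp
namespace sketch {

struct Sm80 {};             // hypothetical stand-in for arch::Sm80
struct UnsupportedArch {};  // hypothetical tag with no specialization

template<class Arch, class T>
struct Config {
    // Never true for a complete type, but dependent on T, so the check is
    // deferred until this primary template is instantiated.
    static_assert(sizeof(T) == 0, "config not found");
};

template<class T>
struct Config<Sm80, T> {
    static constexpr int CTA_Q = 64;
};

// Selects the Sm80 specialization, so the static_assert is never reached.
constexpr int kTileQ = Config<Sm80, float>::CTA_Q;

// Uncommenting the next line instantiates the primary template and stops
// compilation with the message "config not found".
// constexpr int kBad = Config<UnsupportedArch, float>::CTA_Q;

}  // namespace sketch
```

Writing static_assert(false, ...) instead would be rejected by many compilers even when the primary template is never used, which is why the condition is phrased in terms of T.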

Import

#include "src/turbomind/kernels/attention/attention_config.h"

I/O Contract

Inputs

Name	Type	Required	Description
Arch	typename	Yes	GPU architecture tag (arch::Sm70, arch::Sm75, arch::Sm80)
T	typename	Yes	Data type (half, bfloat16)
HeadDim	int	Yes	Attention head dimension (64, 128, 192, 256)
cache_type	CacheType	Yes	KV cache type (kLinear or kBlock)

Outputs

Name	Type	Description
Kernel	typename	Fully assembled kernel type for invokeAttention
CTA_Q	int	CTA query tile size
CTA_S	int	CTA sequence tile size
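
As a rough illustration of how the exported tile constants might be consumed, the sketch below derives tile counts from CTA_Q and CTA_S with a ceiling division. The helpers ceil_div, query_tiles, and kv_iterations are hypothetical; in the real kernel the mapping from problem size to CTAs is handled by AttentionCtaMap, which is not reproduced here.

```cpp
namespace sketch {

// Same tile shape as the base struct described on this page.
struct Base_64x64_16x64 {
    static constexpr int CTA_Q  = 64;
    static constexpr int CTA_S  = 64;
    static constexpr int WARP_Q = 16;
    static constexpr int WARP_S = 64;
};

// Rounds up so that a partial tile still gets processed.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Number of CTA_Q-sized tiles needed to cover q_len query tokens.
template<class Config>
constexpr int query_tiles(int q_len) { return ceil_div(q_len, Config::CTA_Q); }

// Number of CTA_S-sized steps needed to sweep kv_len cached tokens.
template<class Config>
constexpr int kv_iterations(int kv_len) { return ceil_div(kv_len, Config::CTA_S); }

// Warps per CTA, assuming warp tiles cover the CTA tile without overlap.
template<class Config>
constexpr int warps_per_cta() {
    return (Config::CTA_Q / Config::WARP_Q) * (Config::CTA_S / Config::WARP_S);
}

}  // namespace sketch
```

With the 64x64 CTA tile and 16x64 warp tile, this arithmetic gives four warps per CTA, under the stated non-overlap assumption.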

Usage Examples

// Get kernel type for SM80 block-cache prefill with HeadDim=128
using Config = AttentionConfig<arch::Sm80, half, 128, CacheType::kBlock>;
invokeAttention<Config::Kernel>(params);
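
Since Arch is a compile-time parameter, a caller needs some way to turn the device's runtime compute capability into one of the tag types. The sketch below shows one possible shape for that step. The function dispatch_arch, the thresholds, the int return type, and the TagId functor are all assumptions made for illustration and are not taken from the lmdeploy dispatch code.

```cpp
#include <stdexcept>

namespace sketch {

// Hypothetical stand-ins for arch::Sm70, arch::Sm75, arch::Sm80.
struct Sm70 {};
struct Sm75 {};
struct Sm80 {};

// Calls f with the tag for the newest supported architecture that does not
// exceed the device's compute capability (e.g. 86 falls back to Sm80).
template<class F>
constexpr int dispatch_arch(int sm, F f) {
    if (sm >= 80) return f(Sm80{});
    if (sm >= 75) return f(Sm75{});
    if (sm >= 70) return f(Sm70{});
    throw std::runtime_error("unsupported compute capability");
}

// Example callable: a real one would launch the kernel, for instance
// invokeAttention<AttentionConfig<Arch, T, HeadDim, cache_type>::Kernel>(params).
struct TagId {
    constexpr int operator()(Sm70) const { return 70; }
    constexpr int operator()(Sm75) const { return 75; }
    constexpr int operator()(Sm80) const { return 80; }
};

}  // namespace sketch
```

Each branch instantiates the callable for a different tag, so every architecture's kernel type is compiled once and the selection happens with a single runtime comparison.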

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
