Implementation:InternLM Lmdeploy AttentionUniversal
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Core fused multi-head attention kernel struct implementing FlashAttention-style tiled attention with online softmax, supporting both prefill and decoding phases across multiple GPU architectures.
Description
AttentionUniversal is the top-level device-side kernel body, templated on architecture, mainloop strategy, cache iterator factory, and CTA mapping. It orchestrates: (1) a Prologue that loads Q/K/V from global memory, applies bias, rotary positional embedding (RoPE), and optional log-n attention scaling, then quantizes and writes K/V to the paged cache; (2) a tiled Mainloop that processes KV tiles in reverse order with online softmax (FlashAttention-2 algorithm) using masked/unmasked/window-masked passes; (3) an Epilogue that merges partial results across warps and either writes the final normalized output or stores partial O/M/L for split-K reduction. The global kernel function attention_kernel dispatches into this struct.
Usage
Not called directly. Instantiated as part of an AttentionConfig or DecodingConfig type alias and launched via invokeAttention or invokeDecoding.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/attention_universal.h
- Lines: 1-598
Signature
template<class Arch_, class Mainloop, class CacheIteratorFactory_, class CtaMap_>
struct AttentionUniversal {
using T = typename Mainloop::T;
using Tkv = typename Mainloop::Tkv;
using Impl = typename Mainloop::Impl;
using CacheIteratorFactory = CacheIteratorFactory_;
using CtaMap = CtaMap_;
using Arch = Arch_;
static constexpr int kWarpCount = Impl::kWarpCount;
using ParamType = AttentionParams<T>;
static constexpr int kHeadDim = Impl::kHeadDim;
static constexpr int CTA_H = Impl::CTA_H;
static constexpr int CTA_Q = Impl::CTA_Q;
static constexpr int CTA_S = Impl::CTA_S;
using SharedStorage = typename Mainloop::SharedStorage;
static constexpr bool kProcessKV = CTA_Q == 1;
// Device-side operator
__device__ void operator()(
const ParamType& params,
CacheIteratorFactory& cache_iter_factory,
const CtaMap& cta_map,
char* smem_buf);
};
template<class Kernel>
__global__ void attention_kernel(
typename Kernel::ParamType params,
typename Kernel::CacheIteratorFactory cache_iter_factory,
typename Kernel::CtaMap cta_map,
int q_group_size, int q_head_per_cta, int cta_per_q_group);
Import
#include "src/turbomind/kernels/attention/attention_universal.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| params | AttentionParams<T> | Yes | Kernel parameters with Q/K/V buffers and metadata |
| cache_iter_factory | CacheIteratorFactory | Yes | Factory for creating KV cache tile iterators |
| cta_map | CtaMap | Yes | Mapping from CUDA grid to query/batch/head/split indices |
| smem_buf | char* | Yes | Dynamic shared memory buffer |
Outputs
| Name | Type | Description |
|---|---|---|
| params.out | T* | Final attention output (when no split-K needed) |
| params.partial_O | float* | Partial unnormalized output per split |
| params.partial_ML | float* | Partial max and log-sum-exp per split |
Usage Examples
// Typically instantiated via config, not directly:
using Kernel = AttentionUniversal<
arch::Sm80,
Mainloop<Sm80_CpAsync<3>, Impl<MMA_16816, half, half, 1, 64, 64, 1, 16, 64, 128, 3>>,
GetBlockIterFactory<half, half, 64, 128>,
AttentionCtaMap>;
// Launched by invokeAttention<Kernel>(params);