Implementation:InternLM Lmdeploy AttentionUniversal

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Core fused multi-head attention kernel struct implementing FlashAttention-style tiled attention with online softmax, supporting both prefill and decoding phases across multiple GPU architectures.

Description

AttentionUniversal is the top-level device-side kernel body, templated on architecture, mainloop strategy, cache iterator factory, and CTA mapping. It orchestrates: (1) a Prologue that loads Q/K/V from global memory, applies bias, rotary positional embedding (RoPE), and optional log-n attention scaling, then quantizes and writes K/V to the paged cache; (2) a tiled Mainloop that processes KV tiles in reverse order with online softmax (FlashAttention-2 algorithm) using masked/unmasked/window-masked passes; (3) an Epilogue that merges partial results across warps and either writes the final normalized output or stores partial O/M/L for split-K reduction. The global kernel function attention_kernel dispatches into this struct.

Usage

Not called directly. Instantiated as part of an AttentionConfig or DecodingConfig type alias and launched via invokeAttention or invokeDecoding.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/attention_universal.h
Lines: 1-598

Signature

template<class Arch_, class Mainloop, class CacheIteratorFactory_, class CtaMap_>
struct AttentionUniversal {
    using T   = typename Mainloop::T;
    using Tkv = typename Mainloop::Tkv;
    using Impl = typename Mainloop::Impl;

    using CacheIteratorFactory = CacheIteratorFactory_;
    using CtaMap               = CtaMap_;
    using Arch = Arch_;

    static constexpr int kWarpCount = Impl::kWarpCount;
    using ParamType = AttentionParams<T>;
    static constexpr int kHeadDim = Impl::kHeadDim;

    static constexpr int CTA_H = Impl::CTA_H;
    static constexpr int CTA_Q = Impl::CTA_Q;
    static constexpr int CTA_S = Impl::CTA_S;

    using SharedStorage = typename Mainloop::SharedStorage;
    static constexpr bool kProcessKV = CTA_Q == 1;

    // Device-side operator
    __device__ void operator()(
        const ParamType& params,
        CacheIteratorFactory& cache_iter_factory,
        const CtaMap& cta_map,
        char* smem_buf);
};

template<class Kernel>
__global__ void attention_kernel(
    typename Kernel::ParamType            params,
    typename Kernel::CacheIteratorFactory cache_iter_factory,
    typename Kernel::CtaMap               cta_map,
    int q_group_size, int q_head_per_cta, int cta_per_q_group);

Import

#include "src/turbomind/kernels/attention/attention_universal.h"

I/O Contract

Inputs

Name	Type	Required	Description
params	AttentionParams<T>	Yes	Kernel parameters with Q/K/V buffers and metadata
cache_iter_factory	CacheIteratorFactory	Yes	Factory for creating KV cache tile iterators
cta_map	CtaMap	Yes	Mapping from CUDA grid to query/batch/head/split indices
smem_buf	char*	Yes	Dynamic shared memory buffer

Outputs

Name	Type	Description
params.out	T*	Final attention output (when no split-K needed)
params.partial_O	float*	Partial unnormalized output per split
params.partial_ML	float*	Partial max and log-sum-exp per split

Usage Examples

// Typically instantiated via config, not directly:
using Kernel = AttentionUniversal<
    arch::Sm80,
    Mainloop<Sm80_CpAsync<3>, Impl<MMA_16816, half, half, 1, 64, 64, 1, 16, 64, 128, 3>>,
    GetBlockIterFactory<half, half, 64, 128>,
    AttentionCtaMap>;
// Launched by invokeAttention<Kernel>(params);

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment