Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy AttentionUniversal

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, Attention
Last Updated 2026-02-07 15:00 GMT

Overview

Core fused multi-head attention kernel struct implementing FlashAttention-style tiled attention with online softmax, supporting both prefill and decoding phases across multiple GPU architectures.

Description

AttentionUniversal is the top-level device-side kernel body, templated on architecture, mainloop strategy, cache iterator factory, and CTA mapping. It orchestrates: (1) a Prologue that loads Q/K/V from global memory, applies bias, rotary positional embedding (RoPE), and optional log-n attention scaling, then quantizes and writes K/V to the paged cache; (2) a tiled Mainloop that processes KV tiles in reverse order with online softmax (FlashAttention-2 algorithm) using masked/unmasked/window-masked passes; (3) an Epilogue that merges partial results across warps and either writes the final normalized output or stores partial O/M/L for split-K reduction. The global kernel function attention_kernel dispatches into this struct.

Usage

Not called directly. Instantiated as part of an AttentionConfig or DecodingConfig type alias and launched via invokeAttention or invokeDecoding.

Code Reference

Source Location

Signature

template<class Arch_, class Mainloop, class CacheIteratorFactory_, class CtaMap_>
struct AttentionUniversal {
    using T   = typename Mainloop::T;
    using Tkv = typename Mainloop::Tkv;
    using Impl = typename Mainloop::Impl;

    using CacheIteratorFactory = CacheIteratorFactory_;
    using CtaMap               = CtaMap_;
    using Arch = Arch_;

    static constexpr int kWarpCount = Impl::kWarpCount;
    using ParamType = AttentionParams<T>;
    static constexpr int kHeadDim = Impl::kHeadDim;

    static constexpr int CTA_H = Impl::CTA_H;
    static constexpr int CTA_Q = Impl::CTA_Q;
    static constexpr int CTA_S = Impl::CTA_S;

    using SharedStorage = typename Mainloop::SharedStorage;
    static constexpr bool kProcessKV = CTA_Q == 1;

    // Device-side operator
    __device__ void operator()(
        const ParamType& params,
        CacheIteratorFactory& cache_iter_factory,
        const CtaMap& cta_map,
        char* smem_buf);
};

template<class Kernel>
__global__ void attention_kernel(
    typename Kernel::ParamType            params,
    typename Kernel::CacheIteratorFactory cache_iter_factory,
    typename Kernel::CtaMap               cta_map,
    int q_group_size, int q_head_per_cta, int cta_per_q_group);

Import

#include "src/turbomind/kernels/attention/attention_universal.h"

I/O Contract

Inputs

Name Type Required Description
params AttentionParams<T> Yes Kernel parameters with Q/K/V buffers and metadata
cache_iter_factory CacheIteratorFactory Yes Factory for creating KV cache tile iterators
cta_map CtaMap Yes Mapping from CUDA grid to query/batch/head/split indices
smem_buf char* Yes Dynamic shared memory buffer

Outputs

Name Type Description
params.out T* Final attention output (when no split-K needed)
params.partial_O float* Partial unnormalized output per split
params.partial_ML float* Partial max and log-sum-exp per split

Usage Examples

// Typically instantiated via config, not directly:
using Kernel = AttentionUniversal<
    arch::Sm80,
    Mainloop<Sm80_CpAsync<3>, Impl<MMA_16816, half, half, 1, 64, 64, 1, 16, 64, 128, 3>>,
    GetBlockIterFactory<half, half, 64, 128>,
    AttentionCtaMap>;
// Launched by invokeAttention<Kernel>(params);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment