Implementation:InternLM Lmdeploy Impl 81616

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Attention implementation built on the m16n8k16 MMA instruction with a transposed operand layout. It is optimized for quantized KV caches (8-bit and 4-bit storage types) and supports grouped-query attention (GQA), where several query heads share one KV head, in the decoding phase.

Description

This is the MMA_81616 specialization of Impl, designed for decoding attention with quantized KV caches (uint8_t, uint4_t, fp8_e4m3_t, fp4_e2m1_t). Unlike the prefill-oriented 16816 impl, it transposes the QK computation: K is the M-operand and Q is the N-operand, which lets a single warp process several query heads at once (CTA_H > 1). Sub-byte KV data is dequantized on the fly with ConvertKvCache, using quantization parameters loaded from dedicated shared memory regions (SmemLayoutKVp). In the Merge step, M, L, and O are reduced across warps through shared memory. Output is written through a staging buffer in shared memory (O1) so that global writes are coalesced.
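A scalar sketch may help picture the transposed layout. It computes one tile of S^T = K * Q^T with KV positions along the M dimension (16) and query heads along the N dimension (8), and it includes a small helper showing how much of an operand dimension is filled when 8 heads are mapped onto it. The names tile_utilization and qk_transposed are hypothetical and exist only for this illustration; the real implementation works on warp-level MMA fragments rather than plain arrays.

```cpp
#include <array>
#include <cassert>

// MMA tile shape named on this page (m16n8k16).
constexpr int OP_M = 16;
constexpr int OP_N = 8;
constexpr int OP_K = 16;

// Fraction of an operand dimension of size `op` that carries real data
// when `used` logical rows are mapped onto it (rounded up to whole tiles).
inline double tile_utilization(int used, int op)
{
    assert(used > 0 && op > 0);
    const int tiles = (used + op - 1) / op;
    return static_cast<double>(used) / (tiles * op);
}

using TileK = std::array<std::array<float, OP_K>, OP_M>;  // [kv position][dim]
using TileQ = std::array<std::array<float, OP_K>, OP_N>;  // [query head][dim]
using TileS = std::array<std::array<float, OP_N>, OP_M>;  // [kv position][query head]

// Scalar reference for one transposed tile: St[s][h] = sum_d K[s][d] * Q[h][d].
inline TileS qk_transposed(const TileK& k, const TileQ& q)
{
    TileS st{};
    for (int s = 0; s < OP_M; ++s)
        for (int h = 0; h < OP_N; ++h)
            for (int d = 0; d < OP_K; ++d)
                st[s][h] += k[s][d] * q[h][d];
    return st;
}
```

With 8 query heads and a single query token, tile_utilization(8, OP_N) is 1.0, whereas mapping the same 8 heads onto the M dimension gives tile_utilization(8, OP_M) = 0.5. This is one way to read the motivation for making K the M-operand.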

Usage

Selected by DecodingConfig when GQA group size > 2 on SM80, or for all quantized KV cache configurations on SM75/SM80. Always used with CTA_Q=1.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/impl_81616.h
Lines: 1-778

Signature

namespace turbomind::attention {

template<class T_, class Tkv_, int CTA_H_, int CTA_Q_, int CTA_S_,
         int WARP_H_, int WARP_Q, int WARP_S, int HeadDim, int Stages>
struct Impl<MMA_81616, T_, Tkv_, CTA_H_, CTA_Q_, CTA_S_,
            WARP_H_, WARP_Q, WARP_S, HeadDim, Stages> {

    using T = T_;
    using Tkv = Tkv_;
    static constexpr int kQuantKV = !std::is_same_v<T, Tkv>;

    // MMA operand sizes (transposed: K is M, Q is N)
    static constexpr int OP_M = 16;
    static constexpr int OP_N = 8;
    static constexpr int OP_K = 16;

    using FragK = Array<T, 8>[K_K][K_M];
    using FragQ = Array<T, 4>[K_N][K_K];
    using FragS = Array<float, 4>[K_M][K_N];
    using FragV = Array<T, 8>[V_M][V_K];
    using FragP = Array<T, 4>[V_K][V_N];
    using FragO = Array<float, 4>[V_M][V_N];

    // Dequantization data types
    using DataK = Array<Tkv, 8*X>[K_K/X][K_M];
    using ParamK = Array<T, 2>[K_M][2];

    union SharedStorage {
        T Q[SmemLayoutQ::kSize];
        struct { Array<Tkv, Stages*SmemLayoutK::kSize> KV; T KVp[...]; };
        struct { SmemM M; SmemM L; SmemO O; };
        float O1[CTA_H1][kHeadDim];
    };

    struct StateQK { ... };
    struct StatePV { ... };

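    // Online softmax over a score fragment; updates running max/sum and rescales O.
    template<bool is_residue>
    static void Softmax(FragS&, FragM&, FragM&, FragO&, float);
    // Converts float scores S to compute-type probabilities P.
    static void ConvertStoP(FragS&, FragP&, SharedStorage&);
    // Cross-warp reduction of M/L/O through shared memory.
    static void Merge(FragO&, FragM&, FragL&, float, SharedStorage&);
    // Writes O through the shared-memory staging buffer via the supplied functor.
    template<bool is_norm, class Func>
    static void StoreO(FragO&, FragL&, SharedStorage&, Func&&);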
};

} // namespace turbomind::attention

Import

#include "src/turbomind/kernels/attention/impl_81616.h"

I/O Contract

Inputs

Name	Type	Required	Description
T_	typename	Yes	Compute type (half, bfloat16)
Tkv_	typename	Yes	KV cache storage type (half, uint8_t, uint4_t, fp8_e4m3_t, fp4_e2m1_t)
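CTA_H_	int	Yes	Number of query heads processed per CTA
CTA_Q_	int	Yes	Number of query tokens processed per CTA (always 1 for this implementation)
CTA_S_	int	Yes	KV sequence tile size per CTA
WARP_H_, WARP_Q, WARP_S	int	Yes	Per-warp tile sizes along the head, query, and KV sequence dimensions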
HeadDim	int	Yes	Head dimension
Stages	int	Yes	Pipeline stages (2, 3, or 5)
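To make the role of the quantized Tkv_ types concrete, the following host-side sketch dequantizes 8-bit and packed 4-bit integer values with a per-row scale and zero point. It is an illustration under stated assumptions, not the ConvertKvCache code: the affine formula x = q * scale + zero, the low-nibble-first packing order, and the names QuantParam, dequant, and dequant_u4 are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-row quantization parameters.
struct QuantParam {
    float scale;
    float zero;
};

// Assumed affine mapping from a stored integer to the compute type.
inline float dequant(std::uint8_t q, QuantParam p)
{
    assert(p.scale > 0.0f);
    return static_cast<float>(q) * p.scale + p.zero;
}

// Unpack 4-bit values stored two per byte (low nibble first, an assumption)
// and dequantize each one.
inline std::vector<float> dequant_u4(const std::vector<std::uint8_t>& packed, QuantParam p)
{
    std::vector<float> out;
    out.reserve(packed.size() * 2);
    for (std::uint8_t byte : packed) {
        out.push_back(dequant(static_cast<std::uint8_t>(byte & 0x0F), p));
        out.push_back(dequant(static_cast<std::uint8_t>(byte >> 4), p));
    }
    return out;
}
```

In the kernel this conversion happens in registers right after K or V fragments are loaded from shared memory, so the MMA always sees operands of the compute type T_.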

Outputs

Name	Type	Description
FragO	Array<float,4>[V_M][V_N]	Accumulated output fragments across all heads in CTA
FragM	Array<float,2>[K_N]	Per-head running maximum of the attention scores (softmax stabilizer)
FragL	Array<float,2>[K_N]	Per-head running sum of exponentiated scores (softmax denominator)
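The three outputs together form a partial softmax state that can be combined across warps. The sketch below shows the standard rescale-and-add merge for such states on the host, using plain vectors instead of fragments. The struct Partial and the functions accumulate and merge are hypothetical names for this illustration; the shared-memory layout and thread mapping of the actual Merge step are not modeled.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Partial attention state for one head: running max m, running sum l of
// exp(score - m), and unnormalized output o = sum_i exp(score_i - m) * v_i.
struct Partial {
    float m;
    float l;
    std::vector<float> o;
};

// Build a partial state from one slice of (score, value) pairs.
inline Partial accumulate(const std::vector<float>& scores,
                          const std::vector<std::vector<float>>& values)
{
    assert(!scores.empty() && scores.size() == values.size());
    Partial p{scores[0], 0.0f, std::vector<float>(values[0].size(), 0.0f)};
    for (float s : scores) p.m = std::fmax(p.m, s);
    for (std::size_t i = 0; i < scores.size(); ++i) {
        const float w = std::exp(scores[i] - p.m);
        p.l += w;
        for (std::size_t d = 0; d < p.o.size(); ++d) p.o[d] += w * values[i][d];
    }
    return p;
}

// Merge two partial states by rescaling both to their common maximum.
inline Partial merge(const Partial& a, const Partial& b)
{
    assert(a.o.size() == b.o.size());
    const float m  = std::fmax(a.m, b.m);
    const float sa = std::exp(a.m - m);
    const float sb = std::exp(b.m - m);
    Partial r{m, a.l * sa + b.l * sb, std::vector<float>(a.o.size())};
    for (std::size_t d = 0; d < r.o.size(); ++d) r.o[d] = a.o[d] * sa + b.o[d] * sb;
    return r;
}
```

Dividing the merged o by the merged l gives the same result as a softmax-weighted sum over the full sequence, which is why each warp can process its own slice of the KV cache independently before the reduction.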

Usage Examples

// INT8 KV decoding with 8 query heads per group on SM80
using Attention = Impl<MMA_81616, half, uint8_t, 8, 1, 64, 8, 1, 16, 128, 5>;
using Kernel = AttentionUniversal<arch::Sm80,
    Mainloop<Sm80_CpAsync<5>, Attention>,
    GetBlockIterFactory<half, uint8_t, 64, 128>,
    DecodingCtaMap>;
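As a rough reading aid for the template arguments in this example, the sketch below derives a warp count from the CTA and warp tile sizes. It assumes each warp owns one WARP_H x WARP_Q x WARP_S tile of the CTA_H x CTA_Q x CTA_S CTA tile; that partitioning rule, the struct TileConfig, and the function warp_count are assumptions made for illustration and are not taken from the source file.

```cpp
#include <cassert>

// Hypothetical bundle of the tile-size template arguments.
struct TileConfig {
    int cta_h, cta_q, cta_s;
    int warp_h, warp_q, warp_s;
};

// Warps per CTA under the assumption that the CTA tile is split evenly
// into per-warp tiles along each of the three dimensions.
inline int warp_count(const TileConfig& c)
{
    assert(c.cta_h % c.warp_h == 0);
    assert(c.cta_q % c.warp_q == 0);
    assert(c.cta_s % c.warp_s == 0);
    return (c.cta_h / c.warp_h) * (c.cta_q / c.warp_q) * (c.cta_s / c.warp_s);
}
```

For the arguments above (8, 1, 64, 8, 1, 16) this gives 4 warps, each covering all 8 heads and a 16-position slice of the 64-position KV tile. Under that reading, the slices are what the cross-warp Merge step combines.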

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
