Implementation:InternLM Lmdeploy Impl 81616
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Decoding-phase attention implementation built on the m16n8k16 MMA instruction with a transposed operand layout. It is optimized for quantized KV caches (8-bit and 4-bit) and processes multiple query heads per KV head, as required by grouped-query attention (GQA).
Description
This is the MMA_81616 specialization of Impl, designed for decoding attention with quantized KV caches (uint8_t, uint4_t, fp8_e4m3_t, fp4_e2m1_t). Unlike the prefill-oriented 16816 implementation, it transposes the QK computation so that K is the M operand and Q is the N operand of the MMA, which lets a single warp process multiple query heads (CTA_H > 1) efficiently. Sub-byte KV data is dequantized on the fly by ConvertKvCache, using quantization parameters loaded from dedicated shared memory regions (SmemLayoutKVp). In the Merge step, the per-warp running max (M), running sum (L), and output accumulator (O) are reduced across warps through shared memory. The result is then written through a shared-memory staging buffer (O1) so that global writes are coalesced.
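The on-the-fly dequantization step can be illustrated on the host with a minimal sketch. It assumes an affine scheme in which a stored code `q` maps back to `(q - zero) * scale`, and a low-nibble-first packing for 4-bit codes. Both are assumptions made for illustration, and `QuantParam`, `dequant_u4x2`, and `dequant_u8` are hypothetical names; the actual ConvertKvCache and ParamK layout may differ.

```cpp
#include <array>
#include <cstdint>

// Hypothetical quantization parameters for one group of values
// (not the library's ParamK layout).
struct QuantParam {
    float scale;
    float zero;
};

// Unpack one byte holding two 4-bit codes (low nibble first, an assumed
// order) and map each code back to a real value with (code - zero) * scale.
inline std::array<float, 2> dequant_u4x2(std::uint8_t packed, QuantParam p)
{
    const int lo = packed & 0x0F;
    const int hi = (packed >> 4) & 0x0F;
    return {(lo - p.zero) * p.scale, (hi - p.zero) * p.scale};
}

// 8-bit codes need no unpacking, only the affine map.
inline float dequant_u8(std::uint8_t code, QuantParam p)
{
    return (code - p.zero) * p.scale;
}
```

With `scale = 0.5` and `zero = 8`, the packed byte `0xA3` holds the codes 3 and 10 and decodes to -2.5 and 1.0. In a kernel the same arithmetic would run on register fragments right before the MMA, so the cache stays compressed in global and shared memory.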
Usage
Selected by DecodingConfig when GQA group size > 2 on SM80, or for all quantized KV cache configurations on SM75/SM80. Always used with CTA_Q=1.
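The selection rule above can be restated as a compile-time predicate. This is only one reading of the sentence, not code from lmdeploy: `Arch` and `uses_impl_81616` are hypothetical names, and the sketch assumes that an unquantized cache on SM75 does not take this path.

```cpp
// Hypothetical helper, not part of lmdeploy. It restates the selection rule
// described in the text as a compile-time predicate.
enum class Arch { Sm75, Sm80 };

constexpr bool uses_impl_81616(Arch arch, int gqa_group_size, bool quantized_kv)
{
    // Any quantized KV cache on SM75/SM80 takes this path.
    if (quantized_kv) {
        return true;
    }
    // Otherwise only SM80 with more than two query heads per KV head.
    return arch == Arch::Sm80 && gqa_group_size > 2;
}

static_assert(uses_impl_81616(Arch::Sm80, 8, false), "GQA > 2 on SM80");
static_assert(!uses_impl_81616(Arch::Sm80, 2, false), "small group, unquantized");
static_assert(uses_impl_81616(Arch::Sm75, 1, true), "quantized KV on SM75");
```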
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/impl_81616.h
- Lines: 1-778
Signature
namespace turbomind::attention {

template<class T_, class Tkv_, int CTA_H_, int CTA_Q_, int CTA_S_,
         int WARP_H_, int WARP_Q, int WARP_S, int HeadDim, int Stages>
struct Impl<MMA_81616, T_, Tkv_, CTA_H_, CTA_Q_, CTA_S_,
            WARP_H_, WARP_Q, WARP_S, HeadDim, Stages> {
    using T   = T_;
    using Tkv = Tkv_;

    // Nonzero when the KV cache storage type differs from the compute type
    static constexpr int kQuantKV = !std::is_same_v<T, Tkv>;

    // MMA operand sizes (transposed: K is M, Q is N)
    static constexpr int OP_M = 16;
    static constexpr int OP_N = 8;
    static constexpr int OP_K = 16;

    // Per-thread register fragments for S^T = K * Q^T and O^T = V^T * P^T
    using FragK = Array<T, 8>[K_K][K_M];
    using FragQ = Array<T, 4>[K_N][K_K];
    using FragS = Array<float, 4>[K_M][K_N];
    using FragV = Array<T, 8>[V_M][V_K];
    using FragP = Array<T, 4>[V_K][V_N];
    using FragO = Array<float, 4>[V_M][V_N];

    // Dequantization data types
    using DataK  = Array<Tkv, 8*X>[K_K/X][K_M];
    using ParamK = Array<T, 2>[K_M][2];

    // The members alias the same shared memory at different kernel phases
    union SharedStorage {
        T Q[SmemLayoutQ::kSize];
        struct { Array<Tkv, Stages*SmemLayoutK::kSize> KV; T KVp[...]; };
        struct { SmemM M; SmemM L; SmemO O; };
        float O1[CTA_H1][kHeadDim];
    };

    struct StateQK { ... };
    struct StatePV { ... };

    template<bool is_residue>
    static void Softmax(FragS&, FragM&, FragM&, FragO&, float);
    static void ConvertStoP(FragS&, FragP&, SharedStorage&);
    static void Merge(FragO&, FragM&, FragL&, float, SharedStorage&);
    template<bool is_norm, class Func>
    static void StoreO(FragO&, FragL&, SharedStorage&, Func&&);
};

}  // namespace turbomind::attention
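The signature uses tile counts (K_M, K_N, K_K, V_M, V_N, V_K) that are not defined in the excerpt. The sketch below shows one plausible way they could follow from the warp tile and the MMA shape. The relations are assumptions made for illustration and should be checked against impl_81616.h; `tile_counts`, `TileCounts`, and `ceil_div` are hypothetical helpers. The per-thread fragment sizes, by contrast, follow directly from dividing each m16n8k16 operand tile over the 32 threads of a warp.

```cpp
// MMA shape from the signature above.
constexpr int OP_M = 16, OP_N = 8, OP_K = 16;
constexpr int kWarpSize = 32;

// Elements each thread holds per MMA tile: A (16x16), B (16x8), C (16x8).
static_assert(OP_M * OP_K / kWarpSize == 8, "matches Array<T, 8> in FragK/FragV");
static_assert(OP_N * OP_K / kWarpSize == 4, "matches Array<T, 4> in FragQ/FragP");
static_assert(OP_M * OP_N / kWarpSize == 4, "matches Array<float, 4> in FragS/FragO");

constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

struct TileCounts {
    int K_M, K_N, K_K;  // QK: KV positions as M, query heads as N, head dim as K
    int V_M, V_N, V_K;  // PV: head dim as M, query heads as N, KV positions as K
};

// Assumed relations between the warp tile and the number of MMA tiles.
constexpr TileCounts tile_counts(int warp_h, int warp_s, int head_dim)
{
    return {warp_s / OP_M, ceil_div(warp_h, OP_N), head_dim / OP_K,
            head_dim / OP_M, ceil_div(warp_h, OP_N), warp_s / OP_K};
}

// WARP_H = 8, WARP_S = 16, HeadDim = 128, as in the usage example.
constexpr TileCounts t = tile_counts(8, 16, 128);
static_assert(t.K_M == 1 && t.K_N == 1 && t.K_K == 8, "QK tiles");
static_assert(t.V_M == 8 && t.V_N == 1 && t.V_K == 1, "PV tiles");
```

Under these assumptions, eight query heads fill the N dimension of a single MMA exactly, which is why placing Q on the N side suits decoding: with one query token per head, the 16-row M dimension would otherwise be mostly padding.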
Import
#include "src/turbomind/kernels/attention/impl_81616.h"
I/O Contract
Inputs
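| Name | Type | Required | Description |
|---|---|---|---|
| T_ | typename | Yes | Compute type (half, bfloat16) |
| Tkv_ | typename | Yes | KV cache storage type (half, uint8_t, uint4_t, fp8_e4m3_t, fp4_e2m1_t) |
| CTA_H_ | int | Yes | Number of query heads processed per CTA |
| CTA_Q_ | int | Yes | Number of query tokens processed per CTA (always 1 for this implementation) |
| CTA_S_ | int | Yes | Number of KV sequence positions processed per CTA tile |
| WARP_H_ | int | Yes | Number of query heads processed per warp |
| WARP_Q | int | Yes | Number of query tokens processed per warp |
| WARP_S | int | Yes | Number of KV sequence positions processed per warp |
| HeadDim | int | Yes | Head dimension |
| Stages | int | Yes | Pipeline stages (2, 3, or 5) |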
Outputs
| Name | Type | Description |
|---|---|---|
| FragO | Array<float,4>[V_M][V_N] | Accumulated output fragments across all heads in CTA |
| FragM | Array<float,2>[K_N] | Per-head running max |
| FragL | Array<float,2>[K_N] | Per-head running sum |
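The three outputs together form an online-softmax state: O holds an unnormalized weighted sum of V, M the running maximum score, and L the running sum of exponentials. The host-side sketch below shows how two such partial states for one head can be combined and then normalized, which is the arithmetic a cross-warp merge has to perform. `Partial`, `merge`, and `normalize` are illustrative names, the sketch uses the natural exponential, and it does not handle states whose maximum is still negative infinity; the kernel's Merge and StoreO may differ in all of these details.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Partial attention state for one query head, as produced over one slice
// of the KV sequence. Names are illustrative.
struct Partial {
    float m;               // running max of the scores
    float l;               // running sum of exp(score - m)
    std::vector<float> o;  // unnormalized output: sum of exp(score - m) * v
};

// Combine two partial states by rescaling both to their common max.
inline Partial merge(const Partial& a, const Partial& b)
{
    const float m  = std::max(a.m, b.m);
    const float sa = std::exp(a.m - m);
    const float sb = std::exp(b.m - m);
    Partial r{m, a.l * sa + b.l * sb, std::vector<float>(a.o.size())};
    for (std::size_t i = 0; i < r.o.size(); ++i) {
        r.o[i] = a.o[i] * sa + b.o[i] * sb;
    }
    return r;
}

// Final normalization: divide the accumulated output by the running sum.
inline std::vector<float> normalize(const Partial& p)
{
    std::vector<float> out(p.o.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = p.o[i] / p.l;
    }
    return out;
}
```

Because the merge is associative, partial states can be combined in any order, which is what allows each warp to work on its own KV slice and defer the reduction to the end.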
Usage Examples
// INT8 KV decoding with 8 query heads per group on SM80
// Template arguments: CTA_H=8, CTA_Q=1, CTA_S=64, WARP_H=8, WARP_Q=1, WARP_S=16, HeadDim=128, Stages=5
using Attention = Impl<MMA_81616, half, uint8_t, 8, 1, 64, 8, 1, 16, 128, 5>;

using Kernel = AttentionUniversal<arch::Sm80,
                                  Mainloop<Sm80_CpAsync<5>, Attention>,
                                  GetBlockIterFactory<half, uint8_t, 64, 128>,
                                  DecodingCtaMap>;