Implementation:InternLM Lmdeploy Impl M16n8
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Shared base class for m16n8-family tensor core attention implementations, providing reusable softmax, S-to-P conversion, output storage, and per-element iteration utilities.
Description
Impl_m16k8 is the templated base struct shared by the MMA_16816 and MMA_1688 specializations. Its template parameters do not include a derived type, so it is reused through ordinary inheritance rather than CRTP. It defines the common fragment types for scores (FragS_), output (FragO), running max (FragM), and running sum (FragL) in the m16n8 atom layout. Key methods:
- ForeachML: iterates over the per-query running max (M) and sum (L) values.
- ForeachS: iterates over score elements together with their (hi, qi, si) coordinates.
- Softmax: implements online safe softmax with warp-level reductions.
- ConvertStoP: converts float scores to half-precision probability fragments.
- StoreO: writes the normalized output through a user-provided callback.
The kDeferReduceL flag controls whether the L reduction is deferred to the Merge step; it is false in this base.
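The online safe softmax mentioned above can be illustrated with a scalar, host-side sketch. It is not the kernel code: the names RowState, OnlineSoftmaxStep, and Finalize are hypothetical, the scores are assumed to be already multiplied by qk_scale, and the warp-level reductions and fragment layout are left out. It only shows how a running max and running sum allow tiles of scores to be folded in one at a time, with earlier partial results rescaled whenever the max grows.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Running state for one query row. Field names are illustrative, loosely
// mirroring the roles of FragM (max), FragL (sum), and FragO (output).
struct RowState {
    float              m = -std::numeric_limits<float>::infinity();  // running max
    float              l = 0.f;                                      // running sum of exp
    std::vector<float> o;                                            // unnormalized output

    explicit RowState(std::size_t head_dim): o(head_dim, 0.f) {}
};

// Fold one tile of already-scaled scores and their value rows into the state.
// Each values[j] is assumed to hold at least st.o.size() elements.
inline void OnlineSoftmaxStep(RowState&                              st,
                              const std::vector<float>&              scores,
                              const std::vector<std::vector<float>>& values)
{
    assert(scores.size() == values.size());

    float m_new = st.m;
    for (float s : scores) {
        m_new = std::fmax(m_new, s);
    }
    if (std::isinf(m_new) && m_new < 0.f) {
        return;  // every score seen so far is masked; nothing to accumulate
    }

    // Rescale previous partial results to the new max. On the first tile
    // st.m is -inf, so alpha is exp(-inf) == 0 and the empty state stays zero.
    const float alpha = std::exp(st.m - m_new);
    st.l *= alpha;
    for (float& x : st.o) {
        x *= alpha;
    }

    for (std::size_t j = 0; j < scores.size(); ++j) {
        const float p = std::exp(scores[j] - m_new);  // exponent <= 0, so no overflow
        st.l += p;
        for (std::size_t d = 0; d < st.o.size(); ++d) {
            st.o[d] += p * values[j][d];
        }
    }
    st.m = m_new;
}

// Final normalization: divide the accumulator by the running sum.
inline std::vector<float> Finalize(const RowState& st)
{
    assert(st.l > 0.f);
    std::vector<float> out(st.o.size());
    for (std::size_t d = 0; d < out.size(); ++d) {
        out[d] = st.o[d] / st.l;
    }
    return out;
}
```

Calling OnlineSoftmaxStep once per key/value tile and Finalize at the end should give the same result as a one-shot softmax over all scores, up to floating-point rounding. The final division plays the role that the description assigns to StoreO.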
Usage
This struct is not instantiated directly. It serves as the base class for Impl<MMA_16816, ...> and Impl<MMA_1688, ...>.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/impl_m16n8.h
- Lines: 1-221
Signature
namespace turbomind::attention {

template<class T, int WARP_H, int WARP_Q, int WARP_S, int HeadDim>
struct Impl_m16k8 {
    static constexpr int OP_M = 16;
    static constexpr int OP_N = 8;

    static constexpr int K_M = WARP_Q / OP_M;
    static constexpr int K_N = WARP_S / OP_N;

    static constexpr int V_M = WARP_Q / OP_M;
    static constexpr int V_N = HeadDim / OP_N;

    template<class S>
    using FragS_ = Array<S, 4>[K_M][K_N];

    using FragO = Array<float, 4>[V_M][V_N];
    using FragM = Array<float, 2>[V_M];
    using FragS = FragS_<float>;
    using FragL = FragM;

    static constexpr bool kDeferReduceL = false;

    template<class Func>
    static void ForeachML(FragM& frag_M, FragL& frag_L, Func&& func);

    template<class Fragment, class Func>
    static void ForeachS(Fragment& S, Func&& func);

    template<bool is_residue>
    static void Softmax(FragS&, FragM&, FragM&, FragO&, float qk_scale);

    template<class FragP, class Storage>
    static void ConvertStoP(FragS&, FragP&, Storage&);

    template<bool is_norm, class Func, class Storage>
    static void StoreO(FragO&, FragL&, Storage&, Func&&);
};

}  // namespace turbomind::attention
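To make the tile-count arithmetic in the signature concrete, the following sketch recomputes the same constants in a standalone struct. TileCounts and the kFloats* members are hypothetical names used only for illustration, and the divisibility static_asserts are an assumption of this sketch rather than something the signature states.

```cpp
// Hypothetical mirror of the compile-time constants in the signature.
// It is not part of lmdeploy; it only reproduces the arithmetic.
template<int WARP_Q, int WARP_S, int HeadDim>
struct TileCounts {
    static constexpr int OP_M = 16;  // rows covered by one m16n8 atom
    static constexpr int OP_N = 8;   // columns covered by one m16n8 atom

    // Assumption of this sketch: tile sizes are whole multiples of the atom.
    static_assert(WARP_Q % OP_M == 0, "WARP_Q must be a multiple of 16");
    static_assert(WARP_S % OP_N == 0, "WARP_S must be a multiple of 8");
    static_assert(HeadDim % OP_N == 0, "HeadDim must be a multiple of 8");

    static constexpr int K_M = WARP_Q / OP_M;   // score atoms along the query axis
    static constexpr int K_N = WARP_S / OP_N;   // score atoms along the sequence axis
    static constexpr int V_M = WARP_Q / OP_M;   // output atoms along the query axis
    static constexpr int V_N = HeadDim / OP_N;  // output atoms along the head dimension

    // Floats held per thread by each fragment type, following the array
    // shapes Array<float, 4>[K_M][K_N], Array<float, 4>[V_M][V_N], and
    // Array<float, 2>[V_M] from the signature.
    static constexpr int kFloatsS = K_M * K_N * 4;
    static constexpr int kFloatsO = V_M * V_N * 4;
    static constexpr int kFloatsM = V_M * 2;
};
```

For example, with WARP_Q = 16, WARP_S = 64, and HeadDim = 128 (values chosen only for illustration), this gives K_M = 1, K_N = 8, V_M = 1, and V_N = 16, so each thread would hold 32 score floats, 64 output floats, and 2 max values.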
Import
#include "src/turbomind/kernels/attention/impl_m16n8.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| T | typename | Yes | Data type (half, bfloat16) |
| WARP_H | int | Yes | Per-warp head tile size |
| WARP_Q | int | Yes | Per-warp query tile size |
| WARP_S | int | Yes | Per-warp sequence tile size |
| HeadDim | int | Yes | Head dimension |
Outputs
| Name | Type | Description |
|---|---|---|
| FragO | Array<float,4>[V_M][V_N] | Output fragment type definition |
| FragM | Array<float,2>[V_M] | Max tracking fragment type |
| FragL | Array<float,2>[V_M] | Sum tracking fragment type |
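The fragment shapes in this table are consistent with the accumulator layout that the PTX ISA documents for m16n8 MMA instructions: each of the 32 lanes holds four floats of the 16x8 tile, covering two rows (eight apart) and two adjacent columns. That would explain why FragM and FragL carry two values per atom row block. The helper below is a sketch of that mapping; Coord and AccumCoord are hypothetical names, and whether ForeachS computes its coordinates in exactly this form is an assumption.

```cpp
// Position of one accumulator element inside a 16x8 tile.
struct Coord {
    int row;  // 0..15, query index within the atom
    int col;  // 0..7, key index within the atom
};

// Maps (lane, i) to the tile position of the i-th float (i in 0..3) held by
// that lane, following the m16n8 accumulator layout in the PTX ISA:
//   row = lane / 4 + 8 * (i / 2)
//   col = (lane % 4) * 2 + (i % 2)
constexpr Coord AccumCoord(int lane, int i)
{
    return Coord{lane / 4 + (i / 2) * 8, (lane % 4) * 2 + (i % 2)};
}
```

Under this layout the four lanes with the same value of lane / 4 share a pair of rows, so a per-row max or sum needs to be combined across only those four lanes.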
Usage Examples
// Inherited by MMA_16816 and MMA_1688 Impl specializations:
struct Impl<MMA_16816, T_, T_, ...>
    : Impl_m16k8<T_, WARP_H, WARP_Q, WARP_S, HeadDim> {
    using Base = Impl_m16k8<T_, WARP_H, WARP_Q, WARP_S, HeadDim>;
    using Base::Softmax;
    using Base::StoreO;
    // ...
};
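The excerpt above elides the template header and most of the specialization. As a compilable illustration of the same inheritance pattern, the sketch below uses stand-in types only: MMA_16816_Tag, BaseSketch, ImplSketch, RowsPerThread, OP_K, and K_K are hypothetical names for this example and do not reproduce the lmdeploy definitions.

```cpp
// Hypothetical tag type standing in for an MMA selector.
struct MMA_16816_Tag {};

// Stand-in for the shared base: static constants and static helpers only.
template<class T, int WARP_Q, int WARP_S, int HeadDim>
struct BaseSketch {
    static constexpr int OP_M = 16;
    static constexpr int OP_N = 8;
    static constexpr int K_N  = WARP_S / OP_N;

    // Each m16n8 atom gives a thread two rows, so V_M atoms give 2 * V_M rows.
    static constexpr int RowsPerThread()
    {
        return 2 * (WARP_Q / OP_M);
    }
};

// Primary template is left undefined; only tagged specializations exist.
template<class Tag, class T, int WARP_Q, int WARP_S, int HeadDim>
struct ImplSketch;

// The specialization adds instruction-specific constants and reuses
// everything else from the base through ordinary inheritance.
template<class T, int WARP_Q, int WARP_S, int HeadDim>
struct ImplSketch<MMA_16816_Tag, T, WARP_Q, WARP_S, HeadDim>
    : BaseSketch<T, WARP_Q, WARP_S, HeadDim> {
    using Base = BaseSketch<T, WARP_Q, WARP_S, HeadDim>;
    using Base::RowsPerThread;

    static constexpr int OP_K = 16;              // illustrative: k extent of an m16n8k16 atom
    static constexpr int K_K  = HeadDim / OP_K;  // illustrative: atoms along the head dimension
};
```

Because the base takes no derived-type parameter, the specialization reaches the shared static members through the Base alias and using-declarations, and callers see them as members of the derived type.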