Implementation:InternLM Lmdeploy Impl M16n8

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Shared base class for m16n8-family tensor core attention implementations, providing reusable softmax, S-to-P conversion, output storage, and per-element iteration utilities.

Description

Impl_m16k8 is the shared base struct inherited by both the MMA_16816 and MMA_1688 specializations. It is a plain templated base rather than a CRTP base: its template parameters (T, WARP_H, WARP_Q, WARP_S, HeadDim) do not include the derived type. Note that the struct is named Impl_m16k8 although the header file is impl_m16n8.h. It defines the common fragment types for scores (FragS_), output (FragO), running max (FragM), and running sum (FragL) in the m16n8 atom layout. Key methods include: ForeachML, which iterates over the per-query max (M) and sum (L) values; ForeachS, which iterates over score elements together with their (hi, qi, si) coordinates; Softmax, which implements online safe softmax with warp-level reductions; ConvertStoP, which converts float scores to half-precision probability fragments; and StoreO, which writes normalized output via a user-provided callback. The kDeferReduceL flag controls whether the L reduction is deferred to the Merge step; it is false in this base.
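The online safe softmax named in the description can be illustrated on a single query row in scalar C++. This is a minimal sketch under stated assumptions, not the kernel's code: RowState, OnlineSoftmaxStep, Finalize, and ReferenceSoftmaxDot are hypothetical names, the warp-level reductions and qk_scale are left out, and the output accumulator is collapsed to one scalar per row.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-row running state: max (M), sum of exponentials (L),
// and the un-normalized output accumulator (O), reduced to a scalar here.
struct RowState {
    float M = -std::numeric_limits<float>::infinity();
    float L = 0.f;
    float O = 0.f;
};

// Fold one non-empty tile of scores s[i] with values v[i] into the state.
inline void OnlineSoftmaxStep(RowState& st, const std::vector<float>& s, const std::vector<float>& v)
{
    float m_new = st.M;
    for (float x : s) {
        m_new = std::max(m_new, x);
    }
    // Rescale what was accumulated under the old max.
    // On the first tile st.M is -inf, so the factor is exp(-inf) = 0.
    const float rescale = std::exp(st.M - m_new);
    st.L *= rescale;
    st.O *= rescale;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const float p = std::exp(s[i] - m_new);  // exponent is <= 0, so no overflow
        st.L += p;
        st.O += p * v[i];
    }
    st.M = m_new;
}

// Normalize once at the end.
inline float Finalize(const RowState& st)
{
    return st.O / st.L;
}

// Two-pass reference: softmax(s) dotted with v.
inline float ReferenceSoftmaxDot(const std::vector<float>& s, const std::vector<float>& v)
{
    float m = s[0];
    for (float x : s) {
        m = std::max(m, x);
    }
    float sum = 0.f;
    float acc = 0.f;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const float p = std::exp(s[i] - m);
        sum += p;
        acc += p * v[i];
    }
    return acc / sum;
}
```

Feeding the scores in two tiles through OnlineSoftmaxStep and then calling Finalize should match ReferenceSoftmaxDot on the concatenated row up to floating-point rounding, which is the property that lets the kernel process the sequence tile by tile.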

Usage

Not used directly. Serves as a base class for Impl<MMA_16816, ...> and Impl<MMA_1688, ...>.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/impl_m16n8.h
Lines: 1-221

Signature

namespace turbomind::attention {

template<class T, int WARP_H, int WARP_Q, int WARP_S, int HeadDim>
struct Impl_m16k8 {
    static constexpr int OP_M = 16;
    static constexpr int OP_N = 8;

    static constexpr int K_M = WARP_Q / OP_M;
    static constexpr int K_N = WARP_S / OP_N;
    static constexpr int V_M = WARP_Q / OP_M;
    static constexpr int V_N = HeadDim / OP_N;

    template<class S>
    using FragS_ = Array<S, 4>[K_M][K_N];
    using FragO = Array<float, 4>[V_M][V_N];
    using FragM = Array<float, 2>[V_M];
    using FragS = FragS_<float>;
    using FragL = FragM;

    static constexpr bool kDeferReduceL = false;

    template<class Func>
    static void ForeachML(FragM& frag_M, FragL& frag_L, Func&& func);

    template<class Fragment, class Func>
    static void ForeachS(Fragment& S, Func&& func);

    template<bool is_residue>
    static void Softmax(FragS&, FragM&, FragM&, FragO&, float qk_scale);

    template<class FragP, class Storage>
    static void ConvertStoP(FragS&, FragP&, Storage&);

    template<bool is_norm, class Func, class Storage>
    static void StoreO(FragO&, FragL&, Storage&, Func&&);
};

} // namespace turbomind::attention
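To make the tile-count constants in the signature concrete, the sketch below reproduces only their arithmetic in a hypothetical stand-in struct (TileCounts is not part of lmdeploy) and evaluates it for one configuration. The values WARP_Q = 16, WARP_S = 64, and HeadDim = 128 are assumptions chosen for illustration, not values taken from the source.

```cpp
#include <cassert>

// Hypothetical stand-in mirroring only the compile-time arithmetic of Impl_m16k8.
template<int WARP_Q, int WARP_S, int HeadDim>
struct TileCounts {
    static constexpr int OP_M = 16;  // rows of one MMA atom
    static constexpr int OP_N = 8;   // columns of one MMA atom

    static constexpr int K_M = WARP_Q / OP_M;   // score tiles along the query axis
    static constexpr int K_N = WARP_S / OP_N;   // score tiles along the sequence axis
    static constexpr int V_M = WARP_Q / OP_M;   // output tiles along the query axis
    static constexpr int V_N = HeadDim / OP_N;  // output tiles along the head dimension

    // Each tile contributes an Array<float, 4> per thread.
    static constexpr int kScoreFloatsPerThread  = K_M * K_N * 4;
    static constexpr int kOutputFloatsPerThread = V_M * V_N * 4;
};

// Illustrative configuration (assumed values).
using Example = TileCounts<16, 64, 128>;

static_assert(Example::K_M == 1, "one 16-row tile covers 16 queries");
static_assert(Example::K_N == 8, "64 sequence positions / 8 columns per atom");
static_assert(Example::V_N == 16, "128 head dims / 8 columns per atom");
static_assert(Example::kScoreFloatsPerThread == 32, "FragS size per thread");
static_assert(Example::kOutputFloatsPerThread == 64, "FragO size per thread");
```

Under these assumed values, each thread would hold 32 floats of scores and 64 floats of output accumulator, which gives a rough sense of the register footprint implied by the fragment types.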

Import

#include "src/turbomind/kernels/attention/impl_m16n8.h"

I/O Contract

Inputs

Name	Type	Required	Description
T	typename	Yes	Data type (half, bfloat16)
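WARP_H	int	Yes	Per-warp head tile size (listed in the signature; not used by the constants shown above)
WARP_Q	int	Yes	Per-warp query tile size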
WARP_S	int	Yes	Per-warp sequence tile size
HeadDim	int	Yes	Head dimension

Outputs

Name	Type	Description
FragO	Array<float,4>[V_M][V_N]	Output fragment type definition
FragM	Array<float,2>[V_M]	Max tracking fragment type
FragL	Array<float,2>[V_M]	Sum tracking fragment type
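The element counts in these fragment types follow from how a 16x8 accumulator tile is spread over the 32 lanes of a warp. The sketch below encodes the accumulator layout documented for the PTX mma.m16n8k* instructions; Coord, AccumCoord, and CoversTileExactlyOnce are hypothetical helper names, and whether the header computes coordinates in exactly this form is an assumption.

```cpp
#include <cassert>

struct Coord {
    int row;  // 0..15 within the tile
    int col;  // 0..7 within the tile
};

// Position of accumulator element i (0..3) held by `lane` (0..31)
// inside one 16x8 tile.
inline Coord AccumCoord(int lane, int i)
{
    const int group        = lane / 4;  // 8 groups of 4 lanes
    const int tid_in_group = lane % 4;
    // Elements 0,1 sit in row `group`; elements 2,3 sit in row `group + 8`.
    return Coord{group + 8 * (i / 2), tid_in_group * 2 + (i % 2)};
}

// Sanity check: the 32 lanes x 4 elements tile the 16x8 block exactly once.
inline bool CoversTileExactlyOnce()
{
    int hits[16][8] = {};
    for (int lane = 0; lane < 32; ++lane) {
        for (int i = 0; i < 4; ++i) {
            const Coord c = AccumCoord(lane, i);
            ++hits[c.row][c.col];
        }
    }
    for (int r = 0; r < 16; ++r) {
        for (int c = 0; c < 8; ++c) {
            if (hits[r][c] != 1) {
                return false;
            }
        }
    }
    return true;
}
```

Each lane touches exactly two rows of a tile (group and group + 8), which is consistent with FragM and FragL holding two floats per tile row-block, and four elements in total, which is consistent with the Array<float, 4> entries of FragO.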

Usage Examples

// Inherited by MMA_16816 and MMA_1688 Impl specializations:
struct Impl<MMA_16816, T_, T_, ...>
    : Impl_m16k8<T_, WARP_H, WARP_Q, WARP_S, HeadDim> {
    using Base = Impl_m16k8<T_, WARP_H, WARP_Q, WARP_S, HeadDim>;
    using Base::Softmax;
    using Base::StoreO;
    // ...
};
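The using Base::... lines in the example matter because the base depends on template parameters, so its members are not found by unqualified lookup inside the derived struct. The self-contained toy below shows the same inheritance pattern; ToyBase, ToyTag, ToyImpl, and RowsTimesTwo are hypothetical names that only imitate the shape of the real classes.

```cpp
#include <cassert>

// Toy base: shared constants and a static helper, parameterized like a tile config.
template<int WARP_Q>
struct ToyBase {
    static constexpr int OP_M = 16;
    static constexpr int K_M  = WARP_Q / OP_M;

    static int Rows()
    {
        return K_M * OP_M;
    }
};

struct ToyTag {};  // stands in for an MMA tag type

// Primary template, specialized per tag.
template<class Tag, int WARP_Q>
struct ToyImpl;

template<int WARP_Q>
struct ToyImpl<ToyTag, WARP_Q>: ToyBase<WARP_Q> {
    using Base = ToyBase<WARP_Q>;
    // Members of a dependent base must be named explicitly to be usable unqualified.
    using Base::K_M;
    using Base::Rows;

    static int RowsTimesTwo()
    {
        return 2 * Rows();  // resolves to ToyBase<WARP_Q>::Rows via the using-declaration
    }
};

using ToyImpl32 = ToyImpl<ToyTag, 32>;
```

Without the using-declarations, the call to Rows() inside RowsTimesTwo would have to be written as Base::Rows() to compile.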

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
