Implementation:InternLM Lmdeploy Impl 1688

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Attention implementation using m16n8k8 tensor core MMA instructions (Turing SM75), providing QK and PV compute stages with 2-stage shared memory pipelining.

Description

This is the MMA_1688 specialization of Impl, targeting SM75 (Turing) GPUs. It defines fragment types sized for the m16n8k8 MMA atom: FragQ uses Array<T,4> tiles and FragK uses Array<T,2> tiles. K tiles are loaded in (s32,d8) chunks and V tiles in (d32,s8) chunks via LDSM instructions. The shared memory layout uses union storage where Q and K/V share the same memory region (2-stage only). This implementation inherits softmax, output storage, and iteration helpers from Impl_m16k8.

Usage

Selected by AttentionConfig for SM75 (Turing) prefill attention. Paired with the Sm70 mainloop (register-based pipelining without cp.async).

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/impl_1688.h
Lines: 1-231

Signature

namespace turbomind::attention {

template<class T_, int CTA_H_, int CTA_Q_, int CTA_S_,
         int WARP_H, int WARP_Q, int WARP_S, int HeadDim>
struct Impl<MMA_1688, T_, T_, CTA_H_, CTA_Q_, CTA_S_,
            WARP_H, WARP_Q, WARP_S, HeadDim, 2>
    : Impl_m16k8<T_, WARP_H, WARP_Q, WARP_S, HeadDim> {

    using T = T_;
    using Tkv = T_;
    static constexpr int OP_K = 8;

    using FragQ = Array<T, 4>[K_K][K_M];
    using FragK = Array<T, 2>[K_K][K_N];
    using FragP = Array<T, 4>[V_M][V_K];
    using FragV = Array<T, 2>[V_K][V_N];

    union SharedStorage {
        T Q[SmemLayoutQ::kSize];
        struct { Tkv K[...]; Tkv V[...]; };
    };

    struct StateQK { ... };
    struct StatePV { ... };

    static void TransformQ(T* smem_Q, FragQ& frag_Q);
    static void ComputeQK(StateQK&, FragS&, int, auto&&, auto&&);
    static void ComputePV(StatePV&, FragO&, int, auto&&, auto&&);
};

} // namespace turbomind::attention

Import

#include "src/turbomind/kernels/attention/impl_1688.h"

I/O Contract

Inputs

Name	Type	Required	Description
T_	typename	Yes	Data type (half or bfloat16)
CTA_Q_	int	Yes	CTA query tile size
CTA_S_	int	Yes	CTA sequence tile size
HeadDim	int	Yes	Head dimension

Outputs

Name	Type	Description
FragO	Array<float,4>[V_M][V_N]	Accumulated output fragment
FragM	Array<float,2>[V_M]	Row-wise maximum for softmax
FragL	Array<float,2>[V_M]	Row-wise sum for softmax

Usage Examples

using Attention = Impl<MMA_1688, half, half, 1, 64, 64, 1, 16, 64, 128, 2>;
// Used with Mainloop<arch::Sm70, Attention> for Turing prefill

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment