Implementation:InternLM Lmdeploy Impl 1688
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Attention |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Attention implementation using m16n8k8 tensor core MMA instructions (Turing SM75), providing QK and PV compute stages with 2-stage shared memory pipelining.
Description
This is the MMA_1688 specialization of Impl, targeting SM75 (Turing) GPUs. It defines fragment types sized for the m16n8k8 MMA atom: FragQ uses Array<T,4> tiles and FragK uses Array<T,2> tiles. K tiles are loaded in (s32,d8) chunks and V tiles in (d32,s8) chunks via LDSM instructions. The shared memory layout uses union storage where Q and K/V share the same memory region (2-stage only). This implementation inherits softmax, output storage, and iteration helpers from Impl_m16k8.
Usage
Selected by AttentionConfig for SM75 (Turing) prefill attention. Paired with the Sm70 mainloop (register-based pipelining without cp.async).
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/attention/impl_1688.h
- Lines: 1-231
Signature
namespace turbomind::attention {
template<class T_, int CTA_H_, int CTA_Q_, int CTA_S_,
int WARP_H, int WARP_Q, int WARP_S, int HeadDim>
struct Impl<MMA_1688, T_, T_, CTA_H_, CTA_Q_, CTA_S_,
WARP_H, WARP_Q, WARP_S, HeadDim, 2>
: Impl_m16k8<T_, WARP_H, WARP_Q, WARP_S, HeadDim> {
using T = T_;
using Tkv = T_;
static constexpr int OP_K = 8;
using FragQ = Array<T, 4>[K_K][K_M];
using FragK = Array<T, 2>[K_K][K_N];
using FragP = Array<T, 4>[V_M][V_K];
using FragV = Array<T, 2>[V_K][V_N];
union SharedStorage {
T Q[SmemLayoutQ::kSize];
struct { Tkv K[...]; Tkv V[...]; };
};
struct StateQK { ... };
struct StatePV { ... };
static void TransformQ(T* smem_Q, FragQ& frag_Q);
static void ComputeQK(StateQK&, FragS&, int, auto&&, auto&&);
static void ComputePV(StatePV&, FragO&, int, auto&&, auto&&);
};
} // namespace turbomind::attention
Import
#include "src/turbomind/kernels/attention/impl_1688.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| T_ | typename | Yes | Data type (half or bfloat16) |
| CTA_Q_ | int | Yes | CTA query tile size |
| CTA_S_ | int | Yes | CTA sequence tile size |
| HeadDim | int | Yes | Head dimension |
Outputs
| Name | Type | Description |
|---|---|---|
| FragO | Array<float,4>[V_M][V_N] | Accumulated output fragment |
| FragM | Array<float,2>[V_M] | Row-wise maximum for softmax |
| FragL | Array<float,2>[V_M] | Row-wise sum for softmax |
Usage Examples
using Attention = Impl<MMA_1688, half, half, 1, 64, 64, 1, 16, 64, 128, 2>;
// Used with Mainloop<arch::Sm70, Attention> for Turing prefill