Implementation:InternLM Lmdeploy GemmUniversalSm90
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, GEMM |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
The first SM90 (Hopper) GEMM kernel variant using TMA (Tensor Memory Accelerator) for data loading and warp-group MMA (GMMA) instructions for FP8 matrix multiplication with per-tile FP8 scaling.
Description
Distinguishing features (v1): This is the initial SM90 GEMM implementation that establishes the producer-consumer warpgroup pattern. It uses a dedicated producer warpgroup (with reduced register allocation via warpgroup_reg_dealloc<32>) to issue TMA loads, while 2 consumer warpgroups (with increased registers via warpgroup_reg_alloc<232>) perform GMMA operations.
Key characteristics:
- MMA Atom:
MMA_64x112x32_F32E4M3E4M3_SS_TN-- FP8 E4M3 inputs, FP32 accumulation - Tile size: 128x224x128 (CTA_M x CTA_N x CTA_K)
- Warpgroups: 2 math (1M x 2N) + 1 producer = 3 total (CTA_SIZE = 384)
- Cluster: 1x2 multicast (kMulticastB=2), sharing B operand across cluster
- Pipeline: 3-stage with
ClusterTransactionBarrierandClusterBarrier - Scaling: Per-tile U/V scale factors loaded separately, combined as
scale_U * scale_V[n]in shared memory before accumulation - Epilogue: TMA store via
SM90_TMA_STORE_2Dafter converting FP32 accumulators to BF16
Usage
Instantiated for SM90 FP8 GEMM workloads. Serves as the baseline SM90 design; later variants (v2-v5) refine the scheduling, tiling, and scaling strategies.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/gemm/gemm_universal_sm90.h
Signature
template<class Arch_>
struct GemmUniversalSm90 {
static constexpr int CTA_M = 128, CTA_N = 224, CTA_K = 128;
static constexpr int CTA_SIZE = 384;
static constexpr int Stages = 3;
using Ta = __nv_fp8_e4m3;
using Tb = __nv_fp8_e4m3;
using Tc = nv_bfloat16;
__device__ void operator()(const CUtensorMap& tm_a, ...);
};
Import
#include "src/turbomind/kernels/gemm/gemm_universal_sm90.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tm_a, tm_b, tm_c | CUtensorMap | Yes | TMA descriptors for A, B, C matrices |
| tm_u, tm_v | CUtensorMap | Yes | TMA descriptors for scale factors |
| U_, V_ | const void* | Yes | Scale factor pointers for direct loads |
| sched | Scheduler | Yes | Tile scheduler with cluster support |
Outputs
| Name | Type | Description |
|---|---|---|
| C matrix | global memory | BF16 output stored via TMA |
Usage Examples
// Launched via KernelImplSm90 with cluster dimensions
gemm_kernel_sm90<Kernel><<<grid, block, smem_size, stream>>>(
tm_a, tm_b, tm_c, tm_u, tm_v, U_ptr, ldU, V_ptr, ldV, sched);