Implementation:InternLM Lmdeploy GemmUniversalSm90

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, GEMM
Last Updated	2026-02-07 15:00 GMT

Overview

The first SM90 (Hopper) GEMM kernel variant using TMA (Tensor Memory Accelerator) for data loading and warp-group MMA (GMMA) instructions for FP8 matrix multiplication with per-tile FP8 scaling.

Description

Distinguishing features (v1): This is the initial SM90 GEMM implementation that establishes the producer-consumer warpgroup pattern. It uses a dedicated producer warpgroup (with reduced register allocation via warpgroup_reg_dealloc<32>) to issue TMA loads, while 2 consumer warpgroups (with increased registers via warpgroup_reg_alloc<232>) perform GMMA operations.

Key characteristics:

MMA Atom: MMA_64x112x32_F32E4M3E4M3_SS_TN -- FP8 E4M3 inputs, FP32 accumulation
Tile size: 128x224x128 (CTA_M x CTA_N x CTA_K)
Warpgroups: 2 math (1M x 2N) + 1 producer = 3 total (CTA_SIZE = 384)
Cluster: 1x2 multicast (kMulticastB=2), sharing B operand across cluster
Pipeline: 3-stage with ClusterTransactionBarrier and ClusterBarrier
Scaling: Per-tile U/V scale factors loaded separately, combined as scale_U * scale_V[n] in shared memory before accumulation
Epilogue: TMA store via SM90_TMA_STORE_2D after converting FP32 accumulators to BF16

Usage

Instantiated for SM90 FP8 GEMM workloads. Serves as the baseline SM90 design; later variants (v2-v5) refine the scheduling, tiling, and scaling strategies.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/gemm/gemm_universal_sm90.h

Signature

template<class Arch_>
struct GemmUniversalSm90 {
    static constexpr int CTA_M = 128, CTA_N = 224, CTA_K = 128;
    static constexpr int CTA_SIZE = 384;
    static constexpr int Stages = 3;

    using Ta = __nv_fp8_e4m3;
    using Tb = __nv_fp8_e4m3;
    using Tc = nv_bfloat16;

    __device__ void operator()(const CUtensorMap& tm_a, ...);
};

Import

#include "src/turbomind/kernels/gemm/gemm_universal_sm90.h"

I/O Contract

Inputs

Name	Type	Required	Description
tm_a, tm_b, tm_c	CUtensorMap	Yes	TMA descriptors for A, B, C matrices
tm_u, tm_v	CUtensorMap	Yes	TMA descriptors for scale factors
U_, V_	const void*	Yes	Scale factor pointers for direct loads
sched	Scheduler	Yes	Tile scheduler with cluster support

Outputs

Name	Type	Description
C matrix	global memory	BF16 output stored via TMA

Usage Examples

// Launched via KernelImplSm90 with cluster dimensions
gemm_kernel_sm90<Kernel><<<grid, block, smem_size, stream>>>(
    tm_a, tm_b, tm_c, tm_u, tm_v, U_ptr, ldU, V_ptr, ldV, sched);

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment