Implementation:InternLM Lmdeploy GemmUniversalSm90V5

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, GEMM
Last Updated	2026-02-07 15:00 GMT

Overview

SM90 GEMM v5 uses 4 math warpgroups with a narrow 96-column tile (128x96x128) and 5-stage pipeline, targeting workloads that benefit from higher SM occupancy and smaller per-CTA output tiles.

Description

Distinguishing features (v5): Doubles the number of math warpgroups to 4 while reducing the N tile dimension, increasing arithmetic intensity per CTA and enabling better occupancy on SM90 with 5 pipeline stages.

Key characteristics:

Tile size: 128x96x128 (narrow N for higher occupancy)
Warpgroups: 4 math (2M x 1N, each pair processing 64x96) + 1 producer = 5 total (CTA_SIZE = 640)
Math group size: 256 threads (2 warpgroups per math group)
GMMA config: ScaledGmmaFP8_TN<64, 96, 128, 1, 1, 1, 1>
Pipeline: 5-stage (deepest among all SM90 variants), maximizing latency hiding
Scale handling: Per-block FP8 scaling with cp.async-loaded U-scales and TMA-loaded V-scales
Grouped GEMM: Template parameter support like v3/v4
Trade-off: More threads per CTA but smaller output tile; suited for medium-N problems where 224-wide tiles would be wasteful

Usage

Selected by the tuner for FP8 GEMM problems with moderate N dimensions where 4-warpgroup occupancy outperforms wider 2-warpgroup tiles.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/gemm/gemm_universal_sm90_v5.h

Signature

template<Order raster_order, int multicast_a, int multicast_b, bool is_grouped_gemm_>
struct GemmUniversalSm90_v5 {
    static constexpr int TILE_M = 128, TILE_N = 96, TILE_K = 128;
    static constexpr int WARPGORUPS = 4;
    static constexpr int CTA_SIZE = 640;  // 5 warpgroups x 128
    static constexpr int Stages = 5;
    static constexpr int kMathGroupSize = 256;

    using GMMA = ScaledGmmaFP8_TN<64, 96, 128, 1, 1, 1, 1>;
};

Import

#include "src/turbomind/kernels/gemm/gemm_universal_sm90_v5.h"

I/O Contract

Inputs

Name	Type	Required	Description
tm_a, tm_b, tm_c, tm_u, tm_v	CUtensorMap	Yes	TMA descriptors
param_A..V, param_C	MatrixParam	Yes	Matrix parameters
sched	Scheduler	Yes	Tile scheduler
tensormap_buf	CUtensorMap*	Yes	Runtime tensormap buffer

Outputs

Name	Type	Description
C matrix	global memory	BF16 output via TMA store

Usage Examples

// 4-warpgroup narrow-tile variant for moderate-N problems
GemmUniversalSm90_v5<kRowMajor, 1, 1, false> kernel;

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment