Implementation:InternLM Lmdeploy GemmUniversalSm90V5
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, GEMM |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
SM90 GEMM v5 uses 4 math warpgroups with a narrow 96-column tile (128x96x128) and 5-stage pipeline, targeting workloads that benefit from higher SM occupancy and smaller per-CTA output tiles.
Description
Distinguishing features (v5): Doubles the number of math warpgroups to 4 while reducing the N tile dimension, increasing arithmetic intensity per CTA and enabling better occupancy on SM90 with 5 pipeline stages.
Key characteristics:
- Tile size: 128x96x128 (narrow N for higher occupancy)
- Warpgroups: 4 math (2M x 1N, each pair processing 64x96) + 1 producer = 5 total (CTA_SIZE = 640)
- Math group size: 256 threads (2 warpgroups per math group)
- GMMA config:
ScaledGmmaFP8_TN<64, 96, 128, 1, 1, 1, 1> - Pipeline: 5-stage (deepest among all SM90 variants), maximizing latency hiding
- Scale handling: Per-block FP8 scaling with cp.async-loaded U-scales and TMA-loaded V-scales
- Grouped GEMM: Template parameter support like v3/v4
- Trade-off: More threads per CTA but smaller output tile; suited for medium-N problems where 224-wide tiles would be wasteful
Usage
Selected by the tuner for FP8 GEMM problems with moderate N dimensions where 4-warpgroup occupancy outperforms wider 2-warpgroup tiles.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/gemm/gemm_universal_sm90_v5.h
Signature
template<Order raster_order, int multicast_a, int multicast_b, bool is_grouped_gemm_>
struct GemmUniversalSm90_v5 {
static constexpr int TILE_M = 128, TILE_N = 96, TILE_K = 128;
static constexpr int WARPGORUPS = 4;
static constexpr int CTA_SIZE = 640; // 5 warpgroups x 128
static constexpr int Stages = 5;
static constexpr int kMathGroupSize = 256;
using GMMA = ScaledGmmaFP8_TN<64, 96, 128, 1, 1, 1, 1>;
};
Import
#include "src/turbomind/kernels/gemm/gemm_universal_sm90_v5.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tm_a, tm_b, tm_c, tm_u, tm_v | CUtensorMap | Yes | TMA descriptors |
| param_A..V, param_C | MatrixParam | Yes | Matrix parameters |
| sched | Scheduler | Yes | Tile scheduler |
| tensormap_buf | CUtensorMap* | Yes | Runtime tensormap buffer |
Outputs
| Name | Type | Description |
|---|---|---|
| C matrix | global memory | BF16 output via TMA store |
Usage Examples
// 4-warpgroup narrow-tile variant for moderate-N problems
GemmUniversalSm90_v5<kRowMajor, 1, 1, false> kernel;