Implementation:InternLM Lmdeploy GemmUniversalSm90V2
Appearance
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, GEMM |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
SM90 GEMM v2 with a refined warpgroup scheduling model, dynamic tile scheduling, separate SmemDescIterV2 descriptor iteration, and configurable cluster multicast for A and B operands.
Description
Distinguishing features (v2): Introduces the parameterized template structure with configurable raster order, multicast dimensions, and grouped GEMM support as template parameters rather than hardcoded values.
Key characteristics:
- Template parameters:
Order raster_order, int multicast_a, int multicast_b, bool is_grouped_gemm_ - Tile size: 128x224x128 (same base as v1)
- Warpgroups: 2 math (1M x 2N) + 1 producer
- SmemDescIterV2: Improved shared memory descriptor iterator with
Resetmethod for stage management - Dynamic scheduler: Uses
TileSchedulerwithnext_cluster_id_for persistent CTA scheduling - Scale handling: U-scales loaded via
cp.asyncwith named barriers, V-scales loaded via TMA, combined in shared memory - Grouped GEMM: Template flag
is_grouped_gemm_adjusts striding mode and group axis - Cluster shape: Configurable via
multicast_axmulticast_b
Usage
Used for SM90 FP8 GEMM when flexible cluster configurations and optional grouped GEMM are needed.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/gemm/gemm_universal_sm90_v2.h
Signature
template<Order raster_order, int multicast_a, int multicast_b, bool is_grouped_gemm_>
struct GemmUniversalSm90_v2 {
static constexpr int TILE_M = 128, TILE_N = 224, TILE_K = 128;
static constexpr int CTA_SIZE = 384; // 3 warpgroups
static constexpr int Stages = 4;
using Ta = __nv_fp8_e4m3;
using Tb = __nv_fp8_e4m3;
using Tc = nv_bfloat16;
};
Import
#include "src/turbomind/kernels/gemm/gemm_universal_sm90_v2.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tm_a, tm_b, tm_c, tm_u, tm_v | CUtensorMap | Yes | TMA descriptors |
| param_A, param_B, param_U, param_V, param_C | MatrixParam | Yes | Matrix parameters with stride/offset info |
| sched | Scheduler | Yes | Dynamic tile scheduler |
| tensormap_buf | void* | Yes | Buffer for runtime tensormap updates (grouped GEMM) |
Outputs
| Name | Type | Description |
|---|---|---|
| C matrix | global memory | BF16 output stored via TMA |
Usage Examples
// Launched via KernelImplSm90 with cudaLaunchKernelEx for cluster support
cudaLaunchKernelEx(&config, gemm_kernel_sm90<Kernel>,
tm_a, tm_b, tm_c, tm_u, tm_v,
param_A, param_B, param_U, param_V, param_C, sched, tensormap_buf);
Related Pages
Page Connections
Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment