Implementation:InternLM Lmdeploy GemmUniversalSm90V2

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, GEMM
Last Updated	2026-02-07 15:00 GMT

Overview

SM90 GEMM v2 with a refined warpgroup scheduling model, dynamic tile scheduling, separate SmemDescIterV2 descriptor iteration, and configurable cluster multicast for A and B operands.

Description

Distinguishing features (v2): Introduces the parameterized template structure with configurable raster order, multicast dimensions, and grouped GEMM support as template parameters rather than hardcoded values.

Key characteristics:

Template parameters: Order raster_order, int multicast_a, int multicast_b, bool is_grouped_gemm_
Tile size: 128x224x128 (same base as v1)
Warpgroups: 2 math (1M x 2N) + 1 producer
SmemDescIterV2: Improved shared memory descriptor iterator with Reset method for stage management
Dynamic scheduler: Uses TileScheduler with next_cluster_id_ for persistent CTA scheduling
Scale handling: U-scales loaded via cp.async with named barriers, V-scales loaded via TMA, combined in shared memory
Grouped GEMM: Template flag is_grouped_gemm_ adjusts striding mode and group axis
Cluster shape: Configurable via multicast_a x multicast_b

Usage

Used for SM90 FP8 GEMM when flexible cluster configurations and optional grouped GEMM are needed.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/gemm/gemm_universal_sm90_v2.h

Signature

template<Order raster_order, int multicast_a, int multicast_b, bool is_grouped_gemm_>
struct GemmUniversalSm90_v2 {
    static constexpr int TILE_M = 128, TILE_N = 224, TILE_K = 128;
    static constexpr int CTA_SIZE = 384;  // 3 warpgroups
    static constexpr int Stages = 4;

    using Ta = __nv_fp8_e4m3;
    using Tb = __nv_fp8_e4m3;
    using Tc = nv_bfloat16;
};

Import

#include "src/turbomind/kernels/gemm/gemm_universal_sm90_v2.h"

I/O Contract

Inputs

Name	Type	Required	Description
tm_a, tm_b, tm_c, tm_u, tm_v	CUtensorMap	Yes	TMA descriptors
param_A, param_B, param_U, param_V, param_C	MatrixParam	Yes	Matrix parameters with stride/offset info
sched	Scheduler	Yes	Dynamic tile scheduler
tensormap_buf	void*	Yes	Buffer for runtime tensormap updates (grouped GEMM)

Outputs

Name	Type	Description
C matrix	global memory	BF16 output stored via TMA

Usage Examples

// Launched via KernelImplSm90 with cudaLaunchKernelEx for cluster support
cudaLaunchKernelEx(&config, gemm_kernel_sm90<Kernel>,
    tm_a, tm_b, tm_c, tm_u, tm_v,
    param_A, param_B, param_U, param_V, param_C, sched, tensormap_buf);

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment