Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy GemmUniversalSm90V2

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, GEMM
Last Updated 2026-02-07 15:00 GMT

Overview

SM90 GEMM v2 with a refined warpgroup scheduling model, dynamic tile scheduling, separate SmemDescIterV2 descriptor iteration, and configurable cluster multicast for A and B operands.

Description

Distinguishing features (v2): Introduces the parameterized template structure with configurable raster order, multicast dimensions, and grouped GEMM support as template parameters rather than hardcoded values.

Key characteristics:

  • Template parameters: Order raster_order, int multicast_a, int multicast_b, bool is_grouped_gemm_
  • Tile size: 128x224x128 (same base as v1)
  • Warpgroups: 2 math (1M x 2N) + 1 producer
  • SmemDescIterV2: Improved shared memory descriptor iterator with Reset method for stage management
  • Dynamic scheduler: Uses TileScheduler with next_cluster_id_ for persistent CTA scheduling
  • Scale handling: U-scales loaded via cp.async with named barriers, V-scales loaded via TMA, combined in shared memory
  • Grouped GEMM: Template flag is_grouped_gemm_ adjusts striding mode and group axis
  • Cluster shape: Configurable via multicast_a x multicast_b

Usage

Used for SM90 FP8 GEMM when flexible cluster configurations and optional grouped GEMM are needed.

Code Reference

Source Location

Signature

template<Order raster_order, int multicast_a, int multicast_b, bool is_grouped_gemm_>
struct GemmUniversalSm90_v2 {
    static constexpr int TILE_M = 128, TILE_N = 224, TILE_K = 128;
    static constexpr int CTA_SIZE = 384;  // 3 warpgroups
    static constexpr int Stages = 4;

    using Ta = __nv_fp8_e4m3;
    using Tb = __nv_fp8_e4m3;
    using Tc = nv_bfloat16;
};

Import

#include "src/turbomind/kernels/gemm/gemm_universal_sm90_v2.h"

I/O Contract

Inputs

Name Type Required Description
tm_a, tm_b, tm_c, tm_u, tm_v CUtensorMap Yes TMA descriptors
param_A, param_B, param_U, param_V, param_C MatrixParam Yes Matrix parameters with stride/offset info
sched Scheduler Yes Dynamic tile scheduler
tensormap_buf void* Yes Buffer for runtime tensormap updates (grouped GEMM)

Outputs

Name Type Description
C matrix global memory BF16 output stored via TMA

Usage Examples

// Launched via KernelImplSm90 with cudaLaunchKernelEx for cluster support
cudaLaunchKernelEx(&config, gemm_kernel_sm90<Kernel>,
    tm_a, tm_b, tm_c, tm_u, tm_v,
    param_A, param_B, param_U, param_V, param_C, sched, tensormap_buf);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment