Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy GemmUniversalSm90V5

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, GEMM
Last Updated 2026-02-07 15:00 GMT

Overview

SM90 GEMM v5 uses 4 math warpgroups with a narrow 96-column tile (128x96x128) and 5-stage pipeline, targeting workloads that benefit from higher SM occupancy and smaller per-CTA output tiles.

Description

Distinguishing features (v5): Doubles the number of math warpgroups to 4 while reducing the N tile dimension, increasing arithmetic intensity per CTA and enabling better occupancy on SM90 with 5 pipeline stages.

Key characteristics:

  • Tile size: 128x96x128 (narrow N for higher occupancy)
  • Warpgroups: 4 math (2M x 1N, each pair processing 64x96) + 1 producer = 5 total (CTA_SIZE = 640)
  • Math group size: 256 threads (2 warpgroups per math group)
  • GMMA config: ScaledGmmaFP8_TN<64, 96, 128, 1, 1, 1, 1>
  • Pipeline: 5-stage (deepest among all SM90 variants), maximizing latency hiding
  • Scale handling: Per-block FP8 scaling with cp.async-loaded U-scales and TMA-loaded V-scales
  • Grouped GEMM: Template parameter support like v3/v4
  • Trade-off: More threads per CTA but smaller output tile; suited for medium-N problems where 224-wide tiles would be wasteful

Usage

Selected by the tuner for FP8 GEMM problems with moderate N dimensions where 4-warpgroup occupancy outperforms wider 2-warpgroup tiles.

Code Reference

Source Location

Signature

template<Order raster_order, int multicast_a, int multicast_b, bool is_grouped_gemm_>
struct GemmUniversalSm90_v5 {
    static constexpr int TILE_M = 128, TILE_N = 96, TILE_K = 128;
    static constexpr int WARPGORUPS = 4;
    static constexpr int CTA_SIZE = 640;  // 5 warpgroups x 128
    static constexpr int Stages = 5;
    static constexpr int kMathGroupSize = 256;

    using GMMA = ScaledGmmaFP8_TN<64, 96, 128, 1, 1, 1, 1>;
};

Import

#include "src/turbomind/kernels/gemm/gemm_universal_sm90_v5.h"

I/O Contract

Inputs

Name Type Required Description
tm_a, tm_b, tm_c, tm_u, tm_v CUtensorMap Yes TMA descriptors
param_A..V, param_C MatrixParam Yes Matrix parameters
sched Scheduler Yes Tile scheduler
tensormap_buf CUtensorMap* Yes Runtime tensormap buffer

Outputs

Name Type Description
C matrix global memory BF16 output via TMA store

Usage Examples

// 4-warpgroup narrow-tile variant for moderate-N problems
GemmUniversalSm90_v5<kRowMajor, 1, 1, false> kernel;

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment