Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy GemmUniversalSm90

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, GEMM
Last Updated 2026-02-07 15:00 GMT

Overview

The first SM90 (Hopper) GEMM kernel variant using TMA (Tensor Memory Accelerator) for data loading and warp-group MMA (GMMA) instructions for FP8 matrix multiplication with per-tile FP8 scaling.

Description

Distinguishing features (v1): This is the initial SM90 GEMM implementation that establishes the producer-consumer warpgroup pattern. It uses a dedicated producer warpgroup (with reduced register allocation via warpgroup_reg_dealloc<32>) to issue TMA loads, while 2 consumer warpgroups (with increased registers via warpgroup_reg_alloc<232>) perform GMMA operations.

Key characteristics:

  • MMA Atom: MMA_64x112x32_F32E4M3E4M3_SS_TN -- FP8 E4M3 inputs, FP32 accumulation
  • Tile size: 128x224x128 (CTA_M x CTA_N x CTA_K)
  • Warpgroups: 2 math (1M x 2N) + 1 producer = 3 total (CTA_SIZE = 384)
  • Cluster: 1x2 multicast (kMulticastB=2), sharing B operand across cluster
  • Pipeline: 3-stage with ClusterTransactionBarrier and ClusterBarrier
  • Scaling: Per-tile U/V scale factors loaded separately, combined as scale_U * scale_V[n] in shared memory before accumulation
  • Epilogue: TMA store via SM90_TMA_STORE_2D after converting FP32 accumulators to BF16

Usage

Instantiated for SM90 FP8 GEMM workloads. Serves as the baseline SM90 design; later variants (v2-v5) refine the scheduling, tiling, and scaling strategies.

Code Reference

Source Location

Signature

template<class Arch_>
struct GemmUniversalSm90 {
    static constexpr int CTA_M = 128, CTA_N = 224, CTA_K = 128;
    static constexpr int CTA_SIZE = 384;
    static constexpr int Stages = 3;

    using Ta = __nv_fp8_e4m3;
    using Tb = __nv_fp8_e4m3;
    using Tc = nv_bfloat16;

    __device__ void operator()(const CUtensorMap& tm_a, ...);
};

Import

#include "src/turbomind/kernels/gemm/gemm_universal_sm90.h"

I/O Contract

Inputs

Name Type Required Description
tm_a, tm_b, tm_c CUtensorMap Yes TMA descriptors for A, B, C matrices
tm_u, tm_v CUtensorMap Yes TMA descriptors for scale factors
U_, V_ const void* Yes Scale factor pointers for direct loads
sched Scheduler Yes Tile scheduler with cluster support

Outputs

Name Type Description
C matrix global memory BF16 output stored via TMA

Usage Examples

// Launched via KernelImplSm90 with cluster dimensions
gemm_kernel_sm90<Kernel><<<grid, block, smem_size, stream>>>(
    tm_a, tm_b, tm_c, tm_u, tm_v, U_ptr, ldU, V_ptr, ldV, sched);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment