Implementation:InternLM Lmdeploy Gemm KernelImplSm90

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, GEMM
Last Updated	2026-02-07 15:00 GMT

Overview

Concrete kernel implementation for SM90 (Hopper) architectures. It creates the TMA descriptors, configures the cluster launch, and launches SM90 GEMM kernels through the cudaLaunchKernelEx API.

Description

KernelImplSm90<Gemm> extends the base Kernel class for SM90-specific GEMM kernels. Unlike KernelImpl (for SM70/80), the SM90 variant must handle:

TMA Descriptors: Creates CUtensorMap descriptors for all operands (A, B, C, U, V) using make_2d_tma_desc with appropriate tile shapes and swizzle modes
Cluster Launch: Uses cudaLaunchKernelEx with cudaLaunchAttributeClusterDimension to launch with SM90 cluster support
Occupancy: Queries cudaOccupancyMaxActiveClusters to determine optimal grid size, respecting cluster constraints
Non-portable cluster sizes: Sets cudaFuncAttributeNonPortableClusterSizeAllowed for clusters larger than the portable limit of 8 thread blocks (up to 16)
Dynamic tensormap: Passes a workspace buffer for runtime TMA descriptor updates needed by grouped GEMM
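The tile shapes and swizzle modes mentioned in the TMA item are subject to hardware limits. The following is a minimal host-side sketch of the constraints that the CUDA driver documents for 2D tiled tensor maps (16-byte alignment of address and strides, box extents of at most 256 elements, and an inner box no wider than the swizzle span). The struct `Tma2dRequest` and the function `tma_2d_request_ok` are hypothetical names used for illustration only; they are not part of lmdeploy, and the internals of make_2d_tma_desc are not shown on this page.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical description of one 2D TMA tile request (illustration only).
struct Tma2dRequest {
    std::uintptr_t base_addr;         // global memory address of the operand
    std::size_t    elem_bytes;        // element size, e.g. 2 for half/bfloat16
    std::uint64_t  row_stride_bytes;  // byte stride between consecutive rows
    std::uint32_t  box_inner;         // tile extent along the contiguous dim (elements)
    std::uint32_t  box_outer;         // tile extent along the strided dim (elements)
    std::uint32_t  swizzle_bytes;     // 0 (no swizzle), 32, 64 or 128
};

// Returns true if the request satisfies the documented tiled-tensor-map limits
// (assuming no interleaving is used).
inline bool tma_2d_request_ok(const Tma2dRequest& r) {
    if (r.base_addr % 16 != 0) return false;             // address: 16-byte aligned
    if (r.row_stride_bytes % 16 != 0) return false;      // stride: multiple of 16 bytes
    if (r.box_inner == 0 || r.box_inner > 256) return false;
    if (r.box_outer == 0 || r.box_outer > 256) return false;
    const std::size_t inner_bytes = r.box_inner * r.elem_bytes;
    if (inner_bytes % 16 != 0) return false;             // inner box: multiple of 16 bytes
    // With swizzling enabled, the inner box must fit within one swizzle span.
    if (r.swizzle_bytes != 0 && inner_bytes > r.swizzle_bytes) return false;
    return true;
}
```

For example, a 64-element-wide half-precision tile occupies exactly 128 bytes in the inner dimension and therefore fits a 128-byte swizzle, whereas a 128-element-wide tile of the same type would not.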

The constructor populates descriptors with SM90-specific settings: GMMA operation class, cluster shape, per-block quantization descriptors, and grouped GEMM striding modes.
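The interaction between the cluster shape, the occupancy query, and the non-portable opt-in can be sketched on the host side as follows. This is an assumption about how such a grid size might be derived, not the code in kernel_impl_sm90.h: `ClusterShape`, `persistent_grid_size`, and `needs_non_portable_opt_in` are hypothetical names, and `max_active_clusters` stands in for the value that cudaOccupancyMaxActiveClusters would report.

```cpp
#include <algorithm>

// Hypothetical cluster shape in thread blocks (illustration only).
struct ClusterShape {
    int x;
    int y;
    int z;
};

inline int cluster_size(const ClusterShape& c) {
    return c.x * c.y * c.z;
}

// Grid size (in thread blocks) for a persistent-style launch:
// never more blocks than can be resident, never more than the tile count
// rounded up to a whole number of clusters, always a multiple of the cluster size.
inline int persistent_grid_size(int num_tiles, int max_active_clusters, const ClusterShape& c) {
    const int cs = cluster_size(c);
    if (cs <= 0 || max_active_clusters <= 0 || num_tiles <= 0) {
        return 0;
    }
    const int resident_blocks = max_active_clusters * cs;
    const int tiles_rounded   = (num_tiles + cs - 1) / cs * cs;
    return std::min(resident_blocks, tiles_rounded);
}

// Clusters of more than 8 thread blocks exceed the portable limit and need
// cudaFuncAttributeNonPortableClusterSizeAllowed set on the kernel.
inline bool needs_non_portable_opt_in(const ClusterShape& c) {
    return cluster_size(c) > 8;
}
```

Keeping the grid a multiple of the cluster size matters because a cluster launch requires each grid dimension to be divisible by the corresponding cluster dimension.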

Usage

Created by the Registry for SM90 kernel variants (v1-v5). The tuner selects among these based on measured performance.
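Selection by measured performance can be illustrated with a small sketch. It assumes each candidate exposes a feasibility flag and a timing callback; the `Candidate` struct and `pick_fastest` function are hypothetical and do not reflect the actual Registry or tuner interfaces in lmdeploy.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Hypothetical tuning candidate (illustration only).
struct Candidate {
    std::string            name;        // e.g. "v3"
    bool                   feasible;    // result of a feasibility check for the problem
    std::function<float()> measure_ms;  // runs the kernel and returns elapsed milliseconds
};

// Returns the index of the fastest feasible candidate, or -1 if none is feasible.
inline int pick_fastest(const std::vector<Candidate>& candidates) {
    int   best    = -1;
    float best_ms = std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (!candidates[i].feasible) {
            continue;  // skip kernels that cannot handle this problem
        }
        const float ms = candidates[i].measure_ms();
        if (ms < best_ms) {
            best_ms = ms;
            best    = static_cast<int>(i);
        }
    }
    return best;
}
```

In practice the timing callback would launch the kernel several times on the target stream and average the result; the sketch leaves that detail to the caller.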

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/gemm/kernel_impl_sm90.h

Signature

template<class Gemm>
class KernelImplSm90 : public Kernel {
public:
    static constexpr int TILE_M = Gemm::TILE_M;
    static constexpr int TILE_N = Gemm::TILE_N;
    static constexpr int TILE_K = Gemm::TILE_K;

    KernelImplSm90();  // populates desc_ with SM90 specifics

    int Launch(const Operation&, float alpha,
               const void* A, ..., cudaStream_t stream) override;
    int GetMaxSplits(...) const override;  // returns 1 (no split-K)
    int GetMaxSwizzle(const int4& shape) const override;
    bool is_feasible(const GemmDesc& desc) const noexcept override;
};

Import

#include "src/turbomind/kernels/gemm/kernel_impl_sm90.h"

I/O Contract

Inputs

Name	Type	Required	Description
Gemm (template)	type	Yes	An SM90 `GemmUniversalSm90*` instantiation
workspace.tensormaps	void*	Yes	Buffer for runtime TMA descriptor storage
workspace.flags	int*	Yes	Counter for dynamic tile scheduling
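As a rough guide to sizing the workspace.tensormaps buffer, the sketch below relies on two facts about the CUDA driver type: a CUtensorMap occupies 128 bytes and requires 64-byte alignment. How many descriptors lmdeploy actually stores is not stated on this page, so the per-block layout assumed here (`maps_per_block` descriptors for each of `num_blocks` thread blocks) and the helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTensorMapBytes = 128;  // size of one CUtensorMap
constexpr std::size_t kTensorMapAlign = 64;   // required alignment of a CUtensorMap

// Bytes needed if every thread block owns its own set of descriptors
// (assumed layout, for illustration only).
inline std::size_t tensormap_workspace_bytes(std::size_t maps_per_block, std::size_t num_blocks) {
    return maps_per_block * num_blocks * kTensorMapBytes;
}

// Checks that a buffer can hold CUtensorMap objects starting at its base address.
inline bool tensormap_ptr_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kTensorMapAlign == 0;
}
```

With five descriptors per block (one per operand A, B, C, U, V) and 132 thread blocks, this layout would need 84480 bytes.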

Outputs

Name	Type	Description
(Launch)	int	Returns 0 on success; the kernel is launched via `cudaLaunchKernelEx`

Usage Examples

auto kernel = std::make_unique<KernelImplSm90<GemmUniversalSm90_v3<kRowMajor, 1, 2, true>>>();
kernel->Launch(op, alpha, A, Adesc, U, Udesc, B, Bdesc, V, Vdesc,
               beta, C, Cdesc, D, Ddesc, swizzle, splits, workspace, stream);

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
