Implementation:InternLM Lmdeploy Gemm KernelImplSm90
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, GEMM |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Concrete kernel implementation for SM90 (Hopper) architectures. It handles TMA descriptor creation, cluster launch configuration, and launching SM90 GEMM kernels through the cudaLaunchKernelEx API.
Description
KernelImplSm90<Gemm> extends the base Kernel class for SM90-specific GEMM kernels. Unlike KernelImpl (for SM70/80), the SM90 variant must handle:
- TMA Descriptors: Creates CUtensorMap descriptors for all operands (A, B, C, U, V) using make_2d_tma_desc with appropriate tile shapes and swizzle modes
- Cluster Launch: Uses cudaLaunchKernelEx with cudaLaunchAttributeClusterDimension to launch with SM90 cluster support
- Occupancy: Queries cudaOccupancyMaxActiveClusters to determine optimal grid size, respecting cluster constraints
- Non-portable cluster sizes: Sets cudaFuncAttributeNonPortableClusterSizeAllowed for large clusters (up to 16)
- Dynamic tensormap: Passes a workspace buffer for runtime TMA descriptor updates needed by grouped GEMM
The constructor populates the kernel descriptor (desc_) with SM90-specific settings: GMMA operation class, cluster shape, per-block quantization descriptors, and grouped GEMM striding modes.
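The cluster-launch and occupancy steps listed above can be sketched in plain C++ without touching the CUDA runtime. This is a minimal model, not the code in kernel_impl_sm90.h: ClusterShape, NeedsNonPortableOptIn, and PlanGridSize are hypothetical names, and max_active_clusters stands in for the value that cudaOccupancyMaxActiveClusters would report. It relies on two CUDA rules: clusters of more than 8 thread blocks need the non-portable opt-in, and the grid must be a whole multiple of the cluster size. The sizing policy itself (one persistent cluster per co-resident slot, capped by the tile count) is an assumption.

```cpp
#include <algorithm>

// Hypothetical plain-C++ model of launch planning; no CUDA calls are made here.
struct ClusterShape {
    int m;
    int n;
    int size() const { return m * n; }
};

// CUDA only guarantees portability for clusters of up to 8 thread blocks.
constexpr int kMaxPortableClusterSize = 8;

// True when a kernel with this cluster shape would need
// cudaFuncAttributeNonPortableClusterSizeAllowed to be set before launch.
bool NeedsNonPortableOptIn(ClusterShape c)
{
    return c.size() > kMaxPortableClusterSize;
}

// Assumed policy: launch as many clusters as can be co-resident, but never more
// than are needed to cover all tiles. The result is always a multiple of the
// cluster size, which the cluster launch attribute requires.
// `max_active_clusters` stands in for the cudaOccupancyMaxActiveClusters result.
int PlanGridSize(int tiles, ClusterShape c, int max_active_clusters)
{
    const int cluster_size    = c.size();
    const int clusters_needed = (tiles + cluster_size - 1) / cluster_size;
    const int clusters        = std::min(clusters_needed, max_active_clusters);
    return std::max(clusters, 1) * cluster_size;
}
```

In a real launch path the returned value would feed the grid dimension of the launch configuration passed to cudaLaunchKernelEx, with the cluster shape supplied through cudaLaunchAttributeClusterDimension.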
Usage
Instances are created by the Registry, one per SM90 kernel variant (v1-v5). The tuner then selects among these variants based on measured performance.
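Selection by measured performance can be pictured as picking the fastest feasible candidate. The sketch below is illustrative only: Candidate and SelectFastest are hypothetical, and the real tuner may weigh more than a single timing per kernel.

```cpp
#include <limits>
#include <string>
#include <vector>

// Hypothetical record of one tuning measurement; not lmdeploy's actual types.
struct Candidate {
    std::string name;      // e.g. "v3"
    bool        feasible;  // result of an is_feasible-style check for this problem
    float       millis;    // measured run time in milliseconds
};

// Return the index of the fastest feasible candidate, or -1 if none qualifies.
int SelectFastest(const std::vector<Candidate>& candidates)
{
    int   best    = -1;
    float best_ms = std::numeric_limits<float>::infinity();
    for (int i = 0; i < static_cast<int>(candidates.size()); ++i) {
        const Candidate& c = candidates[i];
        if (c.feasible && c.millis < best_ms) {
            best    = i;
            best_ms = c.millis;
        }
    }
    return best;
}
```

Filtering on feasibility before comparing timings keeps a fast but inapplicable variant from being chosen.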
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/gemm/kernel_impl_sm90.h
Signature
template<class Gemm>
class KernelImplSm90 : public Kernel {
public:
    static constexpr int TILE_M = Gemm::TILE_M;
    static constexpr int TILE_N = Gemm::TILE_N;
    static constexpr int TILE_K = Gemm::TILE_K;

    KernelImplSm90();  // populates desc_ with SM90 specifics

    int Launch(const Operation&, float alpha,
               const void* A, ..., cudaStream_t stream) override;

    int GetMaxSplits(...) const override;  // returns 1 (no split-K)
    int GetMaxSwizzle(const int4& shape) const override;
    bool is_feasible(const GemmDesc& desc) const noexcept override;
};
Import
#include "src/turbomind/kernels/gemm/kernel_impl_sm90.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| Gemm (template) | type | Yes | An SM90 GemmUniversalSm90* instantiation |
| workspace.tensormaps | void* | Yes | Buffer for runtime TMA descriptor storage |
| workspace.flags | int* | Yes | Counter for dynamic tile scheduling |
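As a rough guide to sizing the workspace.tensormaps buffer, the sketch below models one descriptor slot with the same layout CUtensorMap has in cuda.h (16 opaque 64-bit words, 64-byte aligned, 128 bytes total). TensorMapSlot, TensorMapWorkspaceBytes, and IsTensorMapAligned are hypothetical helpers, and the number of descriptors per thread block is an assumption; the source does not state how many slots the kernel actually uses.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for CUtensorMap from <cuda.h>: 16 opaque 64-bit words, 64-byte aligned.
struct alignas(64) TensorMapSlot {
    std::uint64_t opaque[16];
};
static_assert(sizeof(TensorMapSlot) == 128, "one descriptor occupies 128 bytes");

// Bytes needed if every thread block keeps `maps_per_cta` private descriptors.
// Both parameters are assumptions made for illustration.
constexpr std::size_t TensorMapWorkspaceBytes(int ctas, int maps_per_cta)
{
    return static_cast<std::size_t>(ctas) * static_cast<std::size_t>(maps_per_cta)
           * sizeof(TensorMapSlot);
}

// Descriptors must start on a 64-byte boundary; check a candidate buffer address.
bool IsTensorMapAligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(TensorMapSlot) == 0;
}
```

For example, 132 thread blocks with five descriptors each (one per operand A, B, C, U, V) would need 132 * 5 * 128 = 84480 bytes under these assumptions.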
Outputs
| Name | Type | Description |
|---|---|---|
| (Launch) | int | 0 on success; launches via cudaLaunchKernelEx |
Usage Examples
auto kernel = std::make_unique<KernelImplSm90<GemmUniversalSm90_v3<kRowMajor, 1, 2, true>>>();
// Launch returns 0 on success, so keep the status instead of discarding it.
const int status = kernel->Launch(op, alpha, A, Adesc, U, Udesc, B, Bdesc, V, Vdesc,
                                  beta, C, Cdesc, D, Ddesc, swizzle, splits, workspace, stream);