
Implementation:InternLM Lmdeploy Gemm KernelImplSm90

Revision as of 15:14, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/InternLM_Lmdeploy_Gemm_KernelImplSm90.md)


Knowledge Sources
Domains: GPU_Kernels, GEMM
Last Updated: 2026-02-07 15:00 GMT

Overview

KernelImplSm90 is the concrete kernel implementation for SM90 (Hopper) architectures. It creates the TMA descriptors, configures the cluster launch, and launches SM90 GEMM kernels through the cudaLaunchKernelEx API.

Description

KernelImplSm90<Gemm> extends the base Kernel class for SM90-specific GEMM kernels. Unlike KernelImpl (for SM70/80), the SM90 variant must handle:

  • TMA Descriptors: Creates CUtensorMap descriptors for all operands (A, B, C, U, V) using make_2d_tma_desc with appropriate tile shapes and swizzle modes
  • Cluster Launch: Uses cudaLaunchKernelEx with cudaLaunchAttributeClusterDimension to launch with SM90 cluster support
  • Occupancy: Queries cudaOccupancyMaxActiveClusters to determine optimal grid size, respecting cluster constraints
  • Non-portable cluster sizes: Sets cudaFuncAttributeNonPortableClusterSizeAllowed for clusters larger than the portable limit of 8 thread blocks (up to 16)
  • Dynamic tensormap: Passes a workspace buffer for runtime TMA descriptor updates needed by grouped GEMM
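
The cluster and occupancy points above can be illustrated with a small host-only C++ sketch. It is not taken from lmdeploy: ClusterShape, PersistentGridSize, and NeedsNonPortableOptIn are hypothetical names, and max_active_clusters stands in for the value an occupancy query such as cudaOccupancyMaxActiveClusters would report. The sketch assumes a persistent-style launch, where the grid is capped at the number of clusters that can be resident at once and is always a whole multiple of the cluster size.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical cluster shape: thread blocks per cluster along M and N.
struct ClusterShape {
    int m;
    int n;
};

// Ceiling division for positive integers.
constexpr int64_t CeilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

// Number of thread blocks to launch for an m x n output, assuming each block
// produces one tile_m x tile_n output tile per scheduling step.
// max_active_clusters is assumed to come from an occupancy query.
inline int64_t PersistentGridSize(int64_t m, int64_t n, int tile_m, int tile_n,
                                  ClusterShape cluster, int max_active_clusters)
{
    const int64_t cluster_size = int64_t(cluster.m) * cluster.n;
    // Round the tile grid up to whole clusters in each dimension.
    const int64_t clusters_m = CeilDiv(CeilDiv(m, tile_m), cluster.m);
    const int64_t clusters_n = CeilDiv(CeilDiv(n, tile_n), cluster.n);
    const int64_t clusters_needed = clusters_m * clusters_n;
    // Never launch more clusters than can be resident at the same time.
    const int64_t clusters = std::min<int64_t>(clusters_needed, max_active_clusters);
    return clusters * cluster_size;  // always a multiple of the cluster size
}

// Clusters of more than 8 thread blocks are non-portable and need an
// explicit opt-in on the kernel before launch.
constexpr bool NeedsNonPortableOptIn(ClusterShape c) { return c.m * c.n > 8; }
```

For example, a 4096 x 4096 output with 128 x 256 tiles and a 2 x 1 cluster needs 256 clusters; with an assumed occupancy limit of 66 active clusters, the sketch returns a grid of 132 thread blocks.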

The constructor populates the kernel descriptor (desc_) with SM90-specific settings: GMMA operation class, cluster shape, per-block quantization descriptors, and grouped GEMM striding modes.

Usage

Instances are created by the Registry for the SM90 kernel variants (v1-v5). The tuner selects among these variants based on measured performance.
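
Selection by measured performance can be sketched roughly as follows. This is an illustration under stated assumptions, not the tuner's actual logic: SelectFastest and the measure callback are hypothetical, a negative return value is an invented convention for "candidate not feasible", and taking the median over a few repeats is one common way to reduce timing noise.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Returns the index of the fastest candidate, or -1 if none is feasible.
// measure(i) is a hypothetical callback that runs candidate i once and
// returns its elapsed time; a negative value marks the candidate infeasible.
inline int SelectFastest(std::size_t num_candidates,
                         const std::function<double(std::size_t)>& measure,
                         int repeats = 5)
{
    int best = -1;
    double best_time = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < num_candidates; ++i) {
        std::vector<double> samples;
        bool feasible = true;
        for (int r = 0; r < repeats; ++r) {
            const double t = measure(i);
            if (t < 0) {
                feasible = false;
                break;
            }
            samples.push_back(t);
        }
        if (!feasible || samples.empty()) {
            continue;
        }
        // Compare candidates by their median time to damp outliers.
        const std::size_t mid = samples.size() / 2;
        std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
        if (samples[mid] < best_time) {
            best_time = samples[mid];
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

In a real setting the callback would wrap a timed kernel launch (for instance, bracketed by CUDA events) for each registered variant.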

Code Reference

Source Location

Signature

template<class Gemm>
class KernelImplSm90 : public Kernel {
public:
    static constexpr int TILE_M = Gemm::TILE_M;
    static constexpr int TILE_N = Gemm::TILE_N;
    static constexpr int TILE_K = Gemm::TILE_K;

    KernelImplSm90();  // populates desc_ with SM90 specifics

    int Launch(const Operation&, float alpha,
               const void* A, ..., cudaStream_t stream) override;
    int GetMaxSplits(...) const override;  // returns 1 (no split-K)
    int GetMaxSwizzle(const int4& shape) const override;
    bool is_feasible(const GemmDesc& desc) const noexcept override;
};

Import

#include "src/turbomind/kernels/gemm/kernel_impl_sm90.h"

I/O Contract

Inputs

Name                  Type   Required  Description
Gemm (template)       type   Yes       An SM90 GemmUniversalSm90* instantiation
workspace.tensormaps  void*  Yes       Buffer for runtime TMA descriptor storage
workspace.flags       int*   Yes       Counter for dynamic tile scheduling
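
As a rough host-side analogy for a dynamic tile scheduling counter, the sketch below hands out tile indices through an atomic fetch-and-add. TileCounter is a hypothetical class, not part of lmdeploy; on the GPU the equivalent step would be an atomic increment on the counter in device memory. The Reset step reflects an assumption that the counter must start from zero for each launch.

```cpp
#include <atomic>

// Hypothetical host-side analogy of a dynamic tile scheduler.
class TileCounter {
public:
    explicit TileCounter(int num_tiles): num_tiles_(num_tiles) {}

    // Atomically claims the next unprocessed tile index.
    // Returns -1 once every tile has been handed out.
    int Claim()
    {
        const int tile = next_.fetch_add(1, std::memory_order_relaxed);
        return tile < num_tiles_ ? tile : -1;
    }

    // Resets the counter so it can be reused (assumed to be needed per launch).
    void Reset() { next_.store(0, std::memory_order_relaxed); }

private:
    std::atomic<int> next_{0};
    int              num_tiles_;
};
```

Because each index is returned by exactly one Claim call, workers that finish early simply claim more tiles, which balances load without a fixed tile-to-worker assignment.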

Outputs

Name      Type  Description
(Launch)  int   0 on success; the kernel is launched via cudaLaunchKernelEx

Usage Examples

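// Assumes op, alpha, beta, the operand pointers and descriptors, swizzle,
// splits, workspace, and stream have already been set up by the caller.
auto kernel = std::make_unique<KernelImplSm90<GemmUniversalSm90_v3<kRowMajor, 1, 2, true>>>();
const int status = kernel->Launch(op, alpha, A, Adesc, U, Udesc, B, Bdesc, V, Vdesc,
                                  beta, C, Cdesc, D, Ddesc, swizzle, splits, workspace, stream);
// status is 0 on success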

Related Pages
