Implementation:InternLM Lmdeploy Core Mma

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Tensor_Core
Last Updated	2026-02-07 15:00 GMT

Overview

Inline PTX wrappers for NVIDIA Tensor Core matrix-multiply-accumulate (MMA) instructions across SM70, SM75, and SM80 architectures.

Description

This header provides device-inlined wrappers for the mma.sync.aligned PTX instructions used to drive Tensor Cores. It supports multiple MMA shapes and data types: mma_m8n8k4_row_col and mma_m8n8k4_row_row for SM70 (Volta) with f16 inputs and f32 accumulators; mma_m16n8k8_row_col for SM75 (Turing) with f16/bf16 inputs and f32 or f16/bf16 accumulators; and mma_m16n8k16_row_col for SM80 (Ampere) with f16/bf16 inputs. Each function maps directly to a single PTX instruction via inline assembly, with the accumulator passed by reference for fused multiply-add semantics. Fallback decomposition is provided for m16n8k16 on pre-SM80 architectures by issuing two m16n8k8 operations.

Usage

Use these wrappers in custom GEMM or attention kernels to issue Tensor Core MMA instructions from CUDA C++ code, selecting the appropriate shape and precision for the target GPU architecture.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/core/mma.h

Signature

// SM70 (Volta)
void mma_m8n8k4_row_col(Array<float, 8>& d, const Array<half, 4>& a,
                         const Array<half, 4>& b, Array<float, 8>& c);

// SM75 (Turing)
void mma_m16n8k8_row_col(Array<float, 4>& d, const Array<half, 4>& a,
                          const Array<half, 2>& b, Array<float, 4>& c);

// SM80 (Ampere) - half
void mma_m16n8k16_row_col(Array<float, 4>& d, const Array<half, 8>& a,
                           const Array<half, 4>& b, Array<float, 4>& c);

// SM80 (Ampere) - bfloat16
void mma_m16n8k16_row_col(Array<float, 4>& d, const Array<nv_bfloat16, 8>& a,
                           const Array<nv_bfloat16, 4>& b, Array<float, 4>& c);

Import

#include "src/turbomind/kernels/core/mma.h"

I/O Contract

Inputs

Name	Type	Required	Description
a	Array<half/bf16, K>	Yes	Matrix A fragment (row-major)
b	Array<half/bf16, K>	Yes	Matrix B fragment (col-major or row-major)
c	Array<float/half/bf16, M>	Yes	Accumulator input (C in D = A*B + C)

Outputs

Name	Type	Description
d	Array<float/half/bf16, M>&	MMA result (D = A*B + C)

Usage Examples

using namespace turbomind;

// SM80 m16n8k16 MMA with half precision
Array<half, 8> frag_a;
Array<half, 4> frag_b;
Array<float, 4> accum;
clear(accum);

mma_m16n8k16_row_col(accum, frag_a, frag_b, accum);

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment