Implementation:InternLM Lmdeploy Core Mma
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Tensor_Core |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Inline PTX wrappers for NVIDIA Tensor Core matrix-multiply-accumulate (MMA) instructions across SM70, SM75, and SM80 architectures.
Description
This header provides device-inlined wrappers for the mma.sync.aligned PTX instructions used to drive Tensor Cores. It supports multiple MMA shapes and data types: mma_m8n8k4_row_col and mma_m8n8k4_row_row for SM70 (Volta) with f16 inputs and f32 accumulators; mma_m16n8k8_row_col for SM75 (Turing) with f16/bf16 inputs and f32 or f16/bf16 accumulators; and mma_m16n8k16_row_col for SM80 (Ampere) with f16/bf16 inputs. Each function maps directly to a single PTX instruction via inline assembly, with the accumulator passed by reference for fused multiply-add semantics. Fallback decomposition is provided for m16n8k16 on pre-SM80 architectures by issuing two m16n8k8 operations.
Usage
Use these wrappers in custom GEMM or attention kernels to issue Tensor Core MMA instructions from CUDA C++ code, selecting the appropriate shape and precision for the target GPU architecture.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/core/mma.h
Signature
// SM70 (Volta)
void mma_m8n8k4_row_col(Array<float, 8>& d, const Array<half, 4>& a,
const Array<half, 4>& b, Array<float, 8>& c);
// SM75 (Turing)
void mma_m16n8k8_row_col(Array<float, 4>& d, const Array<half, 4>& a,
const Array<half, 2>& b, Array<float, 4>& c);
// SM80 (Ampere) - half
void mma_m16n8k16_row_col(Array<float, 4>& d, const Array<half, 8>& a,
const Array<half, 4>& b, Array<float, 4>& c);
// SM80 (Ampere) - bfloat16
void mma_m16n8k16_row_col(Array<float, 4>& d, const Array<nv_bfloat16, 8>& a,
const Array<nv_bfloat16, 4>& b, Array<float, 4>& c);
Import
#include "src/turbomind/kernels/core/mma.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| a | Array<half/bf16, K> | Yes | Matrix A fragment (row-major) |
| b | Array<half/bf16, K> | Yes | Matrix B fragment (col-major or row-major) |
| c | Array<float/half/bf16, M> | Yes | Accumulator input (C in D = A*B + C) |
Outputs
| Name | Type | Description |
|---|---|---|
| d | Array<float/half/bf16, M>& | MMA result (D = A*B + C) |
Usage Examples
using namespace turbomind;
// SM80 m16n8k16 MMA with half precision
Array<half, 8> frag_a;
Array<half, 4> frag_b;
Array<float, 4> accum;
clear(accum);
mma_m16n8k16_row_col(accum, frag_a, frag_b, accum);