Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:InternLM Lmdeploy Core Mma

From Leeroopedia
Revision as of 15:14, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/InternLM_Lmdeploy_Core_Mma.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains GPU_Kernels, Tensor_Core
Last Updated 2026-02-07 15:00 GMT

Overview

Inline PTX wrappers for NVIDIA Tensor Core matrix-multiply-accumulate (MMA) instructions across SM70, SM75, and SM80 architectures.

Description

This header provides device-inlined wrappers for the mma.sync.aligned PTX instructions used to drive Tensor Cores. It supports multiple MMA shapes and data types: mma_m8n8k4_row_col and mma_m8n8k4_row_row for SM70 (Volta) with f16 inputs and f32 accumulators; mma_m16n8k8_row_col for SM75 (Turing) with f16/bf16 inputs and f32 or f16/bf16 accumulators; and mma_m16n8k16_row_col for SM80 (Ampere) with f16/bf16 inputs. Each function maps directly to a single PTX instruction via inline assembly, with the accumulator passed by reference for fused multiply-add semantics. Fallback decomposition is provided for m16n8k16 on pre-SM80 architectures by issuing two m16n8k8 operations.

Usage

Use these wrappers in custom GEMM or attention kernels to issue Tensor Core MMA instructions from CUDA C++ code, selecting the appropriate shape and precision for the target GPU architecture.

Code Reference

Source Location

Signature

// SM70 (Volta)
void mma_m8n8k4_row_col(Array<float, 8>& d, const Array<half, 4>& a,
                         const Array<half, 4>& b, Array<float, 8>& c);

// SM75 (Turing)
void mma_m16n8k8_row_col(Array<float, 4>& d, const Array<half, 4>& a,
                          const Array<half, 2>& b, Array<float, 4>& c);

// SM80 (Ampere) - half
void mma_m16n8k16_row_col(Array<float, 4>& d, const Array<half, 8>& a,
                           const Array<half, 4>& b, Array<float, 4>& c);

// SM80 (Ampere) - bfloat16
void mma_m16n8k16_row_col(Array<float, 4>& d, const Array<nv_bfloat16, 8>& a,
                           const Array<nv_bfloat16, 4>& b, Array<float, 4>& c);

Import

#include "src/turbomind/kernels/core/mma.h"

I/O Contract

Inputs

Name Type Required Description
a Array<half/bf16, K> Yes Matrix A fragment (row-major)
b Array<half/bf16, K> Yes Matrix B fragment (col-major or row-major)
c Array<float/half/bf16, M> Yes Accumulator input (C in D = A*B + C)

Outputs

Name Type Description
d Array<float/half/bf16, M>& MMA result (D = A*B + C)

Usage Examples

using namespace turbomind;

// SM80 m16n8k16 MMA with half precision
Array<half, 8> frag_a;
Array<half, 4> frag_b;
Array<float, 4> accum;
clear(accum);

mma_m16n8k16_row_col(accum, frag_a, frag_b, accum);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment