Implementation:NVIDIA TransformerEngine GEMM C API

Field	Value
Sources	TransformerEngine
Domains	Deep_Learning, Optimization
Last Updated	2026-02-07 14:00 GMT

Overview

Declares the C API for matrix multiplication (GEMM) operations backed by cuBLAS/cuBLASLt, supporting both standard and grouped (batched) GEMM with configurable epilogues for bias, GELU, and FP8 split accumulation.

Description

gemm.h defines the foundational compute primitive for all linear layers in TransformerEngine. It provides:

NVTEMatmulConfig: Opaque configuration type with attributes for the bias tensor, dbias tensor, GELU/dGELU epilogues, epilogue auxiliary tensor, split accumulator, and SM count.
NVTEGroupedMatmulConfig: Configuration for grouped GEMM with average M/N/K hints for cuBLASLt algorithm selection.
nvte_cublas_gemm: Original GEMM interface, now deprecated in favor of nvte_cublas_gemm_v2.
nvte_cublas_gemm_v2: Successor that takes an NVTEMatmulConfig and adds alpha/beta scaling and a C matrix operand.
nvte_cublas_grouped_gemm: Grouped GEMM for batched operations.
C++ RAII wrappers: MatmulConfigWrapper and GroupedMatmulConfigWrapper for safe resource management.
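
To make the grouped GEMM and its average M/N/K hints concrete, here is a minimal CPU sketch. It is not the cuBLASLt-backed implementation and does not use the TransformerEngine API: `GemmProblem`, `reference_gemm`, `grouped_gemm_reference`, `AverageDims`, and `average_dims` are hypothetical names for this illustration, and row-major dense storage is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one GEMM in a group:
// D (m x n) = A (m x k) * B (k x n), all row-major and densely packed.
struct GemmProblem {
  std::size_t m, n, k;
  std::vector<float> a;  // m * k elements
  std::vector<float> b;  // k * n elements
};

// Plain triple-loop reference GEMM for a single problem.
std::vector<float> reference_gemm(const GemmProblem &p) {
  assert(p.a.size() == p.m * p.k && p.b.size() == p.k * p.n);
  std::vector<float> d(p.m * p.n, 0.0f);
  for (std::size_t i = 0; i < p.m; ++i)
    for (std::size_t j = 0; j < p.n; ++j)
      for (std::size_t l = 0; l < p.k; ++l)
        d[i * p.n + j] += p.a[i * p.k + l] * p.b[l * p.n + j];
  return d;
}

// A grouped GEMM runs many independent problems, whose shapes may differ,
// behind a single call. Here they are simply processed one after another.
std::vector<std::vector<float>> grouped_gemm_reference(
    const std::vector<GemmProblem> &group) {
  std::vector<std::vector<float>> out;
  out.reserve(group.size());
  for (const GemmProblem &p : group) out.push_back(reference_gemm(p));
  return out;
}

// Average M/N/K over the group: the kind of shape summary a caller could
// supply as a hint for algorithm selection.
struct AverageDims {
  double m, n, k;
};

AverageDims average_dims(const std::vector<GemmProblem> &group) {
  assert(!group.empty());
  double m = 0.0, n = 0.0, k = 0.0;
  for (const GemmProblem &p : group) {
    m += static_cast<double>(p.m);
    n += static_cast<double>(p.n);
    k += static_cast<double>(p.k);
  }
  const double count = static_cast<double>(group.size());
  return AverageDims{m / count, n / count, k / count};
}
```

The sketch only fixes the semantics: each problem in the group is an ordinary GEMM, and the averages summarize the group's shapes. How the library uses such hints internally is not covered here.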

Usage

Every forward and backward pass through projection, feed-forward, and attention output layers routes through these GEMM functions, which makes this one of the most performance-critical APIs in the library.
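
As a sketch of why linear layers lean so heavily on GEMM, the standard math of a bias-free linear layer uses one GEMM in the forward pass and two in the backward pass. The code below is a plain CPU illustration of that math, not TransformerEngine code: `Matrix`, `gemm`, `linear_forward`, `linear_dgrad`, and `linear_wgrad` are hypothetical names, and the weight layout (out_features x in_features, row-major) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix; a helper type for this sketch only.
struct Matrix {
  std::size_t rows, cols;
  std::vector<float> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
  float &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Reference D = op(A) * op(B), where op(X) is X or its transpose.
Matrix gemm(const Matrix &a, bool trans_a, const Matrix &b, bool trans_b) {
  const std::size_t m = trans_a ? a.cols : a.rows;
  const std::size_t k = trans_a ? a.rows : a.cols;
  const std::size_t kb = trans_b ? b.cols : b.rows;
  const std::size_t n = trans_b ? b.rows : b.cols;
  assert(k == kb);  // inner dimensions must agree
  Matrix d(m, n);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t l = 0; l < k; ++l) {
        const float av = trans_a ? a.at(l, i) : a.at(i, l);
        const float bv = trans_b ? b.at(j, l) : b.at(l, j);
        d.at(i, j) += av * bv;
      }
  return d;
}

// Forward pass: Y = X * W^T, with X (tokens x in) and W (out x in).
Matrix linear_forward(const Matrix &x, const Matrix &w) {
  return gemm(x, false, w, true);
}

// Backward pass, input gradient: dX = dY * W.
Matrix linear_dgrad(const Matrix &dy, const Matrix &w) {
  return gemm(dy, false, w, false);
}

// Backward pass, weight gradient: dW = dY^T * X.
Matrix linear_wgrad(const Matrix &dy, const Matrix &x) {
  return gemm(dy, true, x, false);
}
```

Each of the three functions is a single GEMM that differs only in operands and transpose flags, which is why one well-tuned GEMM entry point can serve both passes.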

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/common/include/transformer_engine/gemm.h
Lines: 1-509

Signature

typedef void *NVTEMatmulConfig;

enum NVTEMatmulConfigAttribute {
  kNVTEMatmulConfigBiasTensor = 0,
  kNVTEMatmulConfigDBiasTensor = 1,
  kNVTEMatmulConfigWithGELUEpilogue = 2,
  kNVTEMatmulConfigWithDGELUEpilogue = 3,
  kNVTEMatmulConfigEpilogueAuxTensor = 4,
  kNVTEMatmulConfigUseSplitAccumulator = 5,
  kNVTEMatmulConfigSMCount = 6,
};

NVTEMatmulConfig nvte_create_matmul_config();
void nvte_destroy_matmul_config(NVTEMatmulConfig config);
void nvte_set_matmul_config_attribute(NVTEMatmulConfig config,
                                      NVTEMatmulConfigAttribute attr,
                                      const void *buf, size_t size_in_bytes);

void nvte_cublas_gemm_v2(int transa, int transb, const float *alpha,
                         const NVTETensor A, const NVTETensor B,
                         const float *beta, const NVTETensor C, NVTETensor D,
                         NVTETensor workspace, NVTEMatmulConfig config,
                         cudaStream_t stream);
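
The create / set-attribute / destroy trio above follows a common C pattern: an opaque handle whose fields are written through a typed enum plus a (pointer, size) pair. The following is a self-contained mock of that pattern with an RAII guard on top. It is a sketch under stated assumptions, not the library's implementation: every `Demo*` and `demo_*` name is hypothetical, and the storage layout, size checks, and boolean return value are choices made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Opaque handle in the same `typedef void *` style as the C API above.
typedef void *DemoConfig;

// Hypothetical attribute keys for the mock.
enum DemoConfigAttribute {
  kDemoConfigUseSplitAccumulator = 0,
  kDemoConfigSMCount = 1
};

// Hidden state behind the handle; callers never depend on this layout.
struct DemoConfigImpl {
  bool use_split_accumulator = false;
  int sm_count = 0;
};

DemoConfig demo_create_config() { return new DemoConfigImpl(); }

void demo_destroy_config(DemoConfig config) {
  delete static_cast<DemoConfigImpl *>(config);
}

// Copies `size_in_bytes` bytes from `buf` into the selected field.
// Returns false when the size does not match the field's type (mock policy).
bool demo_set_config_attribute(DemoConfig config, DemoConfigAttribute attr,
                               const void *buf, std::size_t size_in_bytes) {
  if (config == nullptr || buf == nullptr) return false;
  DemoConfigImpl *impl = static_cast<DemoConfigImpl *>(config);
  switch (attr) {
    case kDemoConfigUseSplitAccumulator:
      if (size_in_bytes != sizeof(bool)) return false;
      std::memcpy(&impl->use_split_accumulator, buf, size_in_bytes);
      return true;
    case kDemoConfigSMCount:
      if (size_in_bytes != sizeof(int)) return false;
      std::memcpy(&impl->sm_count, buf, size_in_bytes);
      return true;
  }
  return false;
}

// Read-back helpers so the mock can be inspected.
int demo_get_sm_count(DemoConfig config) {
  return static_cast<DemoConfigImpl *>(config)->sm_count;
}
bool demo_get_use_split_accumulator(DemoConfig config) {
  return static_cast<DemoConfigImpl *>(config)->use_split_accumulator;
}

// RAII guard: creates the handle on construction, destroys it on scope exit.
class ScopedDemoConfig {
 public:
  ScopedDemoConfig() : config_(demo_create_config()) {}
  ~ScopedDemoConfig() { demo_destroy_config(config_); }
  ScopedDemoConfig(const ScopedDemoConfig &) = delete;
  ScopedDemoConfig &operator=(const ScopedDemoConfig &) = delete;

  DemoConfig get() const { return config_; }

  bool set_sm_count(int sm_count) {
    return demo_set_config_attribute(config_, kDemoConfigSMCount, &sm_count,
                                     sizeof(sm_count));
  }

 private:
  DemoConfig config_;
};
```

Passing attributes as a (pointer, size) pair keeps the C ABI stable when new attributes are added, and the guard removes the need to pair every create call with a manual destroy.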

Import

#include "transformer_engine/gemm.h"

I/O Contract

Inputs

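Name	Type	Required	Description
`transa`	`int`	Yes	Whether to transpose A
`transb`	`int`	Yes	Whether to transpose B
`alpha`	`const float *`	Yes	Scaling factor applied to the matrix product
`A`	`NVTETensor`	Yes	First matrix operand
`B`	`NVTETensor`	Yes	Second matrix operand
`beta`	`const float *`	Yes	Scaling factor applied to C
`C`	`NVTETensor`	Yes	Matrix added to the scaled product
`workspace`	`NVTETensor`	Yes	Workspace tensor
`config`	`NVTEMatmulConfig`	Yes	GEMM configuration (bias, epilogues, etc.)
`stream`	`cudaStream_t`	Yes	CUDA stream for execution

Outputs

Name	Type	Description
`D`	`NVTETensor`	Result matrix D = alpha * op(A) * op(B) + beta * C (with optional epilogue)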
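
To show what an epilogue adds on top of the plain matrix product, here is a CPU reference for a GEMM followed by an optional bias add and an optional GELU. It is a sketch under assumptions, not the fused kernel: `gelu` and `gemm_bias_gelu_reference` are hypothetical names, the layout is row-major, the bias is assumed to hold one value per output column, and the exact erf-based GELU is used without any claim about which formulation the fused path applies.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
float gelu(float x) {
  return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// D (m x n) = epilogue(A (m x k) * B (k x n)), all row-major.
// An empty `bias` skips the bias add; `apply_gelu` toggles the activation.
std::vector<float> gemm_bias_gelu_reference(const std::vector<float> &a,
                                            const std::vector<float> &b,
                                            const std::vector<float> &bias,
                                            std::size_t m, std::size_t n,
                                            std::size_t k, bool apply_gelu) {
  assert(a.size() == m * k && b.size() == k * n);
  assert(bias.empty() || bias.size() == n);
  std::vector<float> d(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t l = 0; l < k; ++l) acc += a[i * k + l] * b[l * n + j];
      if (!bias.empty()) acc += bias[j];  // broadcast bias across rows
      d[i * n + j] = apply_gelu ? gelu(acc) : acc;
    }
  }
  return d;
}
```

A fused epilogue performs the bias add and activation while the accumulator is still in registers, which avoids a second pass over D. The reference above only fixes the expected numerical result for comparison.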

Usage Examples

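#include "transformer_engine/gemm.h"

// Assumes A, B, D, workspace, and bias_tensor are valid NVTETensor handles
// and stream is a valid cudaStream_t, all created elsewhere.

// Create a GEMM configuration
NVTEMatmulConfig config = nvte_create_matmul_config();

// Attach the bias tensor; attribute values are passed by pointer and size
nvte_set_matmul_config_attribute(config, kNVTEMatmulConfigBiasTensor,
                                 &bias_tensor, sizeof(bias_tensor));

// Execute D = alpha * op(A) * op(B) + beta * C with no transposes.
// With beta = 0 the C operand contributes nothing, so D is passed for C here.
const float alpha = 1.0f;
const float beta = 0.0f;
nvte_cublas_gemm_v2(/*transa=*/0, /*transb=*/0, &alpha, A, B, &beta,
                    /*C=*/D, D, workspace, config, stream);

// Release the configuration
nvte_destroy_matmul_config(config);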

Related Pages

Environment:NVIDIA_TransformerEngine_CUDA_Toolkit_Requirements
