Implementation:NVIDIA TransformerEngine GEMM C API
| Field | Value |
|---|---|
| Sources | TransformerEngine |
| Domains | Deep_Learning, Optimization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Declares the C API for matrix multiplication (GEMM) operations backed by cuBLAS/cuBLASLt, supporting both standard and grouped (batched) GEMM with configurable epilogues for bias, GELU, and FP8 split accumulation.
Description
gemm.h defines the foundational compute primitive for all linear layers in TransformerEngine. It provides:
- NVTEMatmulConfig: Opaque configuration type with attribute enums for bias tensor, dbias tensor, GELU/dGELU epilogues, split accumulator, and SM count.
- NVTEGroupedMatmulConfig: Configuration for grouped GEMM with average M/N/K hints for cuBLASLt algorithm selection.
- nvte_cublas_gemm: (deprecated) Standard GEMM interface.
- nvte_cublas_gemm_v2: Successor with alpha/beta scaling and C matrix support.
- nvte_cublas_grouped_gemm: Grouped GEMM for batched operations.
- C++ RAII wrappers: MatmulConfigWrapper and GroupedMatmulConfigWrapper for safe resource management.
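The RAII idea behind the wrappers can be illustrated with a minimal sketch. It is not the library's MatmulConfigWrapper: the class name ScopedMatmulConfig, the live-handle counter, and the bodies of the create/destroy functions are hypothetical stand-ins so the example compiles without TransformerEngine or CUDA. Only the two function names and the handle type come from the signature section.

```cpp
#include <cassert>

// Hypothetical stand-ins for the C API so the sketch builds on its own.
// Real code would include "transformer_engine/gemm.h" instead.
typedef void *NVTEMatmulConfig;
static int g_live_configs = 0;  // handles created but not yet destroyed

NVTEMatmulConfig nvte_create_matmul_config() {
  ++g_live_configs;
  return new int(0);
}

void nvte_destroy_matmul_config(NVTEMatmulConfig config) {
  delete static_cast<int *>(config);
  --g_live_configs;
}

// Illustrative RAII owner: creates the handle on construction and destroys
// it exactly once on destruction.
class ScopedMatmulConfig {
 public:
  ScopedMatmulConfig() : config_(nvte_create_matmul_config()) {}
  ~ScopedMatmulConfig() { reset(); }

  // A handle has a single owner: copying is forbidden, moving is allowed.
  ScopedMatmulConfig(const ScopedMatmulConfig &) = delete;
  ScopedMatmulConfig &operator=(const ScopedMatmulConfig &) = delete;
  ScopedMatmulConfig(ScopedMatmulConfig &&other) noexcept : config_(other.config_) {
    other.config_ = nullptr;
  }
  ScopedMatmulConfig &operator=(ScopedMatmulConfig &&other) noexcept {
    if (this != &other) {
      reset();
      config_ = other.config_;
      other.config_ = nullptr;
    }
    return *this;
  }

  // Borrow the raw handle to pass to the C functions.
  NVTEMatmulConfig get() const { return config_; }

 private:
  void reset() {
    if (config_ != nullptr) {
      nvte_destroy_matmul_config(config_);
      config_ = nullptr;
    }
  }
  NVTEMatmulConfig config_ = nullptr;
};

// Returns how many handles are alive while the scope is still open.
int live_configs_inside_scope() {
  ScopedMatmulConfig a;
  ScopedMatmulConfig b(static_cast<ScopedMatmulConfig &&>(a));  // ownership moves to b
  assert(a.get() == nullptr);
  return g_live_configs;
}
```

With an owner like this, an early return or an exception between creation and the GEMM call does not leak the configuration, which is the property the "safe resource management" wording refers to.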
Usage
Every forward and backward pass through projection, feed-forward, and attention output layers routes through these GEMM functions, which makes this one of the most performance-critical APIs in the library.
Code Reference
Source Location
- Repository: NVIDIA/TransformerEngine
- File: transformer_engine/common/include/transformer_engine/gemm.h
- Lines: 1-509
Signature
typedef void *NVTEMatmulConfig;
enum NVTEMatmulConfigAttribute {
  kNVTEMatmulConfigBiasTensor = 0,
  kNVTEMatmulConfigDBiasTensor = 1,
  kNVTEMatmulConfigWithGELUEpilogue = 2,
  kNVTEMatmulConfigWithDGELUEpilogue = 3,
  kNVTEMatmulConfigEpilogueAuxTensor = 4,
  kNVTEMatmulConfigUseSplitAccumulator = 5,
  kNVTEMatmulConfigSMCount = 6,
};
NVTEMatmulConfig nvte_create_matmul_config();
void nvte_destroy_matmul_config(NVTEMatmulConfig config);
void nvte_set_matmul_config_attribute(NVTEMatmulConfig config,
                                      NVTEMatmulConfigAttribute attr,
                                      const void *buf, size_t size_in_bytes);
void nvte_cublas_gemm_v2(const NVTETensor A, const NVTETensor B,
                         NVTETensor D, NVTETensor workspace,
                         NVTEMatmulConfig config, cudaStream_t stream);
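The attribute setter is type-erased: the caller passes a pointer and a byte count, so a mismatched size is easy to write by hand. The sketch below shows one way a typed helper could derive the size from the value. Everything except the setter's name, its parameter list, and the two enum values is an assumption made for illustration: MockMatmulConfig, the helper set_attribute, the size check, and the value types (bool for the split accumulator flag, int for the SM count) are hypothetical and not confirmed by this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>

// Hypothetical stand-ins: a mock configuration holding two attributes.
// The value types (bool, int) are assumptions for this sketch.
enum NVTEMatmulConfigAttribute {
  kNVTEMatmulConfigUseSplitAccumulator = 5,
  kNVTEMatmulConfigSMCount = 6,
};
struct MockMatmulConfig {
  bool use_split_accumulator = false;
  int sm_count = 0;
};
typedef void *NVTEMatmulConfig;

// Mock setter: copies size_in_bytes bytes from buf into the selected field
// and rejects a byte count that does not match the field's type.
void nvte_set_matmul_config_attribute(NVTEMatmulConfig config,
                                      NVTEMatmulConfigAttribute attr,
                                      const void *buf, std::size_t size_in_bytes) {
  MockMatmulConfig *c = static_cast<MockMatmulConfig *>(config);
  switch (attr) {
    case kNVTEMatmulConfigUseSplitAccumulator:
      if (size_in_bytes != sizeof(bool)) throw std::invalid_argument("bad size");
      std::memcpy(&c->use_split_accumulator, buf, size_in_bytes);
      break;
    case kNVTEMatmulConfigSMCount:
      if (size_in_bytes != sizeof(int)) throw std::invalid_argument("bad size");
      std::memcpy(&c->sm_count, buf, size_in_bytes);
      break;
  }
}

// Typed helper: the byte count is always sizeof(T), so it cannot drift away
// from the value that is actually passed.
template <typename T>
void set_attribute(NVTEMatmulConfig config, NVTEMatmulConfigAttribute attr,
                   const T &value) {
  nvte_set_matmul_config_attribute(config, attr, &value, sizeof(T));
}

// Small usage demo: configure the mock and read the value back.
int demo_sm_count() {
  MockMatmulConfig cfg;
  set_attribute(&cfg, kNVTEMatmulConfigSMCount, 16);
  assert(cfg.sm_count == 16);
  return cfg.sm_count;
}
```

The helper does not make the call type-safe on its own, since passing a bool where an int is expected still compiles. In this mock the size check turns that mistake into an exception instead of a silent partial write.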
Import
#include "transformer_engine/gemm.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| A | NVTETensor | Yes | First matrix operand |
| B | NVTETensor | Yes | Second matrix operand |
| workspace | NVTETensor | Yes | Workspace tensor passed to the GEMM call |
| config | NVTEMatmulConfig | Yes | GEMM configuration (bias, epilogues, etc.) |
| stream | cudaStream_t | Yes | CUDA stream for execution |
Outputs
| Name | Type | Description |
|---|---|---|
| D | NVTETensor | Result matrix D = A * B (with optional epilogue) |
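To make the contract concrete, the following CPU reference sketches what "D = A * B with optional epilogue" could compute. It is a plain illustration, not the library's implementation. Several points are assumptions: the row-major layout, the standard BLAS convention D = alpha * A * B + beta * C for the alpha/beta scaling mentioned in the description, a bias added per output column, and the exact erf-based GELU (the fused GPU epilogue may use a different variant, such as a tanh approximation). The names reference_gemm and gelu are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Exact erf-based GELU. Assumption: the GPU epilogue may use another variant.
inline float gelu(float x) {
  return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// Row-major CPU reference:
//   D[m][n] = act(alpha * sum_k A[m][k] * B[k][n] + beta * C[m][n] + bias[n])
// A is M x K, B is K x N, C and D are M x N. C and bias may be empty, in
// which case the corresponding term is skipped. act is GELU when with_gelu
// is true and the identity otherwise.
std::vector<float> reference_gemm(const std::vector<float> &A,
                                  const std::vector<float> &B,
                                  std::size_t M, std::size_t N, std::size_t K,
                                  float alpha = 1.0f, float beta = 0.0f,
                                  const std::vector<float> &C = {},
                                  const std::vector<float> &bias = {},
                                  bool with_gelu = false) {
  assert(A.size() == M * K && B.size() == K * N);
  assert(C.empty() || C.size() == M * N);
  assert(bias.empty() || bias.size() == N);

  std::vector<float> D(M * N, 0.0f);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        acc += A[m * K + k] * B[k * N + n];
      }
      float out = alpha * acc;
      if (!C.empty()) out += beta * C[m * N + n];
      if (!bias.empty()) out += bias[n];
      D[m * N + n] = with_gelu ? gelu(out) : out;
    }
  }
  return D;
}
```

A reference like this is mainly useful as a test oracle: run the GPU GEMM and the CPU loop on the same small inputs and compare within a tolerance suited to the data type.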
Usage Examples
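#include "transformer_engine/gemm.h"
// A, B, D, workspace, and bias_tensor are NVTETensor handles and stream is a
// cudaStream_t; all are assumed to have been created by the caller beforehand.
// Create and configure a GEMM
NVTEMatmulConfig config = nvte_create_matmul_config();
// Attach the bias tensor so the bias epilogue is applied
nvte_set_matmul_config_attribute(config, kNVTEMatmulConfigBiasTensor,
                                 &bias_tensor, sizeof(bias_tensor));
// Execute GEMM
nvte_cublas_gemm_v2(A, B, D, workspace, config, stream);
// Cleanup
nvte_destroy_matmul_config(config);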