Implementation:NVIDIA TransformerEngine GEMM C API

From Leeroopedia


| Field | Value |
|---|---|
| Sources | TransformerEngine |
| Domains | Deep_Learning, Optimization |
| Last Updated | 2026-02-07 14:00 GMT |

Overview

Declares the C API for matrix multiplication (GEMM) operations backed by cuBLAS/cuBLASLt, supporting both standard and grouped (batched) GEMM with configurable epilogues for bias, GELU, and FP8 split accumulation.

Description

gemm.h defines the foundational compute primitive for all linear layers in TransformerEngine. It provides:

  • NVTEMatmulConfig: Opaque configuration type with attribute enums for bias tensor, dbias tensor, GELU/dGELU epilogues, split accumulator, and SM count.
  • NVTEGroupedMatmulConfig: Configuration for grouped GEMM with average M/N/K hints for cuBLASLt algorithm selection.
  • nvte_cublas_gemm: Standard GEMM interface, deprecated in favor of nvte_cublas_gemm_v2.
  • nvte_cublas_gemm_v2: Successor to nvte_cublas_gemm with alpha/beta scaling and C matrix support.
  • nvte_cublas_grouped_gemm: Grouped GEMM for batched operations.
  • C++ RAII wrappers: MatmulConfigWrapper and GroupedMatmulConfigWrapper for safe resource management.
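
The grouped GEMM entry in the list above runs several independent matrix products in one call, and its configuration carries average M/N/K hints. The following is a minimal CPU sketch of that idea, assuming row-major float matrices. The names GroupShape, AverageHints, average_hints, and grouped_gemm_reference are hypothetical and do not appear in gemm.h, and the sketch does not reproduce how cuBLASLt uses the hints to pick an algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical shape of one problem in the group: D (m x n) = A (m x k) * B (k x n).
struct GroupShape {
  int64_t m, n, k;
};

// Hypothetical record for the "average M/N/K" hints described in the text.
struct AverageHints {
  int64_t avg_m, avg_n, avg_k;
};

// Integer average of each dimension across the group (truncating division).
inline AverageHints average_hints(const std::vector<GroupShape> &shapes) {
  AverageHints h{0, 0, 0};
  if (shapes.empty()) return h;
  for (const GroupShape &s : shapes) {
    h.avg_m += s.m;
    h.avg_n += s.n;
    h.avg_k += s.k;
  }
  const int64_t count = static_cast<int64_t>(shapes.size());
  h.avg_m /= count;
  h.avg_n /= count;
  h.avg_k /= count;
  return h;
}

// CPU reference: each group member is an independent row-major GEMM.
inline void grouped_gemm_reference(const std::vector<GroupShape> &shapes,
                                   const std::vector<std::vector<float>> &A,
                                   const std::vector<std::vector<float>> &B,
                                   std::vector<std::vector<float>> &D) {
  D.resize(shapes.size());
  for (size_t g = 0; g < shapes.size(); ++g) {
    const size_t m = static_cast<size_t>(shapes[g].m);
    const size_t n = static_cast<size_t>(shapes[g].n);
    const size_t k = static_cast<size_t>(shapes[g].k);
    D[g].assign(m * n, 0.0f);
    for (size_t i = 0; i < m; ++i) {
      for (size_t j = 0; j < n; ++j) {
        float acc = 0.0f;
        for (size_t p = 0; p < k; ++p) acc += A[g][i * k + p] * B[g][p * n + j];
        D[g][i * n + j] = acc;
      }
    }
  }
}
```

In this sketch the hints are plain arithmetic means; a real caller would pass whatever values the grouped configuration expects.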

Usage

Forward and backward passes through projection, feed-forward, and attention output layers all route through these GEMM functions, so this API sits on the performance-critical path of the library.

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/common/include/transformer_engine/gemm.h
Lines: 1-509

Signature

typedef void *NVTEMatmulConfig;

enum NVTEMatmulConfigAttribute {
  kNVTEMatmulConfigBiasTensor = 0,
  kNVTEMatmulConfigDBiasTensor = 1,
  kNVTEMatmulConfigWithGELUEpilogue = 2,
  kNVTEMatmulConfigWithDGELUEpilogue = 3,
  kNVTEMatmulConfigEpilogueAuxTensor = 4,
  kNVTEMatmulConfigUseSplitAccumulator = 5,
  kNVTEMatmulConfigSMCount = 6,
};

NVTEMatmulConfig nvte_create_matmul_config();
void nvte_destroy_matmul_config(NVTEMatmulConfig config);
void nvte_set_matmul_config_attribute(NVTEMatmulConfig config,
                                      NVTEMatmulConfigAttribute attr,
                                      const void *buf, size_t size_in_bytes);

void nvte_cublas_gemm_v2(const NVTETensor A, const NVTETensor B,
                         NVTETensor D, NVTETensor workspace,
                         NVTEMatmulConfig config, cudaStream_t stream);
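
The setter above takes its value as an untyped buffer plus a byte count. The sketch below illustrates one way such a type-erased setter behind an opaque handle can validate and copy a value. Every name in it (DemoAttribute, DemoConfig, DemoConfigHandle, and the demo_* functions) is hypothetical, and it is not TransformerEngine's implementation. In particular, the real setter returns void; the bool return here exists only to make the size check visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical attribute ids, loosely modeled on the enum above.
enum DemoAttribute {
  kDemoUseSplitAccumulator = 0,  // value is a bool
  kDemoSMCount = 1,              // value is an int
};

// Hypothetical struct hidden behind the opaque handle.
struct DemoConfig {
  bool use_split_accumulator = false;
  int sm_count = 0;
};

typedef void *DemoConfigHandle;

inline DemoConfigHandle demo_create_config() { return new DemoConfig(); }

inline void demo_destroy_config(DemoConfigHandle h) {
  delete static_cast<DemoConfig *>(h);
}

// Copies size_in_bytes from buf into the field selected by attr.
// Returns false if the handle or buffer is null, or if the size
// does not match the size of the selected field.
inline bool demo_set_attribute(DemoConfigHandle h, DemoAttribute attr,
                               const void *buf, size_t size_in_bytes) {
  if (h == nullptr || buf == nullptr) return false;
  DemoConfig *cfg = static_cast<DemoConfig *>(h);
  switch (attr) {
    case kDemoUseSplitAccumulator:
      if (size_in_bytes != sizeof(bool)) return false;
      std::memcpy(&cfg->use_split_accumulator, buf, sizeof(bool));
      return true;
    case kDemoSMCount:
      if (size_in_bytes != sizeof(int)) return false;
      std::memcpy(&cfg->sm_count, buf, sizeof(int));
      return true;
  }
  return false;
}
```

Passing the size alongside the pointer lets the callee reject a value of the wrong width instead of reading past the caller's buffer.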

Import

#include "transformer_engine/gemm.h"

I/O Contract

Inputs

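| Name | Type | Required | Description |
|---|---|---|---|
| A | NVTETensor | Yes | First matrix operand |
| B | NVTETensor | Yes | Second matrix operand |
| workspace | NVTETensor | Yes | Workspace tensor passed to the GEMM call |
| config | NVTEMatmulConfig | Yes | GEMM configuration (bias, epilogues, etc.) |
| stream | cudaStream_t | Yes | CUDA stream for execution |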

Outputs

| Name | Type | Description |
|---|---|---|
| D | NVTETensor | Result matrix D = A * B (with optional epilogue) |
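
To make the output contract concrete, here is a CPU reference sketch of D = A * B followed by an optional bias add and GELU. It rests on several assumptions that the page does not state: row-major storage, a bias of length n added to every row, and the exact erf-based GELU formula. The function names gelu_reference and gemm_epilogue_reference are hypothetical, and the fused GPU epilogue may differ in layout and in the GELU formulation it uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Exact erf-based GELU (an assumption; a fused epilogue may use another form).
inline float gelu_reference(float x) {
  return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// CPU reference for D = A * B with an optional epilogue.
// A is m x k, B is k x n, D is m x n, all row-major.
// bias is either empty or holds n values added to every row (assumed layout).
inline std::vector<float> gemm_epilogue_reference(const std::vector<float> &A,
                                                  const std::vector<float> &B,
                                                  size_t m, size_t n, size_t k,
                                                  const std::vector<float> &bias,
                                                  bool with_gelu) {
  assert(A.size() == m * k && B.size() == k * n);
  assert(bias.empty() || bias.size() == n);
  std::vector<float> D(m * n, 0.0f);
  for (size_t i = 0; i < m; ++i) {
    for (size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (size_t p = 0; p < k; ++p) acc += A[i * k + p] * B[p * n + j];
      if (!bias.empty()) acc += bias[j];                      // bias epilogue
      D[i * n + j] = with_gelu ? gelu_reference(acc) : acc;   // GELU epilogue
    }
  }
  return D;
}
```

A reference like this is mainly useful for checking a fused GPU result against a small, easily inspected case.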

Usage Examples

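#include "transformer_engine/gemm.h"

// Assumes A, B, D, workspace, and bias_tensor are valid NVTETensor handles
// and stream is a valid cudaStream_t, all created elsewhere.

// Create an empty GEMM configuration
NVTEMatmulConfig config = nvte_create_matmul_config();

// Attach the bias tensor; the attribute value is passed by pointer and size
nvte_set_matmul_config_attribute(config, kNVTEMatmulConfigBiasTensor,
                                 &bias_tensor, sizeof(bias_tensor));

// Launch the GEMM on the given stream
nvte_cublas_gemm_v2(A, B, D, workspace, config, stream);

// Release the configuration once it is no longer needed
nvte_destroy_matmul_config(config);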
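
The manual create/destroy pairing above is the kind of bookkeeping an RAII wrapper removes. The sketch below shows the general pattern with stand-in functions so that it compiles without the library. FakeConfig, fake_create_config, fake_destroy_config, and ScopedConfig are hypothetical names; this is not the library's MatmulConfigWrapper, whose actual interface may differ.

```cpp
#include <cassert>
#include <utility>

// Stand-ins for a create/destroy pair, so the sketch builds without the library.
typedef void *FakeConfig;
static int g_live_configs = 0;  // counts handles that have not been destroyed

inline FakeConfig fake_create_config() {
  ++g_live_configs;
  return new int(0);
}

inline void fake_destroy_config(FakeConfig c) {
  delete static_cast<int *>(c);
  --g_live_configs;
}

// Hypothetical RAII owner: creates in the constructor, destroys in the destructor.
class ScopedConfig {
 public:
  ScopedConfig() : config_(fake_create_config()) {}
  ~ScopedConfig() { reset(); }

  // Non-copyable, so each handle is destroyed exactly once.
  ScopedConfig(const ScopedConfig &) = delete;
  ScopedConfig &operator=(const ScopedConfig &) = delete;

  // Movable: the moved-from object gives up ownership.
  ScopedConfig(ScopedConfig &&other) noexcept : config_(other.config_) {
    other.config_ = nullptr;
  }
  ScopedConfig &operator=(ScopedConfig &&other) noexcept {
    if (this != &other) {
      reset();
      config_ = other.config_;
      other.config_ = nullptr;
    }
    return *this;
  }

  // Raw handle, for passing to a C function.
  FakeConfig get() const { return config_; }

 private:
  void reset() {
    if (config_ != nullptr) {
      fake_destroy_config(config_);
      config_ = nullptr;
    }
  }

  FakeConfig config_ = nullptr;
};
```

With a wrapper of this shape, the handle is released on every exit path from the scope, including early returns and exceptions.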
