Principle:FMInference FlexLLMGen GEMM Performance Testing

Knowledge Sources	FMInference_FlexLLMGen
Domains	Linear Algebra, Performance Tuning, GPU Computing
Last Updated	2026-02-09 12:00 GMT

Overview

Exhaustive empirical evaluation of matrix multiplication algorithm variants to identify the fastest implementation for specific matrix dimensions on a given GPU.

Description

General Matrix Multiply (GEMM) is the dominant computational primitive in transformer-based language models, accounting for the vast majority of floating-point operations during inference. GPU vendor libraries (cuBLAS, rocBLAS) provide multiple algorithm implementations for the same GEMM operation, each optimized for different matrix shapes, data types, and hardware generations. Because the relative performance of these algorithms depends on the specific matrix dimensions, the most reliable optimization strategy is empirical benchmarking: timing every available algorithm for the exact dimensions used in the target model and selecting the fastest.

The benchmarking methodology requires careful attention to measurement accuracy: warm-up iterations eliminate cold-start effects (instruction cache misses, GPU clock ramp-up), explicit device synchronization barriers ensure GPU work completion before timing, and multiple timed iterations reduce measurement noise.

Usage

Apply this principle when optimizing transformer inference for a specific model architecture and GPU. The three GEMM orientations tested (forward, backward-weights, backward-activations) correspond to the transpose configurations used in the linear layers of transformer blocks. While only the forward pass is needed for inference, all three are tested for completeness when the same infrastructure supports training.
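As an illustration of how the three orientations arise, the sketch below derives the (M, N, K) shape and transpose flags of each GEMM for a linear layer Y = X W^T. It assumes a row-major convention with X of shape [tokens, in_features] and W of shape [out_features, in_features]. The names GemmShape and linear_layer_gemms are hypothetical, and an actual cuBLAS test harness (which is column-major) may arrange its operands and flags differently.

```python
from typing import List, NamedTuple


class GemmShape(NamedTuple):
    name: str
    m: int          # rows of the output matrix
    n: int          # columns of the output matrix
    k: int          # reduction (inner) dimension
    trans_a: bool   # first operand is used transposed
    trans_b: bool   # second operand is used transposed


def linear_layer_gemms(tokens: int, in_features: int, out_features: int) -> List[GemmShape]:
    """Shapes of the three GEMMs for Y = X @ W.T (row-major, illustrative only)."""
    return [
        # Y = X @ W.T  -> [tokens, out_features]
        GemmShape("forward", tokens, out_features, in_features, False, True),
        # dW = dY.T @ X -> [out_features, in_features]
        GemmShape("backward-weights", out_features, in_features, tokens, True, False),
        # dX = dY @ W  -> [tokens, in_features]
        GemmShape("backward-activations", tokens, in_features, out_features, False, False),
    ]


if __name__ == "__main__":
    for shape in linear_layer_gemms(tokens=8, in_features=4096, out_features=16384):
        print(shape)
```

Each of the three shapes would then be benchmarked separately, since the fastest algorithm for one orientation is not necessarily the fastest for another.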

Theoretical Basis

GEMM Algorithm Landscape

cuBLAS exposes 17 tensor-core GEMM algorithm selectors: CUBLAS_GEMM_DEFAULT_TENSOR_OP, which lets the library choose heuristically, plus CUBLAS_GEMM_ALGO0_TENSOR_OP through CUBLAS_GEMM_ALGO15_TENSOR_OP, each of which requests a specific tiling strategy, data movement pattern, and warp-level computation schedule. For a given matrix size, the slowest algorithm can take more than twice as long as the fastest.
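The selection step can be sketched as a loop over the algorithm names given above. In this minimal sketch, time_algo is a hypothetical callback that runs and times one algorithm and returns None when the library rejects that algorithm for the given shape. Only the enumerant names come from the text. The callback, pick_fastest, and the fake timings are assumptions for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# Enumerant names as listed in the text; numeric values are deliberately not assumed.
TENSOR_OP_ALGOS = ["CUBLAS_GEMM_DEFAULT_TENSOR_OP"] + [
    f"CUBLAS_GEMM_ALGO{i}_TENSOR_OP" for i in range(16)
]


def pick_fastest(
    time_algo: Callable[[str], Optional[float]],
) -> Tuple[str, float, Dict[str, float]]:
    """Time every algorithm and return (best name, best time, all valid timings)."""
    results: Dict[str, float] = {}
    for name in TENSOR_OP_ALGOS:
        elapsed = time_algo(name)   # hypothetical hook: seconds, or None if unsupported
        if elapsed is not None:
            results[name] = elapsed
    if not results:
        raise RuntimeError("no algorithm ran successfully for this shape")
    best = min(results, key=results.get)
    return best, results[best], results


if __name__ == "__main__":
    # Fake timings standing in for real measurements.
    fake = {name: 2.0e-3 for name in TENSOR_OP_ALGOS}
    fake["CUBLAS_GEMM_ALGO5_TENSOR_OP"] = 0.9e-3
    best, seconds, timings = pick_fastest(fake.get)
    spread = max(timings.values()) / seconds
    print(best, seconds, f"worst/best = {spread:.2f}x")
```

Skipping unsupported algorithms rather than aborting keeps the sweep exhaustive over whatever the library accepts for that shape and data type.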

Key factors that influence algorithm selection:

Matrix dimensions (M, N, K): Large square matrices favor different tiling than tall-skinny or short-wide matrices.
Data type: FP16 and BF16 use tensor cores with different throughput characteristics than FP32.
Memory alignment: Some algorithms require specific alignment of leading dimensions for peak performance.
GPU architecture: Optimal algorithms differ between GPU generations (Volta, Ampere, Hopper).

Warm-Up and Measurement Methodology

GPU benchmarking requires a warm-up phase because:

Clock frequency scaling: An idle GPU runs at a reduced clock and ramps to boost frequency under sustained load, so the first few kernel launches may run slower than steady state.
Instruction cache: The first execution of a kernel loads instructions from device memory; subsequent executions use cached instructions.
Memory subsystem: TLB and L2 cache warm-up affects initial memory access latency.

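A warm-up of 5 iterations is typically sufficient to reach steady-state performance. Because kernel launches are asynchronous, a host-side timer must be bracketed by cudaDeviceSynchronize() calls: one before the timer starts, so that warm-up work is not attributed to the timed region, and one before it stops, so that all timed kernels have completed. CUDA events are a valid alternative; whichever method is chosen should be applied identically to every algorithm being compared.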
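The measurement procedure can be sketched as a small harness. This is a minimal sketch rather than the project's implementation: run and sync are hypothetical hooks supplied by the caller, where run launches one GEMM and sync blocks until the device is idle (on CUDA it would wrap cudaDeviceSynchronize()). The default of 5 warm-up iterations follows the text, while 20 timed iterations is an assumed value.

```python
import time
from typing import Callable


def benchmark(
    run: Callable[[], None],
    sync: Callable[[], None] = lambda: None,
    warmup: int = 5,
    iters: int = 20,
) -> float:
    """Return the mean wall-clock seconds per call of `run`, measured after warm-up."""
    for _ in range(warmup):
        run()                      # untimed: clocks ramp up, caches fill
    sync()                         # drain warm-up work before starting the timer
    start = time.perf_counter()
    for _ in range(iters):
        run()
    sync()                         # wait for all timed work before stopping the timer
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    # CPU stand-in workload; a real harness would launch a GPU GEMM here.
    mean_s = benchmark(lambda: sum(i * i for i in range(10_000)))
    print(f"mean time per call: {mean_s * 1e6:.1f} us")
```

Averaging over the whole timed loop, instead of timing each call individually, keeps the two synchronization barriers outside the per-iteration cost.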

Batched Strided GEMM

Multi-head attention in transformers requires computing multiple independent small GEMMs (one per attention head). Batched strided GEMM executes these in a single cuBLAS call, enabling the GPU to amortize kernel launch overhead and potentially exploit parallelism across batch elements. The optimal algorithm for batched GEMM may differ from the optimal algorithm for a single GEMM of equivalent total FLOP count, because the internal scheduling differs.
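To make the strided layout concrete, the following pure-Python reference computes C[i] = A[i] B[i] for every batch element, with each matrix located at a fixed stride from the previous one in a flat buffer. It is a sketch of the semantics only: it assumes row-major storage without transposes or alpha/beta scaling (cuBLAS itself is column-major), and strided_batched_gemm is a hypothetical name, not a library call.

```python
from typing import List, Sequence


def strided_batched_gemm(
    a: Sequence[float], b: Sequence[float],
    m: int, n: int, k: int,
    stride_a: int, stride_b: int, stride_c: int,
    batch: int,
) -> List[float]:
    """Row-major reference: C[i] = A[i] @ B[i], matrices packed at fixed strides."""
    c = [0.0] * (stride_c * batch)
    for i in range(batch):
        # Offset of the i-th matrix inside each flat buffer.
        a0, b0, c0 = i * stride_a, i * stride_b, i * stride_c
        for row in range(m):
            for col in range(n):
                acc = 0.0
                for x in range(k):
                    acc += a[a0 + row * k + x] * b[b0 + x * n + col]
                c[c0 + row * n + col] = acc
    return c


if __name__ == "__main__":
    # Two "heads", each a 2x2 product, stored back to back (stride = 4 elements).
    a = [1, 2, 3, 4,   5, 6, 7, 8]
    b = [5, 6, 7, 8,   1, 0, 0, 1]
    print(strided_batched_gemm(a, b, 2, 2, 2, 4, 4, 4, batch=2))
```

In the attention case, the batch count would correspond to the number of heads (times the batch size), and the strides to the per-head matrix size.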

Related Pages

Implementation:FMInference_FlexLLMGen_DeepSpeed_GEMM_Test
