Implementation:Deepspeedai DeepSpeed GEMM Test

Knowledge Sources	DeepSpeed
Domains	Performance_Testing, Linear_Algebra, CUDA_Kernels, Optimization
Last Updated	2026-02-09 00:00 GMT

Overview

Performance testing utilities for CUDA GEMM operations that benchmark different cuBLAS algorithms to find optimal settings for transformer workloads.

Description

The GEMM Test framework provides two templated classes, GemmTest and StridedGemmTest, that time each available cuBLAS GEMM algorithm for a given set of matrix dimensions and report the fastest one. GemmTest covers standard GEMM; StridedGemmTest covers the strided batched GEMM used extensively in transformer attention mechanisms. For each algorithm, the framework runs warm-up iterations followed by timed iterations and compares the average latency of the timed runs. The sweep is repeated for each of three operation types: forward, backward weight gradient, and backward activation gradient. On NVIDIA it covers CUBLAS_GEMM_DEFAULT_TENSOR_OP through CUBLAS_GEMM_ALGO15_TENSOR_OP; on AMD it uses the rocBLAS equivalents. Each sweep reports the fastest algorithm ID along with its execution time.
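
The warm-up-then-time sweep described above can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the code in gemm_test.h: sweep_algorithms, SweepResult, and the injected now_seconds clock are hypothetical names. A GPU version would also need to synchronize the device before reading the clock, because kernel launches are asynchronous.

```cpp
#include <cassert>
#include <limits>

// Hypothetical result type: the fastest algorithm ID and its average latency.
struct SweepResult {
    int algo;
    double avg_ms;
};

// Hypothetical helper (not a DeepSpeed API). Tries every algorithm ID in
// [first_algo, last_algo], makes `warmup` untimed calls, then averages
// `loops` timed calls. `run(algo)` executes the workload once and
// `now_seconds()` returns a monotonic time in seconds.
template <typename Run, typename Clock>
SweepResult sweep_algorithms(int first_algo, int last_algo,
                             int warmup, int loops,
                             Run run, Clock now_seconds)
{
    assert(first_algo <= last_algo && loops > 0 && warmup >= 0);

    SweepResult best{first_algo, std::numeric_limits<double>::max()};
    for (int algo = first_algo; algo <= last_algo; ++algo) {
        // Warm-up calls absorb one-time costs such as lazy initialization.
        for (int i = 0; i < warmup; ++i) run(algo);

        const double start = now_seconds();
        for (int i = 0; i < loops; ++i) run(algo);
        const double avg_ms = (now_seconds() - start) * 1000.0 / loops;

        // Strict '<' keeps the lowest ID when two algorithms tie.
        if (avg_ms < best.avg_ms) best = SweepResult{algo, avg_ms};
    }
    return best;
}
```

Injecting the clock keeps the selection logic testable without a GPU; in practice now_seconds could wrap std::chrono::steady_clock.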

Usage

Run these test utilities during model initialization, or as a separate auto-tuning phase, to find the fastest GEMM algorithms for your hardware and model dimensions. The discovered algorithm IDs can then be passed to the transformer layers for training or inference, so that each GEMM runs with the algorithm that measured fastest. The results are specific to the GPU, data type, and matrix shapes tested, so re-run the tests when any of these change.

Code Reference

Source Location

Repository: DeepSpeed
File: csrc/includes/gemm_test.h

Signature

template <typename T>
class GemmTest {
public:
    GemmTest(int m, int n, int k,
            cublasOperation_t ta, cublasOperation_t tb,
            cublasHandle_t h);
    ~GemmTest();

    // Returns [algo_forward, algo_backward1, algo_backward2]
    std::array<int, 3> TestAlgo(int loops);

    template <typename Func>
    int Run(int loops, Func f);
};

template <typename T>
class StridedGemmTest {
public:
    StridedGemmTest(int b, int m, int n, int k,
                   cublasOperation_t ta, cublasOperation_t tb,
                   cublasHandle_t h);
    ~StridedGemmTest();

    std::array<int, 3> TestAlgo(int loops);
};

Import

#include "csrc/includes/gemm_test.h"

I/O Contract

Input	Type	Description
m, n, k	int	Matrix dimensions (m×k) × (k×n) = (m×n)
b	int	Batch size (for strided variant)
ta, tb	cublasOperation_t	Transpose flags for matrices A and B
h	cublasHandle_t	cuBLAS handle used for all test calls
loops	int	Number of timed iterations per algorithm (run after the warm-up iterations)

Output	Type	Description
algorithm_ids	std::array<int,3>	Fastest measured algorithm IDs [forward, bwd_wgrad, bwd_dgrad]
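
The reported execution time is easier to compare across shapes when it is converted to throughput. The sketch below counts the floating-point operations for the (m×k) × (k×n) product from the table and turns an average latency into TFLOP/s. Both gemm_flops and tflops are hypothetical helpers introduced here for illustration; they are not part of gemm_test.h.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in `batch` products of shape (m x k) * (k x n):
// each of the m*n outputs needs k multiplies and k adds.
inline std::int64_t gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k,
                               std::int64_t batch = 1)
{
    assert(m > 0 && n > 0 && k > 0 && batch > 0);
    return 2 * m * n * k * batch;
}

// Hypothetical helper: converts an average latency in milliseconds
// into TFLOP/s for the given operation count.
inline double tflops(std::int64_t flops, double avg_ms)
{
    assert(avg_ms > 0.0);
    return static_cast<double>(flops) / (avg_ms * 1e-3) / 1e12;
}
```

For example, a strided batched GEMM with b = 512 and m = n = 512, k = 64 performs 17,179,869,184 operations, so an average latency of 1 ms would correspond to roughly 17.2 TFLOP/s.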

Usage Examples

Standard GEMM Testing:

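#include <cstdio>        // printf
#include <cublas_v2.h>   // cublasHandle_t, cublasCreate
#include <cuda_fp16.h>   // __half
#include "csrc/includes/gemm_test.h"

cublasHandle_t handle;
cublasCreate(&handle);  // check the returned cublasStatus_t in real code

// Test attention QK^T multiplication: (batch*heads*seq×head_dim) × (head_dim×seq)
int M = 12288;  // batch * heads * seq_length (e.g. 1 * 12 * 1024)
int N = 1024;   // sequence_length
int K = 64;     // head_dimension

GemmTest<__half> test(M, N, K,
                     CUBLAS_OP_T, CUBLAS_OP_N,
                     handle);

auto algos = test.TestAlgo(100);  // 100 timing loops
printf("Best algorithms: FW=%d, BWD1=%d, BWD2=%d\n",
       algos[0], algos[1], algos[2]);

// Call cublasDestroy(handle) once every test sharing this handle has finished.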

Strided Batched GEMM for Attention:

// Test attention score computation across all heads.
// Assumes `handle` is a valid cublasHandle_t, created as in the standard GEMM example.
int batch = 32;
int heads = 16;
int seq_length = 512;
int head_dim = 64;

int bsz = batch * heads;     // 512 batches
int M = seq_length;          // 512
int N = seq_length;          // 512
int K = head_dim;            // 64

StridedGemmTest<__half> attn_test(bsz, M, N, K,
                                  CUBLAS_OP_T, CUBLAS_OP_N,
                                  handle);

auto attn_algos = attn_test.TestAlgo(50);
// Use attn_algos for attention kernels
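
A strided batched GEMM addresses each matrix in the batch by a fixed element offset from the previous one. The sketch below shows the strides and total buffer size for a densely packed layout. It is an illustration only: StridedLayout, dense_strides, and total_bytes are hypothetical names, and the assumption that all three operands are packed back to back is not taken from gemm_test.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of a densely packed strided-batched GEMM:
// batch i reads A + i*stride_a and B + i*stride_b, and writes C + i*stride_c.
struct StridedLayout {
    std::int64_t stride_a;  // elements between consecutive A matrices (m*k)
    std::int64_t stride_b;  // elements between consecutive B matrices (k*n)
    std::int64_t stride_c;  // elements between consecutive C matrices (m*n)
};

inline StridedLayout dense_strides(std::int64_t m, std::int64_t n, std::int64_t k)
{
    assert(m > 0 && n > 0 && k > 0);
    return StridedLayout{m * k, k * n, m * n};
}

// Total bytes for the A, B, and C buffers across `batch` matrices.
inline std::int64_t total_bytes(const StridedLayout& s, std::int64_t batch,
                                std::size_t elem_size)
{
    assert(batch > 0 && elem_size > 0);
    const std::int64_t elems = (s.stride_a + s.stride_b + s.stride_c) * batch;
    return elems * static_cast<std::int64_t>(elem_size);
}
```

For the shapes in the example above (512 batches, M = N = 512, K = 64, 2-byte half elements), this dense layout needs 335,544,320 bytes, or 320 MiB, which is worth checking against free device memory before tuning large shapes.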

Auto-Tuning at Model Init:

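#include <array>
#include <vector>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include "csrc/includes/gemm_test.h"

// Returns the tuned IDs in this order: QKV projection, first feed-forward
// projection, attention scores. Returns an empty vector if cuBLAS fails
// to initialize.
std::vector<std::array<int, 3>> tune_gemm_algorithms(
    int hidden_size, int intermediate_size,
    int batch_size, int seq_length, int num_heads)
{
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return {};

    std::vector<std::array<int, 3>> algos;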

    // Test QKV projection
    GemmTest<__half> qkv_test(batch_size * seq_length,
                              3 * hidden_size, hidden_size,
                              CUBLAS_OP_N, CUBLAS_OP_N, handle);
    algos.push_back(qkv_test.TestAlgo(20));

    // Test the first feed-forward projection (hidden_size -> intermediate_size)
    GemmTest<__half> ff1_test(batch_size * seq_length,
                             intermediate_size, hidden_size,
                             CUBLAS_OP_N, CUBLAS_OP_N, handle);
    algos.push_back(ff1_test.TestAlgo(20));

    // Test attention scores
    int head_dim = hidden_size / num_heads;
    StridedGemmTest<__half> attn_test(batch_size * num_heads,
                                      seq_length, seq_length, head_dim,
                                      CUBLAS_OP_T, CUBLAS_OP_N, handle);
    algos.push_back(attn_test.TestAlgo(20));

    cublasDestroy(handle);
    return algos;
}
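
Because each sweep runs every algorithm many times, it can be worth tuning each distinct shape only once and reusing the result for every layer that shares it. The sketch below is one possible way to do that; AlgoCache is a hypothetical class, not part of DeepSpeed, and its tuner callback stands in for a call such as constructing a GemmTest and invoking TestAlgo. A fuller key would also need the transpose flags, data type, and device.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <tuple>
#include <utility>

// Hypothetical cache keyed by GEMM shape (not a DeepSpeed API).
class AlgoCache {
public:
    using Key = std::tuple<int, int, int>;             // (m, n, k)
    using Algos = std::array<int, 3>;                  // [forward, bwd1, bwd2]
    using Tuner = std::function<Algos(int, int, int)>; // runs the actual sweep

    explicit AlgoCache(Tuner tuner) : tuner_(std::move(tuner)) {}

    // Returns the cached IDs for this shape, calling the tuner only on a miss.
    const Algos& get(int m, int n, int k)
    {
        assert(m > 0 && n > 0 && k > 0);
        const Key key{m, n, k};
        auto it = cache_.find(key);
        if (it == cache_.end()) it = cache_.emplace(key, tuner_(m, n, k)).first;
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    Tuner tuner_;
    std::map<Key, Algos> cache_;
};
```

With this in place, repeated transformer layers that share hidden_size and intermediate_size would trigger one sweep per shape rather than one per layer.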

Related Pages

Inference cuBLAS Wrappers - Uses discovered algorithm IDs
Transformer CUDA - Applies optimal algorithms in transformer layers
Inference Context - Stores algorithm configurations
