Implementation:Deepspeedai DeepSpeed GEMM Test

From Leeroopedia


Knowledge Sources
Domains: Performance_Testing, Linear_Algebra, CUDA_Kernels, Optimization
Last Updated: 2026-02-09 00:00 GMT

Overview

Performance testing utilities for CUDA GEMM operations that benchmark different cuBLAS algorithms to find optimal settings for transformer workloads.

Description

The GEMM Test framework provides templated classes (GemmTest and StridedGemmTest) for systematically evaluating different cuBLAS GEMM algorithms to identify the fastest configuration for specific matrix dimensions. It performs warm-up iterations followed by timed runs to measure average latency for each algorithm variant. The implementation supports both standard GEMM operations and strided batched GEMM used extensively in transformer attention mechanisms. For each operation type (forward, backward weight gradient, backward activation gradient), the framework iterates through available cuBLAS algorithms (CUBLAS_GEMM_DEFAULT_TENSOR_OP through CUBLAS_GEMM_ALGO15_TENSOR_OP on NVIDIA, or rocBLAS equivalents on AMD) and reports the fastest algorithm ID along with its execution time.
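The warm-up-then-time sweep described above can be sketched on the host as follows. This is a minimal illustration, not DeepSpeed's code: the name `pick_fastest`, the warm-up count of 5, and the use of `std::chrono` instead of a GPU-aware timer are all assumptions made so the example runs without CUDA.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <limits>

// Host-only analogue of an algorithm sweep: for each candidate ID, run a few
// untimed warm-up calls, then time `loops` calls and keep the lowest average.
// `pick_fastest` and the warm-up count are hypothetical, not DeepSpeed names.
template <typename Func>
int pick_fastest(int first_algo, int last_algo, int loops, Func f)
{
    assert(loops > 0 && first_algo <= last_algo);
    using Clock = std::chrono::steady_clock;

    const int warm_up = 5;  // assumed value
    float best_ms = std::numeric_limits<float>::max();
    int best_algo = first_algo;

    for (int algo = first_algo; algo <= last_algo; ++algo) {
        for (int i = 0; i < warm_up; ++i) f(algo);  // untimed warm-up

        // On a GPU, the device must be synchronized here and again before
        // the timer stops, because kernel launches are asynchronous.
        const auto start = Clock::now();
        for (int i = 0; i < loops; ++i) f(algo);
        const auto stop = Clock::now();

        const float avg_ms =
            std::chrono::duration<float, std::milli>(stop - start).count() / loops;
        std::printf("algo-%d: %.3f ms\n", algo, avg_ms);

        if (avg_ms < best_ms) {
            best_ms = avg_ms;
            best_algo = algo;
        }
    }

    std::printf("fastest algo %d: %.3f ms\n", best_algo, best_ms);
    return best_algo;
}
```

With cuBLAS, the ID range would run from CUBLAS_GEMM_DEFAULT_TENSOR_OP to CUBLAS_GEMM_ALGO15_TENSOR_OP, and `f` would launch the GEMM under test with the given algorithm.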

Usage

Use these test utilities during model initialization or in an auto-tuning phase to determine the fastest GEMM algorithms for your specific hardware and model dimensions. The discovered algorithm IDs can then be passed to transformer layers for production inference or training. Because the results are measured rather than predicted, they apply only to the GPU, data type, and matrix shapes that were tested.
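One way to avoid re-tuning the same shape is to cache the discovered IDs. The sketch below is hypothetical and not part of DeepSpeed: `GemmAlgoCache` and `get_or_tune` are illustrative names, and the key deliberately ignores transpose flags and data type to keep the example short.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>

// Hypothetical cache, not part of DeepSpeed: maps a GEMM shape to the three
// algorithm IDs found for it, so each shape is tuned at most once.
class GemmAlgoCache {
public:
    using Shape = std::tuple<int, int, int>;  // (m, n, k)
    using Algos = std::array<int, 3>;         // [forward, bwd_wgrad, bwd_dgrad]

    // Returns the cached IDs for (m, n, k), calling `tune` only on a miss.
    template <typename TuneFn>
    const Algos& get_or_tune(int m, int n, int k, TuneFn tune)
    {
        assert(m > 0 && n > 0 && k > 0);
        const Shape key{m, n, k};
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            it = cache_.emplace(key, tune(m, n, k)).first;
        }
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::map<Shape, Algos> cache_;
};
```

In a CUDA build, the `tune` callable would typically construct a GemmTest<T> for the given shape and return the result of TestAlgo(loops).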

Code Reference

Source Location

Signature

template <typename T>
class GemmTest {
public:
    GemmTest(int m, int n, int k,
            cublasOperation_t ta, cublasOperation_t tb,
            cublasHandle_t h);
    ~GemmTest();

    // Returns [algo_forward, algo_backward1, algo_backward2]
    std::array<int, 3> TestAlgo(int loops);

    template <typename Func>
    int Run(int loops, Func f);
};

template <typename T>
class StridedGemmTest {
public:
    StridedGemmTest(int b, int m, int n, int k,
                   cublasOperation_t ta, cublasOperation_t tb,
                   cublasHandle_t h);
    ~StridedGemmTest();

    std::array<int, 3> TestAlgo(int loops);
};

Import

#include "csrc/includes/gemm_test.h"

I/O Contract

Input Type Description
m, n, k int Matrix dimensions (m×k) × (k×n) = (m×n)
b int Batch size (for strided variant)
ta, tb cublasOperation_t Transpose flags for matrices A and B
h cublasHandle_t cuBLAS handle used for all GEMM calls
loops int Number of timing iterations per algorithm
Output Type Description
algorithm_ids std::array<int,3> Optimal algorithm IDs [forward, bwd_wgrad, bwd_dgrad]
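The dimensions in the table are enough to turn a measured latency into throughput, which makes results easier to compare across shapes. The helper below is a hypothetical sketch, not part of the framework; it relies only on the standard count of 2·m·n·k floating-point operations for an (m×k) × (k×n) product.

```cpp
#include <cassert>

// Hypothetical helper: converts an average latency in milliseconds into
// achieved TFLOP/s. An (m x k) * (k x n) product performs m*n*k multiply-adds,
// i.e. 2*m*n*k floating-point operations; `batch` covers the strided variant.
inline double gemm_tflops(long long m, long long n, long long k,
                          double avg_ms, long long batch = 1)
{
    assert(m > 0 && n > 0 && k > 0 && batch > 0 && avg_ms > 0.0);
    const double flops = 2.0 * batch * m * n * k;  // evaluated in double, no overflow
    return flops / (avg_ms * 1e-3) / 1e12;
}
```

For example, a 1000×1000×1000 GEMM that averages 2 ms corresponds to about 1 TFLOP/s.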

Usage Examples

Standard GEMM Testing:

#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include "csrc/includes/gemm_test.h"

cublasHandle_t handle;
cublasCreate(&handle);

// Test attention QK^T multiplication: (batch*heads*seq × head_dim) × (head_dim × seq)
int M = 12288;  // batch * heads * seq_length
int N = 1024;   // sequence_length
int K = 64;     // head_dimension

GemmTest<__half> test(M, N, K,
                     CUBLAS_OP_T, CUBLAS_OP_N,
                     handle);

auto algos = test.TestAlgo(100);  // 100 timed loops per algorithm
printf("Best algorithms: FW=%d, BWD1=%d, BWD2=%d\n",
       algos[0], algos[1], algos[2]);

cublasDestroy(handle);  // release the handle once tuning is finished

Strided Batched GEMM for Attention:

// Test attention score computation across all heads.
// Assumes `handle` is a valid cublasHandle_t created with cublasCreate.
int batch = 32;
int heads = 16;
int seq_length = 512;
int head_dim = 64;

int bsz = batch * heads;     // 512 batches
int M = seq_length;          // 512
int N = seq_length;          // 512
int K = head_dim;            // 64

StridedGemmTest<__half> attn_test(bsz, M, N, K,
                                  CUBLAS_OP_T, CUBLAS_OP_N,
                                  handle);

auto attn_algos = attn_test.TestAlgo(50);
// attn_algos holds the [forward, bwd_wgrad, bwd_dgrad] algorithm IDs for the attention GEMM
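Before running the strided test on large shapes, it can help to estimate how much device memory the operands need. The sketch below is an assumption-laden illustration: `StridedGemmLayout` and `strided_layout` are hypothetical names, and it assumes densely packed A, B, and C buffers with one matrix per batch entry, which may not match the framework's actual allocation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a densely packed strided batched GEMM.
struct StridedGemmLayout {
    std::size_t stride_a;     // elements between consecutive A matrices (m*k)
    std::size_t stride_b;     // elements between consecutive B matrices (k*n)
    std::size_t stride_c;     // elements between consecutive C matrices (m*n)
    std::size_t total_bytes;  // A + B + C across all batches
};

// Computes per-batch strides and the total operand footprint in bytes.
inline StridedGemmLayout strided_layout(std::size_t bsz, std::size_t m,
                                        std::size_t n, std::size_t k,
                                        std::size_t elem_bytes)
{
    assert(bsz > 0 && m > 0 && n > 0 && k > 0 && elem_bytes > 0);
    StridedGemmLayout layout{};
    layout.stride_a = m * k;
    layout.stride_b = k * n;
    layout.stride_c = m * n;
    layout.total_bytes =
        bsz * (layout.stride_a + layout.stride_b + layout.stride_c) * elem_bytes;
    return layout;
}
```

For the shapes in the example above (512 batches, M = N = 512, K = 64, 2-byte half precision), this gives 335,544,320 bytes, or 320 MiB, most of it in the output C.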

Auto-Tuning at Model Init:

#include <array>
#include <vector>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include "csrc/includes/gemm_test.h"

// Returns one [forward, bwd_wgrad, bwd_dgrad] triple per GEMM, in the order:
// QKV projection, first feed-forward projection, attention scores.
// Returns an empty vector if the cuBLAS handle cannot be created.
std::vector<std::array<int, 3>> tune_gemm_algorithms(
    int hidden_size, int intermediate_size,
    int batch_size, int seq_length, int num_heads)
{
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return {};

    std::vector<std::array<int, 3>> algos;

    // Test QKV projection
    GemmTest<__half> qkv_test(batch_size * seq_length,
                              3 * hidden_size, hidden_size,
                              CUBLAS_OP_N, CUBLAS_OP_N, handle);
    algos.push_back(qkv_test.TestAlgo(20));

    // Test first feed-forward projection (hidden_size -> intermediate_size)
    GemmTest<__half> ff1_test(batch_size * seq_length,
                             intermediate_size, hidden_size,
                             CUBLAS_OP_N, CUBLAS_OP_N, handle);
    algos.push_back(ff1_test.TestAlgo(20));

    // Test attention scores
    int head_dim = hidden_size / num_heads;
    StridedGemmTest<__half> attn_test(batch_size * num_heads,
                                      seq_length, seq_length, head_dim,
                                      CUBLAS_OP_T, CUBLAS_OP_N, handle);
    algos.push_back(attn_test.TestAlgo(20));

    cublasDestroy(handle);
    return algos;
}
