Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Deepspeedai DeepSpeed Evoformer MMA Pipelined

From Leeroopedia


Knowledge Sources
Domains Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated 2026-02-09 00:00 GMT

Overview

A double-buffered threadblock-scoped GEMM implementation for overlapping data transfers with computation using two stages of shared memory.

Description

CustomMmaPipelined implements a classic double-buffered matrix multiplication kernel with two shared memory stages. While one stage is being computed on by tensor cores, the next stage is being loaded from global memory, effectively hiding memory latency. This implementation works across all GPU architectures (Volta, Turing, Ampere) and serves as a fallback when multistage pipelining cannot be used due to shared memory constraints or when targeting pre-Ampere GPUs. The template supports optional data type transformations (TransformA/TransformB) applied during the global-to-shared memory transfer, enabling mixed-precision computations with on-the-fly conversions.

Usage

This MMA operator is used in Evoformer attention kernels for Volta/Turing GPUs or when shared memory pressure prevents using the multistage variant, providing reliable performance with a simple two-stage software pipeline.

Code Reference

Source Location

Signature

template <
    typename Shape_,                    // Gemm problem shape
    typename IteratorA_,                // Global memory iterator for A
    typename SmemIteratorA_,            // Shared memory iterator for A
    typename IteratorB_,                // Global memory iterator for B
    typename SmemIteratorB_,            // Shared memory iterator for B
    typename ElementC_,                 // Accumulator data type
    typename LayoutC_,                  // Accumulator layout
    typename Policy_,                   // MMA policy
    typename TransformA_ = NumericArrayConverter<
        typename SmemIteratorA_::Element,
        typename IteratorA_::Element,
        IteratorA_::Fragment::kElements>,
    typename TransformB_ = NumericArrayConverter<
        typename SmemIteratorB_::Element,
        typename IteratorB_::Element,
        IteratorB_::Fragment::kElements>,
    typename Enable = bool>
class CustomMmaPipelined : public CustomMmaBase<Shape_, Policy_, 2>;

Import

#include "csrc/deepspeed4science/evoformer_attn/gemm/custom_mma_pipelined.h"

I/O Contract

Parameter Type Description
Inputs
iterator_A IteratorA Iterator over global memory tiles of operand A
iterator_B IteratorB Iterator over global memory tiles of operand B
gemm_k_iterations int Number of iterations along K dimension
Outputs
accum FragmentC Accumulator fragment containing C = A × B results
Configuration
kStages int Always 2 for double-buffered pipeline
TransformA/B Functor Optional data type conversion during load

Usage Examples

// Configure double-buffered MMA for Volta
using MmaPipelined = cutlass::gemm::threadblock::CustomMmaPipelined<
    cutlass::gemm::GemmShape<64, 64, 32>,   // Threadblock shape
    IteratorA,                               // A operand iterator
    SmemIteratorA,                           // A shared memory writer
    IteratorB,                               // B operand iterator
    SmemIteratorB,                           // B shared memory writer
    float,                                   // Accumulator type
    cutlass::layout::RowMajor,               // Accumulator layout
    MmaPolicy                                // Warp MMA policy
>;

// Execute GEMM with double-buffering
MmaPipelined mma(shared_storage_A, shared_storage_B, thread_idx, warp_idx, lane_idx);
mma(gemm_k_iterations, accum, iterator_A, iterator_B);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment