Implementation:Deepspeedai DeepSpeed Evoformer MMA Pipelined

Knowledge Sources	DeepSpeed Evoformer Attention
Domains	Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated	2026-02-09 00:00 GMT

Overview

A double-buffered threadblock-scoped GEMM implementation for overlapping data transfers with computation using two stages of shared memory.

Description

CustomMmaPipelined implements a classic double-buffered matrix multiplication kernel with two shared memory stages. While one stage is being computed on by tensor cores, the next stage is being loaded from global memory, effectively hiding memory latency. This implementation works across all GPU architectures (Volta, Turing, Ampere) and serves as a fallback when multistage pipelining cannot be used due to shared memory constraints or when targeting pre-Ampere GPUs. The template supports optional data type transformations (TransformA/TransformB) applied during the global-to-shared memory transfer, enabling mixed-precision computations with on-the-fly conversions.

Usage

This MMA operator is used in Evoformer attention kernels for Volta/Turing GPUs or when shared memory pressure prevents using the multistage variant, providing reliable performance with a simple two-stage software pipeline.

Code Reference

Source Location

Repository: DeepSpeed
File: csrc/deepspeed4science/evoformer_attn/gemm/custom_mma_pipelined.h

Signature

template <
    typename Shape_,                    // Gemm problem shape
    typename IteratorA_,                // Global memory iterator for A
    typename SmemIteratorA_,            // Shared memory iterator for A
    typename IteratorB_,                // Global memory iterator for B
    typename SmemIteratorB_,            // Shared memory iterator for B
    typename ElementC_,                 // Accumulator data type
    typename LayoutC_,                  // Accumulator layout
    typename Policy_,                   // MMA policy
    typename TransformA_ = NumericArrayConverter<
        typename SmemIteratorA_::Element,
        typename IteratorA_::Element,
        IteratorA_::Fragment::kElements>,
    typename TransformB_ = NumericArrayConverter<
        typename SmemIteratorB_::Element,
        typename IteratorB_::Element,
        IteratorB_::Fragment::kElements>,
    typename Enable = bool>
class CustomMmaPipelined : public CustomMmaBase<Shape_, Policy_, 2>;

Import

#include "csrc/deepspeed4science/evoformer_attn/gemm/custom_mma_pipelined.h"

I/O Contract

Parameter	Type	Description
Inputs
iterator_A	IteratorA	Iterator over global memory tiles of operand A
iterator_B	IteratorB	Iterator over global memory tiles of operand B
gemm_k_iterations	int	Number of iterations along K dimension
Outputs
accum	FragmentC	Accumulator fragment containing C = A × B results
Configuration
kStages	int	Always 2 for double-buffered pipeline
TransformA/B	Functor	Optional data type conversion during load

Usage Examples

// Configure double-buffered MMA for Volta
using MmaPipelined = cutlass::gemm::threadblock::CustomMmaPipelined<
    cutlass::gemm::GemmShape<64, 64, 32>,   // Threadblock shape
    IteratorA,                               // A operand iterator
    SmemIteratorA,                           // A shared memory writer
    IteratorB,                               // B operand iterator
    SmemIteratorB,                           // B shared memory writer
    float,                                   // Accumulator type
    cutlass::layout::RowMajor,               // Accumulator layout
    MmaPolicy                                // Warp MMA policy
>;

// Execute GEMM with double-buffering
MmaPipelined mma(shared_storage_A, shared_storage_B, thread_idx, warp_idx, lane_idx);
mma(gemm_k_iterations, accum, iterator_A, iterator_B);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment