Implementation:Deepspeedai DeepSpeed Evoformer MMA Pipelined
| Knowledge Sources | |
|---|---|
| Domains | Attention, CUTLASS_Kernels, DeepSpeed4Science |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A double-buffered threadblock-scoped GEMM implementation for overlapping data transfers with computation using two stages of shared memory.
Description
CustomMmaPipelined implements a classic double-buffered matrix multiplication kernel with two shared memory stages. While one stage is being computed on by tensor cores, the next stage is being loaded from global memory, effectively hiding memory latency. This implementation works across all GPU architectures (Volta, Turing, Ampere) and serves as a fallback when multistage pipelining cannot be used due to shared memory constraints or when targeting pre-Ampere GPUs. The template supports optional data type transformations (TransformA/TransformB) applied during the global-to-shared memory transfer, enabling mixed-precision computations with on-the-fly conversions.
Usage
This MMA operator is used in Evoformer attention kernels for Volta/Turing GPUs or when shared memory pressure prevents using the multistage variant, providing reliable performance with a simple two-stage software pipeline.
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/deepspeed4science/evoformer_attn/gemm/custom_mma_pipelined.h
Signature
template <
typename Shape_, // Gemm problem shape
typename IteratorA_, // Global memory iterator for A
typename SmemIteratorA_, // Shared memory iterator for A
typename IteratorB_, // Global memory iterator for B
typename SmemIteratorB_, // Shared memory iterator for B
typename ElementC_, // Accumulator data type
typename LayoutC_, // Accumulator layout
typename Policy_, // MMA policy
typename TransformA_ = NumericArrayConverter<
typename SmemIteratorA_::Element,
typename IteratorA_::Element,
IteratorA_::Fragment::kElements>,
typename TransformB_ = NumericArrayConverter<
typename SmemIteratorB_::Element,
typename IteratorB_::Element,
IteratorB_::Fragment::kElements>,
typename Enable = bool>
class CustomMmaPipelined : public CustomMmaBase<Shape_, Policy_, 2>;
Import
#include "csrc/deepspeed4science/evoformer_attn/gemm/custom_mma_pipelined.h"
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| Inputs | ||
| iterator_A | IteratorA | Iterator over global memory tiles of operand A |
| iterator_B | IteratorB | Iterator over global memory tiles of operand B |
| gemm_k_iterations | int | Number of iterations along K dimension |
| Outputs | ||
| accum | FragmentC | Accumulator fragment containing C = A × B results |
| Configuration | ||
| kStages | int | Always 2 for double-buffered pipeline |
| TransformA/B | Functor | Optional data type conversion during load |
Usage Examples
// Configure double-buffered MMA for Volta
using MmaPipelined = cutlass::gemm::threadblock::CustomMmaPipelined<
cutlass::gemm::GemmShape<64, 64, 32>, // Threadblock shape
IteratorA, // A operand iterator
SmemIteratorA, // A shared memory writer
IteratorB, // B operand iterator
SmemIteratorB, // B shared memory writer
float, // Accumulator type
cutlass::layout::RowMajor, // Accumulator layout
MmaPolicy // Warp MMA policy
>;
// Execute GEMM with double-buffering
MmaPipelined mma(shared_storage_A, shared_storage_B, thread_idx, warp_idx, lane_idx);
mma(gemm_k_iterations, accum, iterator_A, iterator_B);