Implementation:Deepspeedai DeepSpeed Evoformer MMA Accum Lambda

Knowledge Sources	DeepSpeed Evoformer Attention
Domains	Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated	2026-02-09 00:00 GMT

Overview

An iterator utility that maps tensor core accumulator register layout to logical matrix row/column coordinates, enabling efficient per-element and per-row operations on accumulator fragments.

Description

AccumLambdaIteratorSm80 and AccumLambdaIteratorSm70 provide architecture-specific abstractions for navigating the complex accumulator register layouts of tensor cores. Tensor cores distribute accumulator elements across threads in a warp using architecture-dependent patterns that are not sequential. These iterators expose methods like iterateRows that apply lambda functions to each accumulator element along with its logical (row, column) coordinate, and reduceSameRow that performs warp-level reductions across threads holding the same row. The SM80 variant handles Ampere's 16x8x8 tensor core instruction layout, while SM70 handles Volta's 8x8x4 layout. This abstraction is critical for implementing operations like softmax, row-wise max/sum, and layer normalization.

Usage

These iterators are used throughout the Evoformer attention kernels when computing attention statistics (max, sum), applying softmax normalization, or performing gradient computations that require per-row operations on accumulator fragments.

Code Reference

Source Location

Repository: DeepSpeed
File: csrc/deepspeed4science/evoformer_attn/gemm/mma_accum_lambda_iterator.h

Signature

// Ampere (SM80+) variant
template <typename T, typename accum_t, int kWarpSize>
struct AccumLambdaIteratorSm80 {
    using Policy = typename T::Policy;
    using InstructionShape = typename T::InstructionShape;
    using Shape = typename T::Shape;

    static cutlass::MatrixCoord CUTLASS_DEVICE
    get_lane_offset(int8_t lane_id, int8_t warp_id,
                    typename T::TensorCoord const& tile_offset);

    template <typename FA, typename FB, typename FC>
    CUTLASS_DEVICE static void iterateRows(
        cutlass::MatrixCoord& lane_offset,
        FA beginRow, FB op, FC endRow);

    template <typename DT, typename F>
    CUTLASS_DEVICE static bool reduceSameRow(int lane_id, DT& myValue, F fn);
};

// Volta (SM70) variant
template <typename T, typename accum_t, int kWarpSize>
struct AccumLambdaIteratorSm70 { /* similar interface */ };

Import

#include "csrc/deepspeed4science/evoformer_attn/gemm/mma_accum_lambda_iterator.h"

I/O Contract

Method	Input	Output	Description
get_lane_offset	lane_id, warp_id, tile_offset	MatrixCoord	Computes logical (row, col) for thread's first accumulator element
iterateRows	lane_offset, beginRow(), op(), endRow()	void	Applies functors to each element: beginRow(row), op(row, col, idx), endRow(row)
reduceSameRow	lane_id, value, reduction_fn	bool (true if owns result)	Reduces values across threads in same row using warp shuffles

Usage Examples

// Compute row-wise maximum for softmax
using AccumIterator = AccumLambdaIteratorSm80<MmaTensorOp, float, 32>;

auto lane_offset = AccumIterator::get_lane_offset(lane_id, warp_id, tile_offset);

// Find max value in each row
float row_max = -INFINITY;
AccumIterator::iterateRows(
    lane_offset,
    [&](int row) { row_max = -INFINITY; },               // beginRow
    [&](int row, int col, int idx) {                     // op
        row_max = fmaxf(row_max, accum_fragment[idx]);
    },
    [&](int row) {}                                      // endRow
);

// Reduce across threads holding the same row
bool owns_result = AccumIterator::reduceSameRow(
    lane_id, row_max, [](float a, float b) { return fmaxf(a, b); }
);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment