Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Deepspeedai DeepSpeed Evoformer MMA Accum Lambda

From Leeroopedia


Knowledge Sources
Domains Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated 2026-02-09 00:00 GMT

Overview

An iterator utility that maps tensor core accumulator register layout to logical matrix row/column coordinates, enabling efficient per-element and per-row operations on accumulator fragments.

Description

AccumLambdaIteratorSm80 and AccumLambdaIteratorSm70 provide architecture-specific abstractions for navigating the complex accumulator register layouts of tensor cores. Tensor cores distribute accumulator elements across threads in a warp using architecture-dependent patterns that are not sequential. These iterators expose methods like iterateRows that apply lambda functions to each accumulator element along with its logical (row, column) coordinate, and reduceSameRow that performs warp-level reductions across threads holding the same row. The SM80 variant handles Ampere's 16x8x8 tensor core instruction layout, while SM70 handles Volta's 8x8x4 layout. This abstraction is critical for implementing operations like softmax, row-wise max/sum, and layer normalization.

Usage

These iterators are used throughout the Evoformer attention kernels when computing attention statistics (max, sum), applying softmax normalization, or performing gradient computations that require per-row operations on accumulator fragments.

Code Reference

Source Location

Signature

// Ampere (SM80+) variant
template <typename T, typename accum_t, int kWarpSize>
struct AccumLambdaIteratorSm80 {
    using Policy = typename T::Policy;
    using InstructionShape = typename T::InstructionShape;
    using Shape = typename T::Shape;

    static cutlass::MatrixCoord CUTLASS_DEVICE
    get_lane_offset(int8_t lane_id, int8_t warp_id,
                    typename T::TensorCoord const& tile_offset);

    template <typename FA, typename FB, typename FC>
    CUTLASS_DEVICE static void iterateRows(
        cutlass::MatrixCoord& lane_offset,
        FA beginRow, FB op, FC endRow);

    template <typename DT, typename F>
    CUTLASS_DEVICE static bool reduceSameRow(int lane_id, DT& myValue, F fn);
};

// Volta (SM70) variant
template <typename T, typename accum_t, int kWarpSize>
struct AccumLambdaIteratorSm70 { /* similar interface */ };

Import

#include "csrc/deepspeed4science/evoformer_attn/gemm/mma_accum_lambda_iterator.h"

I/O Contract

Method Input Output Description
get_lane_offset lane_id, warp_id, tile_offset MatrixCoord Computes logical (row, col) for thread's first accumulator element
iterateRows lane_offset, beginRow(), op(), endRow() void Applies functors to each element: beginRow(row), op(row, col, idx), endRow(row)
reduceSameRow lane_id, value, reduction_fn bool (true if owns result) Reduces values across threads in same row using warp shuffles

Usage Examples

// Compute row-wise maximum for softmax
using AccumIterator = AccumLambdaIteratorSm80<MmaTensorOp, float, 32>;

auto lane_offset = AccumIterator::get_lane_offset(lane_id, warp_id, tile_offset);

// Find max value in each row
float row_max = -INFINITY;
AccumIterator::iterateRows(
    lane_offset,
    [&](int row) { row_max = -INFINITY; },               // beginRow
    [&](int row, int col, int idx) {                     // op
        row_max = fmaxf(row_max, accum_fragment[idx]);
    },
    [&](int row) {}                                      // endRow
);

// Reduce across threads holding the same row
bool owns_result = AccumIterator::reduceSameRow(
    lane_id, row_max, [](float a, float b) { return fmaxf(a, b); }
);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment