Principle:Deepspeedai DeepSpeed Evoformer Attention Kernels

From Leeroopedia


Knowledge Sources
Domains: Scientific_Computing, Attention_Mechanisms, CUDA_Kernels
Last Updated: 2026-02-09 00:00 GMT

Overview

Memory-efficient CUTLASS-based CUDA kernels implementing the attention mechanism for Evoformer architectures used in protein structure prediction models.

Description

Evoformer Attention Kernels implement a highly optimized, memory-efficient attention mechanism tailored for the Evoformer block architecture used in AlphaFold2 and related protein structure prediction models. Unlike standard Transformer attention, which attends along a single sequence axis, Evoformer attention operates on multi-dimensional MSA and pair representations and adds bias terms to the attention scores that encode evolutionary and structural relationships.

Built on NVIDIA's CUTLASS template library, these kernels provide:

  • Custom threadblock-scoped GEMM operations: Specialized matrix-multiply-accumulate (MMA) implementations that fuse the attention score computation (Q*K^T) and the attention-weighted value product (Attn*V) with softmax normalization in a single kernel launch, avoiding materialization of the full attention matrix
  • Pipelined epilogue stages: Multi-stage epilogue pipelines that overlap memory writes with computation, supporting rescaling (for softmax normalization), gradient bias accumulation, and output formatting
  • Specialized tile iterators: Custom iterators for accessing pair representation tensors with residual tile handling (for dimensions that are not multiples of the tile size), atomic gradient accumulation for backward pass parallelism, and shared-memory-backed warp-level iteration
  • Bias broadcasting: Efficient broadcasting of pair bias tensors across the batch and head dimensions of the attention computation
  • Log-sum-exp tracking: Numerically stable log-sum-exp computation for the softmax denominator, enabling a memory-efficient backward pass without storing the full attention matrix
  • Forward and backward kernels: Complete differentiable implementation with optimized forward and backward passes
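The bias broadcasting item above can be illustrated with a zero-stride view: a dimension that is broadcast gets stride 0, so every index along it maps to the same stored element. The following is a minimal host-side sketch, not the kernel's actual iterator. The `BiasView` type, the `make_batch_broadcast_view` helper, and the assumed storage layout `[heads, n_q, n_k]` are hypothetical names and choices made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical strided view over a bias tensor with logical shape
// [batch, heads, n_q, n_k]. A stride of 0 means that dimension is broadcast.
struct BiasView {
    const float* data;
    std::size_t stride_batch;
    std::size_t stride_head;
    std::size_t stride_q;
    std::size_t stride_k;

    // Read the bias for (batch b, head h, query i, key j).
    float at(std::size_t b, std::size_t h, std::size_t i, std::size_t j) const {
        return data[b * stride_batch + h * stride_head + i * stride_q + j * stride_k];
    }
};

// Assumed storage layout (illustrative): [heads, n_q, n_k], row-major,
// shared by every batch entry, so the batch stride is 0.
BiasView make_batch_broadcast_view(const std::vector<float>& storage,
                                   std::size_t heads,
                                   std::size_t n_q,
                                   std::size_t n_k) {
    assert(storage.size() == heads * n_q * n_k);
    return BiasView{storage.data(), 0, n_q * n_k, n_k, 1};
}
```

With this view, `at(0, h, i, j)` and `at(3, h, i, j)` read the same element, so the bias is never copied per batch entry. A bias broadcast over heads would set `stride_head` to 0 in the same way.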

Usage

These kernels are invoked through DeepSpeed's Evoformer attention operator, typically used when training or running inference on AlphaFold2-style models. The operator is JIT-compiled via the EvoformerAttnBuilder op builder and requires an NVIDIA GPU with compute capability >= 7.0 (Volta or newer) as well as the CUTLASS headers.

Theoretical Basis

Evoformer attention extends standard multi-head attention with pair bias terms. For query Q, key K, value V, and pair bias B, the attention output is:

Attention(Q, K, V, B) = softmax(Q * K^T / sqrt(d_k) + B) * V
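As a point of reference, the formula can be written as a naive CPU routine for a single head. This is a minimal sketch for checking results, not the CUTLASS kernel. The function name `reference_attention` and the row-major layouts (Q `[n_q, d]`, K `[n_k, d]`, V `[n_k, d_v]`, B `[n_q, n_k]`) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive single-head attention with an additive bias (illustrative reference).
// Returns O with shape [n_q, d_v]. One full row of scores is materialized at a time.
std::vector<float> reference_attention(const std::vector<float>& Q,
                                       const std::vector<float>& K,
                                       const std::vector<float>& V,
                                       const std::vector<float>& B,
                                       std::size_t n_q, std::size_t n_k,
                                       std::size_t d, std::size_t d_v) {
    assert(n_k > 0 && d > 0);
    assert(Q.size() == n_q * d && K.size() == n_k * d);
    assert(V.size() == n_k * d_v && B.size() == n_q * n_k);

    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    std::vector<float> out(n_q * d_v, 0.0f);
    std::vector<float> s(n_k);

    for (std::size_t i = 0; i < n_q; ++i) {
        // Scores for row i: s_j = q_i . k_j / sqrt(d) + B_ij
        float row_max = -INFINITY;
        for (std::size_t j = 0; j < n_k; ++j) {
            float dot = 0.0f;
            for (std::size_t c = 0; c < d; ++c) dot += Q[i * d + c] * K[j * d + c];
            s[j] = dot * scale + B[i * n_k + j];
            row_max = std::max(row_max, s[j]);
        }
        // Softmax with the row maximum subtracted for numerical stability.
        float denom = 0.0f;
        for (std::size_t j = 0; j < n_k; ++j) {
            s[j] = std::exp(s[j] - row_max);
            denom += s[j];
        }
        // Weighted sum of the value rows.
        for (std::size_t j = 0; j < n_k; ++j) {
            const float p = s[j] / denom;
            for (std::size_t c = 0; c < d_v; ++c) out[i * d_v + c] += p * V[j * d_v + c];
        }
    }
    return out;
}
```

A large negative bias entry acts as a mask: with B_ij set to about -1e9, the corresponding key receives essentially zero weight.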

Memory-efficient formulation: Standard attention materializes the full N x N attention matrix, requiring O(N^2) memory. The memory-efficient approach computes attention in tiles, maintaining only a running maximum and running sum (together equivalent to a running log-sum-exp) per query row for softmax normalization:

  1. Divide Q into blocks of size B_q and K, V into blocks of size B_kv
  2. For each Q block, iterate over K/V blocks, accumulating weighted values and updating the log-sum-exp
  3. Normalize the accumulated output by the final softmax denominator and keep the per-row log-sum-exp for the backward pass

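This reduces the extra memory needed for attention from O(N^2) to O(N): only a B_q x B_kv score tile is live on chip at any time, and a single log-sum-exp value per query row is written out alongside the output. This enables longer sequences.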
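The three steps above can be sketched on the CPU for a single query row. This is a minimal illustration of the online softmax recurrence, not the DeepSpeed kernel. The names `RowResult` and `streaming_attention_row` are hypothetical, and the sketch assumes at least one finite score per row.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct RowResult {
    std::vector<float> out;  // attention output for this query row, length d_v
    float lse;               // log-sum-exp of the biased scores of this row
};

// Streaming attention for ONE query row q (length d) against K [n_k, d] and
// V [n_k, d_v], with bias row b (length n_k). Keys are visited in blocks of
// block_kv, so the scratch space is O(block_kv + d_v) instead of O(n_k).
RowResult streaming_attention_row(const std::vector<float>& q,
                                  const std::vector<float>& K,
                                  const std::vector<float>& V,
                                  const std::vector<float>& b,
                                  std::size_t n_k, std::size_t d,
                                  std::size_t d_v, std::size_t block_kv) {
    assert(block_kv > 0 && n_k > 0 && d > 0);
    assert(q.size() == d && K.size() == n_k * d);
    assert(V.size() == n_k * d_v && b.size() == n_k);

    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    float running_max = -INFINITY;
    float running_sum = 0.0f;
    std::vector<float> acc(d_v, 0.0f);
    std::vector<float> s(block_kv);

    for (std::size_t start = 0; start < n_k; start += block_kv) {
        const std::size_t len = std::min(block_kv, n_k - start);  // residual block
        float block_max = -INFINITY;
        for (std::size_t j = 0; j < len; ++j) {
            float dot = 0.0f;
            for (std::size_t c = 0; c < d; ++c) dot += q[c] * K[(start + j) * d + c];
            s[j] = dot * scale + b[start + j];
            block_max = std::max(block_max, s[j]);
        }
        // Rescale everything accumulated so far to the new running maximum.
        // On the first block this is exp(-inf) = 0, which is harmless.
        const float new_max = std::max(running_max, block_max);
        const float rescale = std::exp(running_max - new_max);
        running_sum *= rescale;
        for (std::size_t c = 0; c < d_v; ++c) acc[c] *= rescale;
        // Add this block's unnormalized contribution.
        for (std::size_t j = 0; j < len; ++j) {
            const float p = std::exp(s[j] - new_max);
            running_sum += p;
            for (std::size_t c = 0; c < d_v; ++c) acc[c] += p * V[(start + j) * d_v + c];
        }
        running_max = new_max;
    }
    // Final normalization by the softmax denominator.
    for (std::size_t c = 0; c < d_v; ++c) acc[c] /= running_sum;
    return RowResult{acc, running_max + std::log(running_sum)};
}
```

Calling the function with `block_kv` equal to `n_k` processes the row in one block, which makes it easy to confirm that smaller block sizes give the same output up to floating-point rounding.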

CUTLASS tiling hierarchy:

  • Threadblock level: Each threadblock computes a tile of the output (e.g., 64x64)
  • Warp level: Within a threadblock, warps execute MMA instructions on sub-tiles (e.g., 16x16 using Tensor Cores)
  • Thread level: Individual threads handle element-wise operations in the epilogue (rescaling, bias addition)
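At the threadblock level, the number of tiles and the extent of each tile follow from a ceiling division. Edge tiles are clipped when the problem size is not a multiple of the tile size, which is the residual case the specialized iterators handle. A minimal host-side sketch follows. `TileExtent`, `ceil_div`, and `tile_extent` are hypothetical helpers, and the 64x64 tile mentioned in the note after the code is only the example size used above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Rows and columns of the output covered by one threadblock tile.
struct TileExtent {
    std::size_t row_begin;
    std::size_t rows;
    std::size_t col_begin;
    std::size_t cols;
};

// Number of tiles of size b needed to cover a elements.
std::size_t ceil_div(std::size_t a, std::size_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Extent of tile (tile_r, tile_c) for an M x N output split into
// tile_m x tile_n tiles. Tiles on the bottom or right edge are clipped.
TileExtent tile_extent(std::size_t M, std::size_t N,
                       std::size_t tile_m, std::size_t tile_n,
                       std::size_t tile_r, std::size_t tile_c) {
    assert(tile_r < ceil_div(M, tile_m) && tile_c < ceil_div(N, tile_n));
    const std::size_t row_begin = tile_r * tile_m;
    const std::size_t col_begin = tile_c * tile_n;
    return TileExtent{row_begin, std::min(tile_m, M - row_begin),
                      col_begin, std::min(tile_n, N - col_begin)};
}
```

For a 100 x 130 output with 64x64 tiles, this gives a 2 x 3 grid of tiles, and the last tile covers only 36 rows and 2 columns.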

Pipelined execution: The multistage pipeline overlaps global memory loads (for upcoming tiles) with computation on the tile already resident in shared memory, hiding memory latency:

// Abstract pipelined Evoformer attention pattern (pseudocode, one Q tile per threadblock)
template <int STAGES>
void evoformer_attention_kernel(Q, K, V, Bias, Output, LogSumExp) {
    // Prologue: fill the pipeline with the first STAGES K/V tiles
    for (int s = 0; s < STAGES && s < num_kv_tiles; s++)
        async_load_to_smem(K_smem[s], V_smem[s], K_tile[s], V_tile[s]);

    // Softmax state is kept per query row; a single row is shown for brevity
    float running_max = -INFINITY;
    float running_sum = 0;
    float accumulator[TILE_M][TILE_N] = {0};

    for (int tile = 0; tile < num_kv_tiles; tile++) {
        int slot = tile % STAGES;
        wait_for_smem(slot);

        // Compute attention scores: S = Q * K^T / sqrt(d_k) + Bias
        mma(S, Q_tile, K_smem[slot]);
        scale(S, 1.0 / sqrt(d_k));
        add_bias(S, Bias_tile[tile]);

        // Online softmax update
        float new_max = max(running_max, reduce_max(S));
        float rescale = exp(running_max - new_max);
        running_max = new_max;

        // Rescale the previous sum and accumulator, then add this tile's contribution
        P = exp(S - running_max);
        running_sum = running_sum * rescale + reduce_sum(P);
        scale(accumulator, rescale);
        mma_accumulate(accumulator, P, V_smem[slot]);

        // Prefetch: refill the slot just consumed with tile + STAGES
        // (all warps must have finished reading the slot first)
        if (tile + STAGES < num_kv_tiles)
            async_load_to_smem(K_smem[slot], V_smem[slot],
                               K_tile[tile + STAGES], V_tile[tile + STAGES]);
    }

    // Final normalization; keep the log-sum-exp for the backward pass
    scale(accumulator, 1.0 / running_sum);
    store(Output, accumulator);
    store(LogSumExp, running_max + log(running_sum));
}

Backward pass: The backward kernel recomputes attention weights from Q, K, the bias, and the stored log-sum-exp (rather than storing the full attention matrix), computing gradients for Q, K, V, and optionally the bias tensor. Atomic operations are used for bias gradient accumulation when multiple threadblocks contribute to the same bias element.
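The recomputation step relies on the identity P_ij = exp(S_ij - LSE_i), where S holds the biased scores and LSE_i is the stored log-sum-exp of row i. The sketch below shows this for one row, together with the standard softmax gradient with respect to the scores. Because the bias is added directly to the scores, its gradient is this score gradient summed over any broadcast dimensions. The function names `recompute_probs` and `score_grad_row` are hypothetical, and the code is a CPU illustration rather than the backward kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Recompute the softmax probabilities of one row from its biased scores s
// and the log-sum-exp saved by the forward pass: p_j = exp(s_j - lse).
std::vector<float> recompute_probs(const std::vector<float>& s, float lse) {
    std::vector<float> p(s.size());
    for (std::size_t j = 0; j < s.size(); ++j) p[j] = std::exp(s[j] - lse);
    return p;
}

// Gradient with respect to the biased scores of one row, given the
// probabilities p and the upstream gradient dP with respect to p:
//   dS_j = p_j * (dP_j - sum_k p_k * dP_k)
std::vector<float> score_grad_row(const std::vector<float>& p,
                                  const std::vector<float>& dP) {
    assert(p.size() == dP.size());
    float weighted = 0.0f;
    for (std::size_t k = 0; k < p.size(); ++k) weighted += p[k] * dP[k];
    std::vector<float> dS(p.size());
    for (std::size_t j = 0; j < p.size(); ++j) dS[j] = p[j] * (dP[j] - weighted);
    return dS;
}
```

On the GPU, several threadblocks may produce score gradients that map to the same bias element when the bias is broadcast, which is where the atomic accumulation described above comes in.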
