Principle:Deepspeedai DeepSpeed Evoformer Attention Kernels

Knowledge Sources	DeepSpeed; Highly Accurate Protein Structure Prediction with AlphaFold
Domains	Scientific_Computing, Attention_Mechanisms, CUDA_Kernels
Last Updated	2026-02-09 00:00 GMT

Overview

Memory-efficient CUTLASS-based CUDA kernels implementing the attention mechanism for Evoformer architectures used in protein structure prediction models.

Description

Evoformer Attention Kernels implement a highly optimized, memory-efficient attention mechanism tailored for the Evoformer block architecture used in AlphaFold2 and related protein structure prediction models. Unlike standard Transformer attention, which operates on a one-dimensional token sequence, Evoformer attention operates on two-dimensional MSA and pair representations and adds bias terms to the attention logits that encode evolutionary and structural relationships.

Built on NVIDIA's CUTLASS template library, these kernels provide:

Custom threadblock-scoped GEMM operations: Specialized matrix-multiply-accumulate (MMA) implementations that fuse the attention score computation (Q*K^T) and the attention-weighted value product (Attn*V) with softmax normalization in a single kernel launch, avoiding materialization of the full attention matrix
Pipelined epilogue stages: Multi-stage epilogue pipelines that overlap memory writes with computation, supporting rescaling (for softmax normalization), gradient bias accumulation, and output formatting
Specialized tile iterators: Custom iterators for accessing pair representation tensors with residual tile handling (for dimensions that are not multiples of the tile size), atomic gradient accumulation for backward pass parallelism, and shared-memory-backed warp-level iteration
Bias broadcasting: Efficient broadcasting of pair bias tensors across the batch and head dimensions of the attention computation
Log-sum-exp tracking: Numerically stable log-sum-exp computation for the softmax denominator, enabling a memory-efficient backward pass that does not store the full attention matrix
Forward and backward kernels: Complete differentiable implementation with optimized forward and backward passes
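
The bias broadcasting item can be illustrated with a small stride-based indexing sketch. It assumes a bias with logical shape (batch, head, query, key) in which any stored dimension of size 1 is broadcast by giving it a stride of 0. The names BiasView, make_bias_view, and bias_offset are hypothetical and are not taken from the DeepSpeed sources; the actual kernels may organize this differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 4-D view: (batch, head, query, key). Not a DeepSpeed type.
struct BiasView {
    std::array<std::size_t, 4> shape;    // stored extents; 1 means "broadcast"
    std::array<std::size_t, 4> strides;  // element strides; 0 for broadcast dims
};

// Build row-major strides, assigning stride 0 to every size-1 dimension so
// that all indices along that dimension map to the same stored element.
inline BiasView make_bias_view(std::array<std::size_t, 4> shape) {
    BiasView v{shape, {0, 0, 0, 0}};
    std::size_t running = 1;
    for (int d = 3; d >= 0; --d) {
        assert(shape[d] > 0);
        v.strides[d] = (shape[d] == 1) ? 0 : running;
        running *= shape[d];
    }
    return v;
}

// Linear offset of logical element (b, h, i, j) in the stored bias buffer.
inline std::size_t bias_offset(const BiasView& v, std::size_t b, std::size_t h,
                               std::size_t i, std::size_t j) {
    return b * v.strides[0] + h * v.strides[1] + i * v.strides[2] + j * v.strides[3];
}
```

With a stored shape of (1, 4, 8, 8), every batch index resolves to the same element, so the bias is never physically replicated across the batch.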

Usage

These kernels are invoked through DeepSpeed's Evoformer attention operator, typically when training or running inference on AlphaFold2-style models. The operator is JIT-compiled via the EvoformerAttnBuilder op builder and requires an NVIDIA GPU with compute capability >= 7.0 (Volta or newer) and the CUTLASS headers.

Theoretical Basis

Evoformer attention extends standard multi-head attention with pair bias terms. For query Q, key K, value V, and pair bias B, the attention output is:

Attention(Q, K, V, B) = softmax(Q * K^T / sqrt(d_k) + B) * V
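
As a point of comparison for the fused kernels, the formula above can be written as a plain CPU reference for a single head. This is a minimal sketch, not the DeepSpeed implementation: the function name attention_reference and the flat row-major layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference (non-tiled) attention for a single head.
// q: [n_q x d], k: [n_k x d], v: [n_k x d], bias: [n_q x n_k], all row-major.
// Returns out: [n_q x d]. One full row of scores is materialized at a time.
inline std::vector<float> attention_reference(const std::vector<float>& q,
                                              const std::vector<float>& k,
                                              const std::vector<float>& v,
                                              const std::vector<float>& bias,
                                              std::size_t n_q, std::size_t n_k,
                                              std::size_t d) {
    assert(q.size() == n_q * d && k.size() == n_k * d && v.size() == n_k * d);
    assert(bias.size() == n_q * n_k && n_k > 0 && d > 0);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    std::vector<float> out(n_q * d, 0.0f);
    std::vector<float> s(n_k);
    for (std::size_t i = 0; i < n_q; ++i) {
        // Scores: s_j = (q_i . k_j) / sqrt(d) + bias_ij
        float row_max = -INFINITY;
        for (std::size_t j = 0; j < n_k; ++j) {
            float dot = 0.0f;
            for (std::size_t c = 0; c < d; ++c) dot += q[i * d + c] * k[j * d + c];
            s[j] = dot * scale + bias[i * n_k + j];
            row_max = std::fmax(row_max, s[j]);
        }
        // Softmax with max subtraction for numerical stability.
        float denom = 0.0f;
        for (std::size_t j = 0; j < n_k; ++j) {
            s[j] = std::exp(s[j] - row_max);
            denom += s[j];
        }
        // Weighted sum of value rows.
        for (std::size_t j = 0; j < n_k; ++j)
            for (std::size_t c = 0; c < d; ++c)
                out[i * d + c] += (s[j] / denom) * v[j * d + c];
    }
    return out;
}
```

A reference of this kind is mainly useful for checking a fused kernel's output on small inputs, since it trades all performance for readability.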

Memory-efficient formulation: Standard attention materializes the full N x N attention matrix, requiring O(N^2) memory. The memory-efficient approach computes attention in tiles, maintaining only a running maximum and a running sum per query row (which together give the log-sum-exp) for softmax normalization:

Divide Q into blocks of size B_q and K, V into blocks of size B_kv
For each Q block, iterate over K/V blocks, accumulating weighted values and updating the log-sum-exp
Rescale the accumulated output using the final log-sum-exp

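This avoids the O(N^2) attention matrix: each threadblock holds only a B_q x B_kv score tile and its output accumulator on chip, and the only extra global state is one log-sum-exp value per query row, i.e. O(N). This enables longer sequences.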
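
The steps above can be sketched on the CPU for a single query row. This is an illustrative sketch under stated assumptions, not the CUTLASS kernel: the scores are assumed to be precomputed (already scaled, with bias added) and finite, and the names RowResult and attention_row_tiled are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct RowResult {
    std::vector<float> out;  // attention output for this query row, length d
    float lse;               // log-sum-exp of the row's scores
};

// Processes the scores s[j] for one query row in key/value blocks of size
// block_kv. Only the running max, the running sum, and a length-d accumulator
// are carried from one block to the next.
inline RowResult attention_row_tiled(const std::vector<float>& s,
                                     const std::vector<float>& v,  // [n_k x d]
                                     std::size_t d, std::size_t block_kv) {
    const std::size_t n_k = s.size();
    assert(n_k > 0 && block_kv > 0 && v.size() == n_k * d);
    float running_max = -INFINITY;
    float running_sum = 0.0f;
    std::vector<float> acc(d, 0.0f);
    for (std::size_t start = 0; start < n_k; start += block_kv) {
        const std::size_t end = std::min(start + block_kv, n_k);  // residual block
        float block_max = -INFINITY;
        for (std::size_t j = start; j < end; ++j) block_max = std::fmax(block_max, s[j]);
        const float new_max = std::fmax(running_max, block_max);
        // Rescale what was accumulated under the old max. exp(-inf) == 0 on
        // the first block, which leaves the zero-initialized state unchanged.
        const float rescale = std::exp(running_max - new_max);
        running_sum *= rescale;
        for (float& a : acc) a *= rescale;
        for (std::size_t j = start; j < end; ++j) {
            const float p = std::exp(s[j] - new_max);
            running_sum += p;
            for (std::size_t c = 0; c < d; ++c) acc[c] += p * v[j * d + c];
        }
        running_max = new_max;
    }
    for (float& a : acc) a /= running_sum;  // final normalization
    return {acc, running_max + std::log(running_sum)};
}
```

The result does not depend on the block size up to floating-point rounding, which is what allows the kernel to pick tile sizes for performance alone.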

CUTLASS tiling hierarchy:

Threadblock level: Each threadblock computes a tile of the output (e.g., 64x64)
Warp level: Within a threadblock, warps execute MMA instructions on sub-tiles (e.g., 16x16 using Tensor Cores)
Thread level: Individual threads handle element-wise operations in the epilogue (rescaling, bias addition)
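
To make the hierarchy concrete, the following sketch counts tiles for the example sizes quoted above (64x64 threadblock tiles, 16x16 MMA sub-tiles) and computes the extent of a residual tile when a dimension is not a multiple of the tile size. The helper names are hypothetical, and the tile sizes are only the examples from the text, not the values the kernels necessarily select.

```cpp
#include <cassert>
#include <cstddef>

// Number of tiles of size `tile` needed to cover `n` elements.
inline std::size_t ceil_div(std::size_t n, std::size_t tile) {
    assert(tile > 0);
    return (n + tile - 1) / tile;
}

// Valid extent of tile `idx`: full tiles return `tile`; the last (residual)
// tile returns the remainder when `n` is not a multiple of `tile`.
inline std::size_t tile_extent(std::size_t n, std::size_t tile, std::size_t idx) {
    assert(idx < ceil_div(n, tile));
    const std::size_t start = idx * tile;
    return (n - start < tile) ? (n - start) : tile;
}

struct TileCounts {
    std::size_t score_tiles;         // 64x64 tiles covering the n_q x n_k scores
    std::size_t mma_tiles_per_tile;  // 16x16 sub-tiles inside one 64x64 tile
};

// Tile counts for an n_q x n_k score matrix using the example sizes.
inline TileCounts count_tiles(std::size_t n_q, std::size_t n_k) {
    constexpr std::size_t kTbTile = 64, kMmaTile = 16;
    return {ceil_div(n_q, kTbTile) * ceil_div(n_k, kTbTile),
            (kTbTile / kMmaTile) * (kTbTile / kMmaTile)};
}
```

For a dimension of 100 elements, for example, the second 64-wide tile covers only 36 valid elements, which is the case the residual tile iterators exist to handle.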

Pipelined execution: The multistage pipeline overlaps global memory loads (for the next tile) with shared memory computation (for the current tile), hiding memory latency:

// Abstract pipelined Evoformer attention pattern (pseudocode, one Q tile per threadblock)
template <int STAGES>
void evoformer_attention_kernel(Q, K, V, Bias, Output, LogSumExp) {
    // Prologue: fill the pipeline with up to STAGES K/V tiles
    for (int s = 0; s < min(STAGES, num_kv_tiles); s++)
        async_load_to_smem(K_smem[s], V_smem[s], K_tile[s], V_tile[s]);

    // Per-query-row softmax state (shown as scalars for brevity)
    float running_max = -INFINITY;
    float running_sum = 0;
    float accumulator[TILE_M][TILE_N] = {0};

    for (int tile = 0; tile < num_kv_tiles; tile++) {
        int slot = tile % STAGES;
        wait_for_smem(slot);

        // Compute attention scores: S = Q * K^T / sqrt(d_k) + Bias
        mma_accumulate(S, Q_tile, K_smem[slot]);
        scale(S, 1.0 / sqrt(d_k));
        add_bias(S, Bias_tile[tile]);

        // Online softmax update
        float tile_max = reduce_max(S);
        float new_max = max(running_max, tile_max);
        float rescale = exp(running_max - new_max);
        running_max = new_max;

        // Rescale the previous state, then add this tile's contribution:
        // P = exp(S - running_max); accumulator += P * V; returns row_sum(P)
        running_sum *= rescale;
        scale(accumulator, rescale);
        running_sum += softmax_and_accumulate(accumulator, S, running_max, V_smem[slot]);

        // Prefetch tile (tile + STAGES) into the slot that was just consumed
        if (tile + STAGES < num_kv_tiles)
            async_load_to_smem(K_smem[slot], V_smem[slot],
                               K_tile[tile + STAGES], V_tile[tile + STAGES]);
    }

    // Final normalization; the log-sum-exp is kept for the backward pass
    scale(accumulator, 1.0 / running_sum);
    store(Output, accumulator);
    store(LogSumExp, running_max + log(running_sum));
}

Backward pass: The backward kernel recomputes attention weights from Q, K, the bias, and the stored log-sum-exp (rather than storing the full attention matrix), computing gradients for Q, K, V, and optionally the bias tensor. Atomic operations are used for bias gradient accumulation when multiple threadblocks contribute to the same bias element.
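
The recomputation step can be sketched for one query row as follows. This is a CPU illustration of the standard softmax backward identity, not the DeepSpeed kernel; the function name score_grad_row and the argument layout are assumptions. Because the bias is added directly to the scores, the score gradient is also the contribution to the bias gradient, which is then summed over any broadcast dimensions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Gradient of the loss with respect to one row of pre-softmax scores.
// s:   scores for this query row (scaled, bias added), length n_k
// lse: log-sum-exp of s saved by the forward pass
// v:   values, [n_k x d] row-major
// o, d_o: forward output for this row and its incoming gradient, length d
inline std::vector<float> score_grad_row(const std::vector<float>& s, float lse,
                                         const std::vector<float>& v,
                                         const std::vector<float>& o,
                                         const std::vector<float>& d_o) {
    const std::size_t n_k = s.size(), d = o.size();
    assert(v.size() == n_k * d && d_o.size() == d);
    // delta = dO . O, which equals sum_j p_j * dP_j without needing p.
    float delta = 0.0f;
    for (std::size_t c = 0; c < d; ++c) delta += d_o[c] * o[c];
    std::vector<float> d_s(n_k);
    for (std::size_t j = 0; j < n_k; ++j) {
        const float p = std::exp(s[j] - lse);  // recomputed attention weight
        float d_p = 0.0f;                      // dP_j = dO . v_j
        for (std::size_t c = 0; c < d; ++c) d_p += d_o[c] * v[j * d + c];
        d_s[j] = p * (d_p - delta);            // softmax backward
    }
    return d_s;  // also this row's contribution to the bias gradient
}
```

Each entry needs only the saved log-sum-exp and the forward output for its row, so no attention matrix has to be kept between the forward and backward passes.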

Related Pages

Implemented By

Implementation:Deepspeedai_DeepSpeed_Evoformer_Epilogue_Pipelined — Multi-stage pipelined epilogue for overlapped output writes
Implementation:Deepspeedai_DeepSpeed_Evoformer_MMA_Multistage — Multi-stage MMA pipeline for attention score computation
Implementation:Deepspeedai_DeepSpeed_Evoformer_MMA_Pipelined — Two-stage pipelined MMA implementation
Implementation:Deepspeedai_DeepSpeed_Evoformer_MMA_Accum_Lambda — Lambda-based accumulator for flexible MMA output processing
Implementation:Deepspeedai_DeepSpeed_Evoformer_MMA_From_Smem — Shared-memory-sourced MMA operand loading
Implementation:Deepspeedai_DeepSpeed_Evoformer_Epilogue_Tile_Iterator — Custom tile iterator for epilogue output traversal
Implementation:Deepspeedai_DeepSpeed_Evoformer_Tile_Access_Iterator_Residual — Tile iterator with residual handling for non-aligned dimensions
Implementation:Deepspeedai_DeepSpeed_Evoformer_Tile_Iterator_Atomic — Atomic tile iterator for concurrent gradient accumulation
Implementation:Deepspeedai_DeepSpeed_Evoformer_Tile_Iterator_Residual — Residual-aware tile iterator for boundary tiles
Implementation:Deepspeedai_DeepSpeed_Evoformer_Kernel_Backward — Complete backward pass kernel for attention gradients
Implementation:Deepspeedai_DeepSpeed_Evoformer_Kernel_Forward — Complete forward pass kernel for attention computation
Implementation:Deepspeedai_DeepSpeed_Evoformer_Epilogue_Grad_Bias — Epilogue stage for bias gradient computation
Implementation:Deepspeedai_DeepSpeed_Evoformer_Epilogue_Rescale — Epilogue stage for online softmax rescaling
Implementation:Deepspeedai_DeepSpeed_Evoformer_LogSumExp — Numerically stable log-sum-exp computation
Implementation:Deepspeedai_DeepSpeed_Evoformer_GEMM_Utils — CUTLASS GEMM configuration and utility helpers
Implementation:Deepspeedai_DeepSpeed_Evoformer_Warp_Iterator_Smem — Warp-level shared memory iterator for MMA operands
Implementation:Deepspeedai_DeepSpeed_Evoformer_Bias_Broadcast — Pair bias broadcasting across batch and head dimensions
