Principle:Deepspeedai DeepSpeed Evoformer Attention Kernels
| Knowledge Sources | |
|---|---|
| Domains | Scientific_Computing, Attention_Mechanisms, CUDA_Kernels |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Memory-efficient CUTLASS-based CUDA kernels implementing the attention mechanism for Evoformer architectures used in protein structure prediction models.
Description
Evoformer Attention Kernels implement a memory-efficient attention mechanism tailored to the Evoformer block used in AlphaFold2 and related protein structure prediction models. Unlike standard Transformer attention, which operates on a 1D token sequence, Evoformer attention operates on pair representations and adds bias terms that encode evolutionary and structural relationships.
Built on NVIDIA's CUTLASS template library, these kernels provide:
- Custom threadblock-scoped GEMM operations: Specialized matrix-multiply-accumulate (MMA) implementations that fuse the attention score computation (Q*K^T) and the attention-weighted value product (Attn*V) with softmax normalization in a single kernel launch, avoiding materialization of the full attention matrix
- Pipelined epilogue stages: Multi-stage epilogue pipelines that overlap memory writes with computation, supporting rescaling (for softmax normalization), gradient bias accumulation, and output formatting
- Specialized tile iterators: Custom iterators for accessing pair representation tensors with residual tile handling (for dimensions that are not multiples of the tile size), atomic gradient accumulation for backward pass parallelism, and shared-memory-backed warp-level iteration
- Bias broadcasting: Efficient broadcasting of pair bias tensors across the batch and head dimensions of the attention computation
- Log-sum-exp tracking: Numerically stable log-sum-exp computation for the softmax denominator, enabling memory-efficient backward pass without storing the full attention matrix
- Forward and backward kernels: Complete differentiable implementation with optimized forward and backward passes
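The bias broadcasting item above can be illustrated with zero strides: a dimension stored with extent 1 gets stride 0, so every index along it reads the same element. This is a minimal host-side sketch, not the kernel's actual code. The `BiasView` struct, the helper names, and the 5-D logical layout `[batch, n_seq, heads, len_q, len_kv]` are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical view of a bias tensor with logical dimensions
// [batch, n_seq, heads, len_q, len_kv] (layout assumed for illustration).
struct BiasView {
    std::array<int, 5> shape;           // stored shape; 1 marks a broadcast dimension
    std::array<std::size_t, 5> stride;  // element strides; 0 on broadcast dimensions
};

// Build contiguous row-major strides, forcing stride 0 where the extent is 1.
BiasView make_bias_view(const std::array<int, 5>& shape) {
    BiasView view{shape, {}};
    std::size_t running = 1;
    for (int dim = 4; dim >= 0; --dim) {
        assert(shape[dim] >= 1);
        view.stride[dim] = (shape[dim] == 1) ? 0 : running;
        running *= static_cast<std::size_t>(shape[dim]);
    }
    return view;
}

// Map a full attention index to a storage offset; broadcast dims contribute 0.
std::size_t bias_offset(const BiasView& view, const std::array<int, 5>& index) {
    std::size_t offset = 0;
    for (int dim = 0; dim < 5; ++dim)
        offset += view.stride[dim] * static_cast<std::size_t>(index[dim]);
    return offset;
}
```

With this scheme a pair bias stored once per head is reused for every value of the broadcast index without being copied.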
Usage
These kernels are invoked through DeepSpeed's Evoformer attention operator, typically when training or running inference on AlphaFold2-style models. The operator is JIT-compiled via the EvoformerAttnBuilder op builder and requires an NVIDIA GPU with compute capability 7.0 or higher (Volta or newer) and the CUTLASS headers.
Theoretical Basis
Evoformer attention extends standard multi-head attention with pair bias terms. For query Q, key K, value V, and pair bias B, the attention output is:
Attention(Q, K, V, B) = softmax(Q * K^T / sqrt(d_k) + B) * V
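As a point of comparison, the formula can be written as a naive CPU reference that builds each score row explicitly. This is a sketch for checking results on small inputs, not the CUTLASS kernel. The function name `reference_attention` and the flat row-major layout are assumptions made here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-head reference (not a DeepSpeed API).
// Row-major inputs: q [n_q x d], k [n_kv x d], v [n_kv x d_v], bias [n_q x n_kv].
// Returns softmax(q * k^T / sqrt(d) + bias) * v as [n_q x d_v].
std::vector<float> reference_attention(const std::vector<float>& q,
                                       const std::vector<float>& k,
                                       const std::vector<float>& v,
                                       const std::vector<float>& bias,
                                       int n_q, int n_kv, int d, int d_v) {
    assert(q.size() == static_cast<std::size_t>(n_q) * d);
    assert(k.size() == static_cast<std::size_t>(n_kv) * d);
    assert(v.size() == static_cast<std::size_t>(n_kv) * d_v);
    assert(bias.size() == static_cast<std::size_t>(n_q) * n_kv);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    std::vector<float> out(static_cast<std::size_t>(n_q) * d_v, 0.0f);
    std::vector<float> scores(n_kv);
    for (int i = 0; i < n_q; ++i) {
        // Scores for row i: s_j = (q_i . k_j) / sqrt(d) + bias_ij
        float row_max = -INFINITY;
        for (int j = 0; j < n_kv; ++j) {
            float dot = 0.0f;
            for (int c = 0; c < d; ++c) dot += q[i * d + c] * k[j * d + c];
            scores[j] = dot * scale + bias[i * n_kv + j];
            row_max = std::max(row_max, scores[j]);
        }
        // Softmax with the row maximum subtracted for numerical stability
        float denom = 0.0f;
        for (int j = 0; j < n_kv; ++j) {
            const float p = std::exp(scores[j] - row_max);
            denom += p;
            for (int c = 0; c < d_v; ++c) out[i * d_v + c] += p * v[j * d_v + c];
        }
        for (int c = 0; c < d_v; ++c) out[i * d_v + c] /= denom;
    }
    return out;
}
```

A large negative bias entry acts as a mask: its probability underflows to zero and the corresponding value row drops out of the result.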
Memory-efficient formulation: Standard attention materializes the full N x N attention matrix, requiring O(N^2) memory. The memory-efficient approach computes attention in tiles, maintaining only a running log-sum-exp for softmax normalization:
- Divide Q into blocks of size B_q and K, V into blocks of size B_kv
- For each Q block, iterate over K/V blocks, accumulating weighted values and updating the log-sum-exp
- Rescale the accumulated output using the final log-sum-exp
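This avoids storing the O(N^2) attention matrix: beyond the inputs and outputs, each threadblock holds only an O(B_q * B_kv) score tile at a time, and the kernel keeps O(N) log-sum-exp values for the backward pass. This enables longer sequences.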
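The tiled recurrence can be sketched on the CPU as follows. It processes one query row at a time (B_q = 1 for simplicity) and walks K/V in blocks of `block_kv`, keeping only a running maximum, a running sum, and the output accumulator. The names `TiledResult` and `tiled_attention` are hypothetical, and finite scores are assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct TiledResult {
    std::vector<float> out;  // [n_q x d_v] attention output
    std::vector<float> lse;  // [n_q] log-sum-exp of each score row
};

// Hypothetical sketch: row-major q [n_q x d], k [n_kv x d], v [n_kv x d_v],
// bias [n_q x n_kv]. Never holds more than block_kv scores at once.
TiledResult tiled_attention(const std::vector<float>& q, const std::vector<float>& k,
                            const std::vector<float>& v, const std::vector<float>& bias,
                            int n_q, int n_kv, int d, int d_v, int block_kv) {
    assert(block_kv > 0 && d > 0 && d_v > 0);
    TiledResult r{std::vector<float>(static_cast<std::size_t>(n_q) * d_v, 0.0f),
                  std::vector<float>(n_q, 0.0f)};
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    std::vector<float> s(block_kv);
    for (int i = 0; i < n_q; ++i) {
        float m = -INFINITY;  // running row maximum
        float l = 0.0f;       // running sum of exp(score - m)
        float* acc = &r.out[static_cast<std::size_t>(i) * d_v];
        for (int j0 = 0; j0 < n_kv; j0 += block_kv) {
            const int j1 = std::min(j0 + block_kv, n_kv);  // last block may be partial
            float tile_max = -INFINITY;
            for (int j = j0; j < j1; ++j) {
                float dot = 0.0f;
                for (int c = 0; c < d; ++c) dot += q[i * d + c] * k[j * d + c];
                s[j - j0] = dot * scale + bias[i * n_kv + j];
                tile_max = std::max(tile_max, s[j - j0]);
            }
            // Rescale earlier state to the new maximum (factor is 0 on the first block)
            const float m_new = std::max(m, tile_max);
            const float rescale = std::exp(m - m_new);
            l *= rescale;
            for (int c = 0; c < d_v; ++c) acc[c] *= rescale;
            // Add this block's unnormalized probabilities and weighted values
            for (int j = j0; j < j1; ++j) {
                const float p = std::exp(s[j - j0] - m_new);
                l += p;
                for (int c = 0; c < d_v; ++c) acc[c] += p * v[j * d_v + c];
            }
            m = m_new;
        }
        for (int c = 0; c < d_v; ++c) acc[c] /= l;  // final normalization
        r.lse[i] = m + std::log(l);                 // kept for the backward pass
    }
    return r;
}
```

Up to floating-point rounding, the result does not depend on `block_kv`, which is a convenient property to test against a naive implementation.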
CUTLASS tiling hierarchy:
- Threadblock level: Each threadblock computes a tile of the output (e.g., 64x64)
- Warp level: Within a threadblock, warps execute MMA instructions on sub-tiles (e.g., 16x16 using Tensor Cores)
- Thread level: Individual threads handle element-wise operations in the epilogue (rescaling, bias addition)
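The arithmetic behind this hierarchy is simple to sketch: the problem is covered by a ceiling-divided grid of threadblock tiles, the last tile in each dimension may be partial (the residual case), and each threadblock tile splits into MMA sub-tiles. The struct and function names below are hypothetical, and the 64x64 and 16x16 shapes are the example values from the list above rather than the kernels' actual configuration.

```cpp
#include <cassert>

struct TileGrid {
    int tiles_m, tiles_n;        // number of threadblock tiles along rows and columns
    int residual_m, residual_n;  // extent of the last tile in each dimension
};

// Cover a rows x cols output with tile_m x tile_n threadblock tiles.
TileGrid make_tile_grid(int rows, int cols, int tile_m, int tile_n) {
    assert(rows > 0 && cols > 0 && tile_m > 0 && tile_n > 0);
    TileGrid g;
    g.tiles_m = (rows + tile_m - 1) / tile_m;  // ceiling division
    g.tiles_n = (cols + tile_n - 1) / tile_n;
    g.residual_m = rows - (g.tiles_m - 1) * tile_m;
    g.residual_n = cols - (g.tiles_n - 1) * tile_n;
    return g;
}

// Number of sub_m x sub_n sub-tiles in one threadblock tile,
// assuming the sub-tile shape divides the tile shape evenly.
int mma_subtiles(int tile_m, int tile_n, int sub_m, int sub_n) {
    assert(sub_m > 0 && sub_n > 0);
    assert(tile_m % sub_m == 0 && tile_n % sub_n == 0);
    return (tile_m / sub_m) * (tile_n / sub_n);
}
```

For example, a 200x128 output with 64x64 tiles needs a 4x2 grid whose last row of tiles covers only 8 rows, which is the case the residual tile iterators handle.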
Pipelined execution: The multistage pipeline overlaps global memory loads (for the next tile) with shared memory computation (for the current tile), hiding memory latency:
// Abstract pipelined Evoformer attention pattern (pseudocode, one Q tile per threadblock).
// running_max and running_sum are tracked per query row; shown as scalars for brevity.
template <int STAGES>
void evoformer_attention_kernel(Q, K, V, Bias, Output) {
    // Prologue: fill the pipeline with up to STAGES K/V tiles
    for (int s = 0; s < min(STAGES, num_kv_tiles); s++)
        async_load_to_smem(/*slot=*/s, K_tile[s], V_tile[s]);

    float running_max = -INFINITY;
    float running_sum = 0;
    float accumulator[TILE_M][TILE_N] = {0};

    for (int tile = 0; tile < num_kv_tiles; tile++) {
        int slot = tile % STAGES;
        wait_for_smem(slot);

        // Compute attention scores: S = Q * K^T / sqrt(d_k) + Bias
        mma_accumulate(S, Q_tile, K_smem[slot]);
        scale(S, 1.0f / sqrt(d_k));
        add_bias(S, Bias_tile[tile]);

        // Online softmax update
        float tile_max = reduce_max(S);
        float new_max = max(running_max, tile_max);
        float rescale = exp(running_max - new_max);
        running_max = new_max;

        // Rescale previous state, then add this tile's contribution:
        // P = exp(S - running_max); accumulator += P * V
        scale(accumulator, rescale);
        running_sum = running_sum * rescale + reduce_sum(exp(S - running_max));
        softmax_and_accumulate(accumulator, S, V_smem[slot]);

        // Prefetch tile (tile + STAGES) into the slot that was just consumed
        if (tile + STAGES < num_kv_tiles)
            async_load_to_smem(slot, K_tile[tile + STAGES], V_tile[tile + STAGES]);
    }

    // Final normalization; running_max + log(running_sum) is the log-sum-exp kept for backward
    scale(accumulator, 1.0 / running_sum);
    store(Output, accumulator);
}
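The slot bookkeeping of a multistage pipeline is easy to get wrong, so it can help to simulate it on the host. In the sketch below, tile t always lives in slot t % stages, and once a tile has been consumed its slot is refilled with tile t + stages. `PipelineEvent` and `simulate_pipeline` are hypothetical names for this illustration, and the simulation models only the ordering of loads and uses, not asynchronous copies or synchronization.

```cpp
#include <cassert>
#include <vector>

struct PipelineEvent {
    bool is_load;  // true: tile loaded into slot; false: tile consumed from slot
    int tile;
    int slot;
};

// Record the order of loads and uses for num_tiles K/V tiles and a ring of
// `stages` shared-memory slots.
std::vector<PipelineEvent> simulate_pipeline(int num_tiles, int stages) {
    assert(num_tiles >= 0 && stages > 0);
    std::vector<PipelineEvent> events;
    std::vector<int> resident(stages, -1);  // tile currently held by each slot

    // Prologue: fill at most `stages` slots
    for (int s = 0; s < stages && s < num_tiles; ++s) {
        resident[s] = s;
        events.push_back({true, s, s});
    }
    for (int tile = 0; tile < num_tiles; ++tile) {
        const int slot = tile % stages;
        assert(resident[slot] == tile);  // the slot must hold the tile being used
        events.push_back({false, tile, slot});
        // Refill the slot just consumed with the tile `stages` steps ahead
        if (tile + stages < num_tiles) {
            resident[slot] = tile + stages;
            events.push_back({true, tile + stages, slot});
        }
    }
    return events;
}
```

The internal assertion fails if a slot is ever overwritten before its tile is consumed, which is the invariant a real kernel enforces with barriers.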
Backward pass: The backward kernel recomputes attention weights from Q, K, the bias, and the stored log-sum-exp (rather than storing the full attention matrix), then computes gradients for Q, K, V, and optionally the bias tensor. Atomic operations are used for bias gradient accumulation when multiple threadblocks contribute to the same bias element.
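The recomputation step relies on the identity P_ij = exp(S_ij - LSE_i), where S is the biased score matrix and LSE_i is the stored log-sum-exp of row i. A minimal per-row sketch follows, with hypothetical helper names and none of the kernel's tiling or atomics. Because the bias is added directly to the scores, the score gradient of a row is also that row's contribution to the bias gradient, summed over any broadcast dimensions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable log-sum-exp of one (non-empty) score row.
float row_log_sum_exp(const std::vector<float>& scores) {
    assert(!scores.empty());
    const float m = *std::max_element(scores.begin(), scores.end());
    float sum = 0.0f;
    for (float s : scores) sum += std::exp(s - m);
    return m + std::log(sum);
}

// Recompute attention probabilities from scores and the stored log-sum-exp.
std::vector<float> recompute_probs(const std::vector<float>& scores, float lse) {
    std::vector<float> p(scores.size());
    for (std::size_t j = 0; j < scores.size(); ++j) p[j] = std::exp(scores[j] - lse);
    return p;
}

// Softmax backward for one row: dS_j = P_j * (dP_j - sum_k P_k * dP_k).
std::vector<float> softmax_backward_row(const std::vector<float>& p,
                                        const std::vector<float>& d_p) {
    assert(p.size() == d_p.size());
    float dot = 0.0f;
    for (std::size_t j = 0; j < p.size(); ++j) dot += p[j] * d_p[j];
    std::vector<float> d_s(p.size());
    for (std::size_t j = 0; j < p.size(); ++j) d_s[j] = p[j] * (d_p[j] - dot);
    return d_s;
}
```

Each row of the score gradient sums to zero, a quick sanity check when comparing against a kernel's output.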
Related Pages
Implemented By
- Implementation:Deepspeedai_DeepSpeed_Evoformer_Epilogue_Pipelined — Multi-stage pipelined epilogue for overlapped output writes
- Implementation:Deepspeedai_DeepSpeed_Evoformer_MMA_Multistage — Multi-stage MMA pipeline for attention score computation
- Implementation:Deepspeedai_DeepSpeed_Evoformer_MMA_Pipelined — Two-stage pipelined MMA implementation
- Implementation:Deepspeedai_DeepSpeed_Evoformer_MMA_Accum_Lambda — Lambda-based accumulator for flexible MMA output processing
- Implementation:Deepspeedai_DeepSpeed_Evoformer_MMA_From_Smem — Shared-memory-sourced MMA operand loading
- Implementation:Deepspeedai_DeepSpeed_Evoformer_Epilogue_Tile_Iterator — Custom tile iterator for epilogue output traversal
- Implementation:Deepspeedai_DeepSpeed_Evoformer_Tile_Access_Iterator_Residual — Tile iterator with residual handling for non-aligned dimensions
- Implementation:Deepspeedai_DeepSpeed_Evoformer_Tile_Iterator_Atomic — Atomic tile iterator for concurrent gradient accumulation
- Implementation:Deepspeedai_DeepSpeed_Evoformer_Tile_Iterator_Residual — Residual-aware tile iterator for boundary tiles
- Implementation:Deepspeedai_DeepSpeed_Evoformer_Kernel_Backward — Complete backward pass kernel for attention gradients
- Implementation:Deepspeedai_DeepSpeed_Evoformer_Kernel_Forward — Complete forward pass kernel for attention computation
- Implementation:Deepspeedai_DeepSpeed_Evoformer_Epilogue_Grad_Bias — Epilogue stage for bias gradient computation
- Implementation:Deepspeedai_DeepSpeed_Evoformer_Epilogue_Rescale — Epilogue stage for online softmax rescaling
- Implementation:Deepspeedai_DeepSpeed_Evoformer_LogSumExp — Numerically stable log-sum-exp computation
- Implementation:Deepspeedai_DeepSpeed_Evoformer_GEMM_Utils — CUTLASS GEMM configuration and utility helpers
- Implementation:Deepspeedai_DeepSpeed_Evoformer_Warp_Iterator_Smem — Warp-level shared memory iterator for MMA operands
- Implementation:Deepspeedai_DeepSpeed_Evoformer_Bias_Broadcast — Pair bias broadcasting across batch and head dimensions