Implementation:Deepspeedai DeepSpeed Evoformer Epilogue Grad Bias

Knowledge Sources	DeepSpeed Evoformer Attention
Domains	Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated	2026-02-09 00:00 GMT

Overview

Custom epilogue configurations for computing bias gradients with atomic accumulation in attention backward passes, supporting both standard and Volta tensor operations.

Description

BiasGradEpilogue provides template specializations that configure CUTLASS epilogues to use atomic tile iterators for safe gradient accumulation across multiple threadblocks. The structure wraps EpilogueTensorOp and EpilogueVoltaTensorOp with PredicatedTileIteratorAtomic replacements, enabling concurrent writes to bias gradients from different attention blocks. It supports both rank-2 tensors (standard matrix layouts) via BiasGradEpilogue and rank-N tensors (affine layouts) via BiasGradEpilogueAffineRankN. Architecture-specific specializations handle differences between Volta (SM70) and later architectures (SM75+) in terms of warp-level tile formats and fragment iterators. This abstraction allows the backward kernel to atomically accumulate bias gradients without explicit synchronization or workspace buffers.

Usage

This epilogue configuration is instantiated in the attention backward kernel when bias gradients need to be computed, routing GEMM results through atomic accumulation paths to safely aggregate contributions from multiple query/key blocks.

Code Reference

Source Location

Repository: DeepSpeed
File: csrc/deepspeed4science/evoformer_attn/epilogue/epilogue_grad_bias.h

Signature

// Generic epilogue for bias gradients (SM75+)
template <typename Arch_,
          typename Shape_,
          typename WarpMmaTensorOp_,
          int PartitionsK,
          typename OutputOp_,
          int ElementsPerAccess,
          bool ScatterD = false,
          typename PermuteDLayout = cutlass::layout::NoPermute>
struct BiasGradEpilogue {
    using Epilogue = typename cutlass::epilogue::threadblock::EpilogueTensorOp<
        Shape_, WarpMmaTensorOp_, PartitionsK, OutputOp_,
        ElementsPerAccess, ScatterD, PermuteDLayout>::Epilogue;
};

// Volta specialization (SM70)
template <typename Shape_, /* ... */>
struct BiasGradEpilogue<cutlass::arch::Sm70, /* ... */> {
    using Epilogue = typename cutlass::epilogue::threadblock::EpilogueVoltaTensorOp<
        Shape_, WarpMmaTensorOp_, /* ... */>::Epilogue;
};

// Affine rank-N tensor support
template <typename Arch_, int Rank, /* ... */>
struct BiasGradEpilogueAffineRankN {
    using Epilogue = typename cutlass::epilogue::threadblock::EpilogueTensorOpAffineRankN<
        Rank, Shape_, /* ... */>::Epilogue;
};

Import

#include "csrc/deepspeed4science/evoformer_attn/epilogue/epilogue_grad_bias.h"

I/O Contract

Parameter	Type	Description
Template Parameters
Arch_	cutlass::arch::Sm*	Target architecture (Sm70, Sm75, Sm80)
Shape_	GemmShape	Threadblock tile shape
WarpMmaTensorOp_	MmaTensorOp	Warp-level MMA operator
OutputOp_	Functor	Output transformation (usually identity for gradients)
Resulting Type
Epilogue	EpilogueType	Configured epilogue with atomic iterator

Usage Examples

// Configure bias gradient epilogue for Ampere
using BiasGradEpilogue = BiasGradEpilogue<
    cutlass::arch::Sm80,                              // Ampere
    cutlass::gemm::GemmShape<64, 64, 32>,             // Tile shape
    WarpMmaTensorOp,                                  // Warp MMA
    1,                                                // Single K partition
    cutlass::epilogue::thread::LinearCombination<     // Identity operation
        cutlass::half_t, 8, float, float>,
    8                                                 // Elements per access
>;

// Use in GEMM for bias gradient computation
using GemmBiasGrad = cutlass::gemm::device::GemmUniversal<
    /* ... operand configs ... */,
    typename BiasGradEpilogue::Epilogue               // Atomic epilogue
>;

// Launch GEMM - bias gradients atomically accumulated
GemmBiasGrad gemm;
gemm(args, dbias_ptr, /* ... */);  // Multiple blocks safely write to dbias_ptr

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment