Implementation:Deepspeedai DeepSpeed Evoformer Epilogue Grad Bias
| Knowledge Sources | |
|---|---|
| Domains | Attention, CUTLASS_Kernels, DeepSpeed4Science |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Custom epilogue configurations for computing bias gradients with atomic accumulation in attention backward passes, supporting both standard and Volta tensor operations.
Description
BiasGradEpilogue provides template specializations that configure CUTLASS epilogues to use atomic tile iterators for safe gradient accumulation across multiple threadblocks. The structure wraps EpilogueTensorOp and EpilogueVoltaTensorOp with PredicatedTileIteratorAtomic replacements, enabling concurrent writes to bias gradients from different attention blocks. It supports both rank-2 tensors (standard matrix layouts) via BiasGradEpilogue and rank-N tensors (affine layouts) via BiasGradEpilogueAffineRankN. Architecture-specific specializations handle differences between Volta (SM70) and later architectures (SM75+) in terms of warp-level tile formats and fragment iterators. This abstraction allows the backward kernel to atomically accumulate bias gradients without explicit synchronization or workspace buffers.
Usage
This epilogue configuration is instantiated in the attention backward kernel when bias gradients need to be computed, routing GEMM results through atomic accumulation paths to safely aggregate contributions from multiple query/key blocks.
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/deepspeed4science/evoformer_attn/epilogue/epilogue_grad_bias.h
Signature
// Generic epilogue for bias gradients (SM75+)
template <typename Arch_,
typename Shape_,
typename WarpMmaTensorOp_,
int PartitionsK,
typename OutputOp_,
int ElementsPerAccess,
bool ScatterD = false,
typename PermuteDLayout = cutlass::layout::NoPermute>
struct BiasGradEpilogue {
using Epilogue = typename cutlass::epilogue::threadblock::EpilogueTensorOp<
Shape_, WarpMmaTensorOp_, PartitionsK, OutputOp_,
ElementsPerAccess, ScatterD, PermuteDLayout>::Epilogue;
};
// Volta specialization (SM70)
template <typename Shape_, /* ... */>
struct BiasGradEpilogue<cutlass::arch::Sm70, /* ... */> {
using Epilogue = typename cutlass::epilogue::threadblock::EpilogueVoltaTensorOp<
Shape_, WarpMmaTensorOp_, /* ... */>::Epilogue;
};
// Affine rank-N tensor support
template <typename Arch_, int Rank, /* ... */>
struct BiasGradEpilogueAffineRankN {
using Epilogue = typename cutlass::epilogue::threadblock::EpilogueTensorOpAffineRankN<
Rank, Shape_, /* ... */>::Epilogue;
};
Import
#include "csrc/deepspeed4science/evoformer_attn/epilogue/epilogue_grad_bias.h"
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| Template Parameters | ||
| Arch_ | cutlass::arch::Sm* | Target architecture (Sm70, Sm75, Sm80) |
| Shape_ | GemmShape | Threadblock tile shape |
| WarpMmaTensorOp_ | MmaTensorOp | Warp-level MMA operator |
| OutputOp_ | Functor | Output transformation (usually identity for gradients) |
| Resulting Type | ||
| Epilogue | EpilogueType | Configured epilogue with atomic iterator |
Usage Examples
// Configure bias gradient epilogue for Ampere
using BiasGradEpilogue = BiasGradEpilogue<
cutlass::arch::Sm80, // Ampere
cutlass::gemm::GemmShape<64, 64, 32>, // Tile shape
WarpMmaTensorOp, // Warp MMA
1, // Single K partition
cutlass::epilogue::thread::LinearCombination< // Identity operation
cutlass::half_t, 8, float, float>,
8 // Elements per access
>;
// Use in GEMM for bias gradient computation
using GemmBiasGrad = cutlass::gemm::device::GemmUniversal<
/* ... operand configs ... */,
typename BiasGradEpilogue::Epilogue // Atomic epilogue
>;
// Launch GEMM - bias gradients atomically accumulated
GemmBiasGrad gemm;
gemm(args, dbias_ptr, /* ... */); // Multiple blocks safely write to dbias_ptr