Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Deepspeedai DeepSpeed Evoformer Epilogue Grad Bias

From Leeroopedia


Knowledge Sources
Domains Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated 2026-02-09 00:00 GMT

Overview

Custom epilogue configurations for computing bias gradients with atomic accumulation in attention backward passes, supporting both standard and Volta tensor operations.

Description

BiasGradEpilogue provides template specializations that configure CUTLASS epilogues to use atomic tile iterators for safe gradient accumulation across multiple threadblocks. The structure wraps EpilogueTensorOp and EpilogueVoltaTensorOp with PredicatedTileIteratorAtomic replacements, enabling concurrent writes to bias gradients from different attention blocks. It supports both rank-2 tensors (standard matrix layouts) via BiasGradEpilogue and rank-N tensors (affine layouts) via BiasGradEpilogueAffineRankN. Architecture-specific specializations handle differences between Volta (SM70) and later architectures (SM75+) in terms of warp-level tile formats and fragment iterators. This abstraction allows the backward kernel to atomically accumulate bias gradients without explicit synchronization or workspace buffers.

Usage

This epilogue configuration is instantiated in the attention backward kernel when bias gradients need to be computed, routing GEMM results through atomic accumulation paths to safely aggregate contributions from multiple query/key blocks.

Code Reference

Source Location

Signature

// Generic epilogue for bias gradients (SM75+)
template <typename Arch_,
          typename Shape_,
          typename WarpMmaTensorOp_,
          int PartitionsK,
          typename OutputOp_,
          int ElementsPerAccess,
          bool ScatterD = false,
          typename PermuteDLayout = cutlass::layout::NoPermute>
struct BiasGradEpilogue {
    using Epilogue = typename cutlass::epilogue::threadblock::EpilogueTensorOp<
        Shape_, WarpMmaTensorOp_, PartitionsK, OutputOp_,
        ElementsPerAccess, ScatterD, PermuteDLayout>::Epilogue;
};

// Volta specialization (SM70)
template <typename Shape_, /* ... */>
struct BiasGradEpilogue<cutlass::arch::Sm70, /* ... */> {
    using Epilogue = typename cutlass::epilogue::threadblock::EpilogueVoltaTensorOp<
        Shape_, WarpMmaTensorOp_, /* ... */>::Epilogue;
};

// Affine rank-N tensor support
template <typename Arch_, int Rank, /* ... */>
struct BiasGradEpilogueAffineRankN {
    using Epilogue = typename cutlass::epilogue::threadblock::EpilogueTensorOpAffineRankN<
        Rank, Shape_, /* ... */>::Epilogue;
};

Import

#include "csrc/deepspeed4science/evoformer_attn/epilogue/epilogue_grad_bias.h"

I/O Contract

Parameter Type Description
Template Parameters
Arch_ cutlass::arch::Sm* Target architecture (Sm70, Sm75, Sm80)
Shape_ GemmShape Threadblock tile shape
WarpMmaTensorOp_ MmaTensorOp Warp-level MMA operator
OutputOp_ Functor Output transformation (usually identity for gradients)
Resulting Type
Epilogue EpilogueType Configured epilogue with atomic iterator

Usage Examples

// Configure bias gradient epilogue for Ampere
using BiasGradEpilogue = BiasGradEpilogue<
    cutlass::arch::Sm80,                              // Ampere
    cutlass::gemm::GemmShape<64, 64, 32>,             // Tile shape
    WarpMmaTensorOp,                                  // Warp MMA
    1,                                                // Single K partition
    cutlass::epilogue::thread::LinearCombination<     // Identity operation
        cutlass::half_t, 8, float, float>,
    8                                                 // Elements per access
>;

// Use in GEMM for bias gradient computation
using GemmBiasGrad = cutlass::gemm::device::GemmUniversal<
    /* ... operand configs ... */,
    typename BiasGradEpilogue::Epilogue               // Atomic epilogue
>;

// Launch GEMM - bias gradients atomically accumulated
GemmBiasGrad gemm;
gemm(args, dbias_ptr, /* ... */);  // Multiple blocks safely write to dbias_ptr

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment