Implementation:Deepspeedai DeepSpeed Evoformer Epilogue Pipelined
| Knowledge Sources | |
|---|---|
| Domains | Attention, CUTLASS_Kernels, DeepSpeed4Science |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A pipelined epilogue implementation for CUTLASS threadblock-scoped GEMM operations that loads two source fragments simultaneously and supports reading from different data types.
Description
EpiloguePipelined is a custom CUTLASS epilogue that extends the standard epilogue with three key modifications: (1) pipelined loading of source fragments for improved memory bandwidth utilization, (2) support for reading source tensors with different data types than the output, and (3) passing row IDs to the OutputOp for row-aware operations like normalization. This component is derived from CUTLASS's epilogue infrastructure and uses template metaprogramming to configure tile shapes, MMA operators, and memory access patterns. The epilogue coordinates warp-level tiles, shared memory buffers, and global memory stores through iterator abstractions.
Usage
This epilogue is used in the Evoformer attention forward and backward passes when accumulator results need to be written to global memory with custom transformations, particularly for operations requiring per-row normalization like softmax rescaling with log-sum-exp statistics.
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/deepspeed4science/evoformer_attn/epilogue/epilogue_pipelined.h
Signature
template <typename Shape_, // Shape of threadblock tile
typename WarpMmaOperator_, // Warp-level MMA operator
int PartitionsK, // Number of K dimension partitions
typename OutputTileIterator_, // Tile iterator for writing outputs
typename AccumulatorFragmentIterator_, // Fragment iterator for accumulators
typename WarpTileIterator_, // Warp-scoped tile iterator to SMEM
typename SharedLoadIterator_, // Threadblock-scoped tile iterator from SMEM
typename OutputOp_, // Output operator
typename Padding_, // Padding for SMEM bank conflicts
int FragmentsPerPartition = 1,
int IterationsUnroll = (!IsEpilogueFunctorHeavy<OutputOp_>::value),
typename OutputTileSourceIterator_ = OutputTileIterator_>
class EpiloguePipelined : public EpilogueBase<Shape_,
typename WarpMmaOperator_::Shape,
PartitionsK,
AccumulatorFragmentIterator_,
WarpTileIterator_,
Padding_,
FragmentsPerPartition>;
Import
#include "csrc/deepspeed4science/evoformer_attn/epilogue/epilogue_pipelined.h"
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| Inputs | ||
| accumulators | AccumulatorTile | Warp-level accumulator fragments from GEMM |
| source_iterator | OutputTileSourceIterator | Iterator for loading existing output values |
| output_op | OutputOp | Functor applied to each output element |
| Outputs | ||
| destination | TensorRef | Global memory destination tensor |
Usage Examples
// Instantiate epilogue with row-aware normalization operator
using Epilogue = cutlass::epilogue::threadblock::EpiloguePipelined<
cutlass::gemm::GemmShape<64, 64, 32>, // Threadblock tile shape
WarpMma, // Warp MMA operator
1, // K partitions
OutputTileIterator, // Output iterator
AccumulatorFragmentIterator, // Accumulator iterator
WarpTileIterator, // Warp tile iterator
SharedLoadIterator, // Shared memory loader
MemoryEfficientAttentionNormalize<>, // Custom output op with row ID
cutlass::MatrixShape<1, 4> // Padding
>;
// Execute epilogue
epilogue(output_op, destination_iterator, accumulators, source_iterator);