Implementation:Deepspeedai DeepSpeed Evoformer Epilogue Pipelined

Knowledge Sources	DeepSpeed Evoformer Attention
Domains	Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated	2026-02-09 00:00 GMT

Overview

A pipelined epilogue implementation for CUTLASS threadblock-scoped GEMM operations that loads two source fragments simultaneously and supports reading from different data types.

Description

EpiloguePipelined is a custom CUTLASS epilogue that extends the standard epilogue with three key modifications: (1) pipelined loading of source fragments for improved memory bandwidth utilization, (2) support for reading source tensors with different data types than the output, and (3) passing row IDs to the OutputOp for row-aware operations like normalization. This component is derived from CUTLASS's epilogue infrastructure and uses template metaprogramming to configure tile shapes, MMA operators, and memory access patterns. The epilogue coordinates warp-level tiles, shared memory buffers, and global memory stores through iterator abstractions.

Usage

This epilogue is used in the Evoformer attention forward and backward passes when accumulator results need to be written to global memory with custom transformations, particularly for operations requiring per-row normalization like softmax rescaling with log-sum-exp statistics.

Code Reference

Source Location

Repository: DeepSpeed
File: csrc/deepspeed4science/evoformer_attn/epilogue/epilogue_pipelined.h

Signature

template <typename Shape_,               // Shape of threadblock tile
          typename WarpMmaOperator_,     // Warp-level MMA operator
          int PartitionsK,               // Number of K dimension partitions
          typename OutputTileIterator_,  // Tile iterator for writing outputs
          typename AccumulatorFragmentIterator_,  // Fragment iterator for accumulators
          typename WarpTileIterator_,    // Warp-scoped tile iterator to SMEM
          typename SharedLoadIterator_,  // Threadblock-scoped tile iterator from SMEM
          typename OutputOp_,            // Output operator
          typename Padding_,             // Padding for SMEM bank conflicts
          int FragmentsPerPartition = 1,
          int IterationsUnroll = (!IsEpilogueFunctorHeavy<OutputOp_>::value),
          typename OutputTileSourceIterator_ = OutputTileIterator_>
class EpiloguePipelined : public EpilogueBase<Shape_,
                                              typename WarpMmaOperator_::Shape,
                                              PartitionsK,
                                              AccumulatorFragmentIterator_,
                                              WarpTileIterator_,
                                              Padding_,
                                              FragmentsPerPartition>;

Import

#include "csrc/deepspeed4science/evoformer_attn/epilogue/epilogue_pipelined.h"

I/O Contract

Parameter	Type	Description
Inputs
accumulators	AccumulatorTile	Warp-level accumulator fragments from GEMM
source_iterator	OutputTileSourceIterator	Iterator for loading existing output values
output_op	OutputOp	Functor applied to each output element
Outputs
destination	TensorRef	Global memory destination tensor

Usage Examples

// Instantiate epilogue with row-aware normalization operator
using Epilogue = cutlass::epilogue::threadblock::EpiloguePipelined<
    cutlass::gemm::GemmShape<64, 64, 32>,  // Threadblock tile shape
    WarpMma,                                // Warp MMA operator
    1,                                      // K partitions
    OutputTileIterator,                     // Output iterator
    AccumulatorFragmentIterator,            // Accumulator iterator
    WarpTileIterator,                       // Warp tile iterator
    SharedLoadIterator,                     // Shared memory loader
    MemoryEfficientAttentionNormalize<>,    // Custom output op with row ID
    cutlass::MatrixShape<1, 4>              // Padding
>;

// Execute epilogue
epilogue(output_op, destination_iterator, accumulators, source_iterator);

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment