Implementation:Deepspeedai DeepSpeed Evoformer Epilogue Tile Iterator
| Knowledge Sources | |
|---|---|
| Domains | Attention, CUTLASS_Kernels, DeepSpeed4Science |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A predicated tile iterator with prefetch support for efficiently loading and storing output tiles in epilogue operations with memory access coalescing.
Description
PredicatedTileIteratorPrefetch extends CUTLASS's standard predicated tile iterator with prefetching capabilities to improve memory bandwidth utilization during epilogue output operations. The iterator manages coordinate computation, bounds checking through predicates, and vectorized memory accesses according to a ThreadMap. It supports both load and store operations on output tensors in global memory, applying predicates to ensure out-of-bounds accesses are safely masked. The prefetch functionality allows issuing memory loads ahead of time to hide latency, particularly beneficial when the epilogue needs to read-modify-write existing output values. The iterator is parameterized by element type, thread mapping strategy, and optional scatter/gather patterns.
Usage
This iterator is used in the pipelined epilogue to read source tensors and write destination tensors during the final output stage of attention computation, particularly when accumulating results across multiple GEMM operations.
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/deepspeed4science/evoformer_attn/iterators/epilogue_predicated_tile_iterator.h
Signature
template <typename ThreadMap_, // Thread map (OutputTileThreadMap)
typename Element_, // Element data type
bool ScatterD = false, // Scatter D operand
bool UseCUDAStore = false> // Use CUDA store
class PredicatedTileIteratorPrefetch {
public:
using Shape = typename ThreadMap::Shape;
using Element = Element_;
using Layout = layout::RowMajor;
using TensorRef = TensorRef<Element, Layout>;
static int const kElementsPerAccess = ThreadMap::kElementsPerAccess;
static int const kIterations = ThreadMap::Count::kTile;
using Fragment = Array<Element, ThreadMap::Iterations::kColumn *
ThreadMap::Iterations::kRow *
ThreadMap::Iterations::kGroup *
ThreadMap::Iterations::kCluster *
ThreadMap::kElementsPerAccess>;
using AccessType = AlignedArray<Element, ThreadMap::kElementsPerAccess>;
struct Mask {
bool predicates[ThreadMap::Iterations::kColumn];
};
};
Import
#include "csrc/deepspeed4science/evoformer_attn/iterators/epilogue_predicated_tile_iterator.h"
I/O Contract
| Method | Input | Output | Description |
|---|---|---|---|
| Constructor | ThreadMap params, TensorRef, extent, thread_id | Iterator instance | Initializes iterator with tensor reference and bounds |
| load | Fragment& (output) | void | Loads tile from global memory into fragment with predication |
| store | Fragment (input) | void | Stores fragment to global memory with predication |
| operator++ | void | Iterator& | Advances iterator to next tile position |
| add_pointer_offset | LongIndex offset | void | Adds byte offset to internal pointer |
Usage Examples
// Configure output tile iterator with prefetch
using OutputIterator = cutlass::epilogue::threadblock::PredicatedTileIteratorPrefetch<
OutputTileThreadMap, // Thread mapping
cutlass::half_t, // FP16 output
false, // No scatter
false // Standard stores
>;
// Load existing output for accumulation
typename OutputIterator::Params params(output_layout);
OutputIterator source_iterator(params, output_ptr, extent, thread_id);
typename OutputIterator::Fragment source_fragment;
source_iterator.load(source_fragment); // Predicated load
// Apply epilogue operation and store
typename OutputIterator::Fragment dest_fragment = epilogue_op(accum, source_fragment);
source_iterator.store(dest_fragment); // Predicated store