Implementation:Deepspeedai DeepSpeed Evoformer Epilogue Tile Iterator

Knowledge Sources	DeepSpeed Evoformer Attention
Domains	Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated	2026-02-09 00:00 GMT

Overview

A predicated tile iterator with prefetch support for efficiently loading and storing output tiles in epilogue operations with memory access coalescing.

Description

PredicatedTileIteratorPrefetch extends CUTLASS's standard predicated tile iterator with prefetching capabilities to improve memory bandwidth utilization during epilogue output operations. The iterator manages coordinate computation, bounds checking through predicates, and vectorized memory accesses according to a ThreadMap. It supports both load and store operations on output tensors in global memory, applying predicates to ensure out-of-bounds accesses are safely masked. The prefetch functionality allows issuing memory loads ahead of time to hide latency, particularly beneficial when the epilogue needs to read-modify-write existing output values. The iterator is parameterized by element type, thread mapping strategy, and optional scatter/gather patterns.

Usage

This iterator is used in the pipelined epilogue to read source tensors and write destination tensors during the final output stage of attention computation, particularly when accumulating results across multiple GEMM operations.

Code Reference

Source Location

Repository: DeepSpeed
File: csrc/deepspeed4science/evoformer_attn/iterators/epilogue_predicated_tile_iterator.h

Signature

template <typename ThreadMap_,       // Thread map (OutputTileThreadMap)
          typename Element_,         // Element data type
          bool ScatterD = false,     // Scatter D operand
          bool UseCUDAStore = false> // Use CUDA store
class PredicatedTileIteratorPrefetch {
public:
    using Shape = typename ThreadMap::Shape;
    using Element = Element_;
    using Layout = layout::RowMajor;
    using TensorRef = TensorRef<Element, Layout>;

    static int const kElementsPerAccess = ThreadMap::kElementsPerAccess;
    static int const kIterations = ThreadMap::Count::kTile;

    using Fragment = Array<Element, ThreadMap::Iterations::kColumn *
                                   ThreadMap::Iterations::kRow *
                                   ThreadMap::Iterations::kGroup *
                                   ThreadMap::Iterations::kCluster *
                                   ThreadMap::kElementsPerAccess>;

    using AccessType = AlignedArray<Element, ThreadMap::kElementsPerAccess>;

    struct Mask {
        bool predicates[ThreadMap::Iterations::kColumn];
    };
};

Import

#include "csrc/deepspeed4science/evoformer_attn/iterators/epilogue_predicated_tile_iterator.h"

I/O Contract

Method	Input	Output	Description
Constructor	ThreadMap params, TensorRef, extent, thread_id	Iterator instance	Initializes iterator with tensor reference and bounds
load	Fragment& (output)	void	Loads tile from global memory into fragment with predication
store	Fragment (input)	void	Stores fragment to global memory with predication
operator++	void	Iterator&	Advances iterator to next tile position
add_pointer_offset	LongIndex offset	void	Adds byte offset to internal pointer

Usage Examples

// Configure output tile iterator with prefetch
using OutputIterator = cutlass::epilogue::threadblock::PredicatedTileIteratorPrefetch<
    OutputTileThreadMap,     // Thread mapping
    cutlass::half_t,          // FP16 output
    false,                    // No scatter
    false                     // Standard stores
>;

// Load existing output for accumulation
typename OutputIterator::Params params(output_layout);
OutputIterator source_iterator(params, output_ptr, extent, thread_id);

typename OutputIterator::Fragment source_fragment;
source_iterator.load(source_fragment);  // Predicated load

// Apply epilogue operation and store
typename OutputIterator::Fragment dest_fragment = epilogue_op(accum, source_fragment);
source_iterator.store(dest_fragment);   // Predicated store

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment