Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Deepspeedai DeepSpeed Evoformer Epilogue Tile Iterator

From Leeroopedia


Knowledge Sources
Domains Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated 2026-02-09 00:00 GMT

Overview

A predicated tile iterator with prefetch support for efficiently loading and storing output tiles in epilogue operations with memory access coalescing.

Description

PredicatedTileIteratorPrefetch extends CUTLASS's standard predicated tile iterator with prefetching capabilities to improve memory bandwidth utilization during epilogue output operations. The iterator manages coordinate computation, bounds checking through predicates, and vectorized memory accesses according to a ThreadMap. It supports both load and store operations on output tensors in global memory, applying predicates to ensure out-of-bounds accesses are safely masked. The prefetch functionality allows issuing memory loads ahead of time to hide latency, particularly beneficial when the epilogue needs to read-modify-write existing output values. The iterator is parameterized by element type, thread mapping strategy, and optional scatter/gather patterns.

Usage

This iterator is used in the pipelined epilogue to read source tensors and write destination tensors during the final output stage of attention computation, particularly when accumulating results across multiple GEMM operations.

Code Reference

Source Location

Signature

template <typename ThreadMap_,       // Thread map (OutputTileThreadMap)
          typename Element_,         // Element data type
          bool ScatterD = false,     // Scatter D operand
          bool UseCUDAStore = false> // Use CUDA store
class PredicatedTileIteratorPrefetch {
public:
    using Shape = typename ThreadMap::Shape;
    using Element = Element_;
    using Layout = layout::RowMajor;
    using TensorRef = TensorRef<Element, Layout>;

    static int const kElementsPerAccess = ThreadMap::kElementsPerAccess;
    static int const kIterations = ThreadMap::Count::kTile;

    using Fragment = Array<Element, ThreadMap::Iterations::kColumn *
                                   ThreadMap::Iterations::kRow *
                                   ThreadMap::Iterations::kGroup *
                                   ThreadMap::Iterations::kCluster *
                                   ThreadMap::kElementsPerAccess>;

    using AccessType = AlignedArray<Element, ThreadMap::kElementsPerAccess>;

    struct Mask {
        bool predicates[ThreadMap::Iterations::kColumn];
    };
};

Import

#include "csrc/deepspeed4science/evoformer_attn/iterators/epilogue_predicated_tile_iterator.h"

I/O Contract

Method Input Output Description
Constructor ThreadMap params, TensorRef, extent, thread_id Iterator instance Initializes iterator with tensor reference and bounds
load Fragment& (output) void Loads tile from global memory into fragment with predication
store Fragment (input) void Stores fragment to global memory with predication
operator++ void Iterator& Advances iterator to next tile position
add_pointer_offset LongIndex offset void Adds byte offset to internal pointer

Usage Examples

// Configure output tile iterator with prefetch
using OutputIterator = cutlass::epilogue::threadblock::PredicatedTileIteratorPrefetch<
    OutputTileThreadMap,     // Thread mapping
    cutlass::half_t,          // FP16 output
    false,                    // No scatter
    false                     // Standard stores
>;

// Load existing output for accumulation
typename OutputIterator::Params params(output_layout);
OutputIterator source_iterator(params, output_ptr, extent, thread_id);

typename OutputIterator::Fragment source_fragment;
source_iterator.load(source_fragment);  // Predicated load

// Apply epilogue operation and store
typename OutputIterator::Fragment dest_fragment = epilogue_op(accum, source_fragment);
source_iterator.store(dest_fragment);   // Predicated store

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment