Implementation:Deepspeedai DeepSpeed Evoformer Tile Access Iterator Residual

Knowledge Sources	DeepSpeed Evoformer Attention
Domains	Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated	2026-02-09 00:00 GMT

Overview

A specialized tile access iterator that handles residual tiles when the problem size is not evenly divisible by the tile size, with optimized predication logic.

Description

PredicatedTileAccessIteratorResidualLast provides efficient access to matrix tiles with support for partial tiles at boundaries. When the matrix dimensions are not exact multiples of the tile dimensions, the last tiles in each dimension require special handling to avoid out-of-bounds accesses. This iterator extends CUTLASS's standard tile access patterns with refined predication that only checks bounds for the final tile iteration, reducing predication overhead for interior tiles. The "ResidualLast" strategy applies bounds checking only when necessary, improving performance for common cases where most tiles are complete. This is critical for attention mechanisms where sequence lengths vary and rarely align perfectly with hardware tile sizes.

Usage

This iterator is used in Evoformer attention kernels when loading Q, K, V tensors from global memory with arbitrary sequence lengths, ensuring correct behavior for partial tiles while maintaining high performance for full tiles.

Code Reference

Source Location

Repository: DeepSpeed
File: csrc/deepspeed4science/evoformer_attn/iterators/predicated_tile_access_iterator_residual_last.h

Signature

template <typename Shape_,             // Tile shape
          typename Element_,           // Element type
          typename Layout_,            // Memory layout
          int AdvanceRank,             // Rank to advance (0=column, 1=row)
          typename ThreadMap_,         // Thread map
          typename AccessType_,        // Vectorized access type
          bool Gather = false>         // Gather operation
class PredicatedTileAccessIteratorResidualLast {
public:
    using Shape = Shape_;
    using Element = Element_;
    using Layout = Layout_;
    using ThreadMap = ThreadMap_;
    using AccessType = AccessType_;

    using Index = typename Layout::Index;
    using LongIndex = typename Layout::LongIndex;
    using TensorCoord = typename Layout::TensorCoord;
    using TensorRef = TensorRef<Element, Layout>;

    using Mask = typename ThreadMap::Mask;
    using Fragment = Array<Element, ThreadMap::Iterations::kCount *
                                   ThreadMap::kElementsPerAccess>;
};

Import

#include "csrc/deepspeed4science/evoformer_attn/iterators/predicated_tile_access_iterator_residual_last.h"

I/O Contract

Parameter	Type	Description
Inputs
ptr	Element*	Pointer to tensor in global memory
extent	TensorCoord	Logical dimensions of tensor
thread_id	int	Thread index within threadblock
Outputs
Fragment	Array<Element, N>	Loaded tile data
Configuration
AdvanceRank	int	0 for column-major advance, 1 for row-major
residual_tile	bool	True when processing boundary tiles

Usage Examples

// Configure iterator for Q matrix with residual handling
using IteratorQ = cutlass::transform::threadblock::PredicatedTileAccessIteratorResidualLast<
    cutlass::MatrixShape<64, 32>,        // Tile shape
    cutlass::half_t,                      // FP16
    cutlass::layout::RowMajor,            // Row-major layout
    0,                                    // Advance along columns
    ThreadMap,                            // Thread mapping
    cutlass::AlignedArray<cutlass::half_t, 8>  // 128-bit loads
>;

// Instantiate with tensor reference
IteratorQ iterator(
    {layout, query_ptr},          // TensorRef
    {num_queries, head_dim},      // Extent
    thread_id,                    // Thread ID
    {query_offset, 0}             // Thread offset
);

// Load tile with automatic residual handling
typename IteratorQ::Fragment fragment;
iterator.load(fragment);  // Predicates applied only for residual tiles

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment