Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Deepspeedai DeepSpeed Evoformer Tile Access Iterator Residual

From Leeroopedia


Knowledge Sources
Domains Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated 2026-02-09 00:00 GMT

Overview

A specialized tile access iterator that handles residual tiles when the problem size is not evenly divisible by the tile size, with optimized predication logic.

Description

PredicatedTileAccessIteratorResidualLast provides efficient access to matrix tiles with support for partial tiles at boundaries. When the matrix dimensions are not exact multiples of the tile dimensions, the last tiles in each dimension require special handling to avoid out-of-bounds accesses. This iterator extends CUTLASS's standard tile access patterns with refined predication that only checks bounds for the final tile iteration, reducing predication overhead for interior tiles. The "ResidualLast" strategy applies bounds checking only when necessary, improving performance for common cases where most tiles are complete. This is critical for attention mechanisms where sequence lengths vary and rarely align perfectly with hardware tile sizes.

Usage

This iterator is used in Evoformer attention kernels when loading Q, K, V tensors from global memory with arbitrary sequence lengths, ensuring correct behavior for partial tiles while maintaining high performance for full tiles.

Code Reference

Source Location

Signature

template <typename Shape_,             // Tile shape
          typename Element_,           // Element type
          typename Layout_,            // Memory layout
          int AdvanceRank,             // Rank to advance (0=column, 1=row)
          typename ThreadMap_,         // Thread map
          typename AccessType_,        // Vectorized access type
          bool Gather = false>         // Gather operation
class PredicatedTileAccessIteratorResidualLast {
public:
    using Shape = Shape_;
    using Element = Element_;
    using Layout = Layout_;
    using ThreadMap = ThreadMap_;
    using AccessType = AccessType_;

    using Index = typename Layout::Index;
    using LongIndex = typename Layout::LongIndex;
    using TensorCoord = typename Layout::TensorCoord;
    using TensorRef = TensorRef<Element, Layout>;

    using Mask = typename ThreadMap::Mask;
    using Fragment = Array<Element, ThreadMap::Iterations::kCount *
                                   ThreadMap::kElementsPerAccess>;
};

Import

#include "csrc/deepspeed4science/evoformer_attn/iterators/predicated_tile_access_iterator_residual_last.h"

I/O Contract

Parameter Type Description
Inputs
ptr Element* Pointer to tensor in global memory
extent TensorCoord Logical dimensions of tensor
thread_id int Thread index within threadblock
Outputs
Fragment Array<Element, N> Loaded tile data
Configuration
AdvanceRank int 0 for column-major advance, 1 for row-major
residual_tile bool True when processing boundary tiles

Usage Examples

// Configure iterator for Q matrix with residual handling
using IteratorQ = cutlass::transform::threadblock::PredicatedTileAccessIteratorResidualLast<
    cutlass::MatrixShape<64, 32>,        // Tile shape
    cutlass::half_t,                      // FP16
    cutlass::layout::RowMajor,            // Row-major layout
    0,                                    // Advance along columns
    ThreadMap,                            // Thread mapping
    cutlass::AlignedArray<cutlass::half_t, 8>  // 128-bit loads
>;

// Instantiate with tensor reference
IteratorQ iterator(
    {layout, query_ptr},          // TensorRef
    {num_queries, head_dim},      // Extent
    thread_id,                    // Thread ID
    {query_offset, 0}             // Thread offset
);

// Load tile with automatic residual handling
typename IteratorQ::Fragment fragment;
iterator.load(fragment);  // Predicates applied only for residual tiles

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment