Implementation:Deepspeedai DeepSpeed Evoformer Tile Access Iterator Residual
| Knowledge Sources | |
|---|---|
| Domains | Attention, CUTLASS_Kernels, DeepSpeed4Science |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A specialized tile access iterator that handles residual tiles when the problem size is not evenly divisible by the tile size, with optimized predication logic.
Description
PredicatedTileAccessIteratorResidualLast provides efficient access to matrix tiles with support for partial tiles at boundaries. When the matrix dimensions are not exact multiples of the tile dimensions, the last tiles in each dimension require special handling to avoid out-of-bounds accesses. This iterator extends CUTLASS's standard tile access patterns with refined predication that only checks bounds for the final tile iteration, reducing predication overhead for interior tiles. The "ResidualLast" strategy applies bounds checking only when necessary, improving performance for common cases where most tiles are complete. This is critical for attention mechanisms where sequence lengths vary and rarely align perfectly with hardware tile sizes.
Usage
This iterator is used in Evoformer attention kernels when loading Q, K, V tensors from global memory with arbitrary sequence lengths, ensuring correct behavior for partial tiles while maintaining high performance for full tiles.
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/deepspeed4science/evoformer_attn/iterators/predicated_tile_access_iterator_residual_last.h
Signature
template <typename Shape_, // Tile shape
typename Element_, // Element type
typename Layout_, // Memory layout
int AdvanceRank, // Rank to advance (0=column, 1=row)
typename ThreadMap_, // Thread map
typename AccessType_, // Vectorized access type
bool Gather = false> // Gather operation
class PredicatedTileAccessIteratorResidualLast {
public:
using Shape = Shape_;
using Element = Element_;
using Layout = Layout_;
using ThreadMap = ThreadMap_;
using AccessType = AccessType_;
using Index = typename Layout::Index;
using LongIndex = typename Layout::LongIndex;
using TensorCoord = typename Layout::TensorCoord;
using TensorRef = TensorRef<Element, Layout>;
using Mask = typename ThreadMap::Mask;
using Fragment = Array<Element, ThreadMap::Iterations::kCount *
ThreadMap::kElementsPerAccess>;
};
Import
#include "csrc/deepspeed4science/evoformer_attn/iterators/predicated_tile_access_iterator_residual_last.h"
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| Inputs | ||
| ptr | Element* | Pointer to tensor in global memory |
| extent | TensorCoord | Logical dimensions of tensor |
| thread_id | int | Thread index within threadblock |
| Outputs | ||
| Fragment | Array<Element, N> | Loaded tile data |
| Configuration | ||
| AdvanceRank | int | 0 for column-major advance, 1 for row-major |
| residual_tile | bool | True when processing boundary tiles |
Usage Examples
// Configure iterator for Q matrix with residual handling
using IteratorQ = cutlass::transform::threadblock::PredicatedTileAccessIteratorResidualLast<
cutlass::MatrixShape<64, 32>, // Tile shape
cutlass::half_t, // FP16
cutlass::layout::RowMajor, // Row-major layout
0, // Advance along columns
ThreadMap, // Thread mapping
cutlass::AlignedArray<cutlass::half_t, 8> // 128-bit loads
>;
// Instantiate with tensor reference
IteratorQ iterator(
{layout, query_ptr}, // TensorRef
{num_queries, head_dim}, // Extent
thread_id, // Thread ID
{query_offset, 0} // Thread offset
);
// Load tile with automatic residual handling
typename IteratorQ::Fragment fragment;
iterator.load(fragment); // Predicates applied only for residual tiles