Implementation: deepspeedai/DeepSpeed Evoformer Tile Iterator Atomic
| Knowledge Sources | |
|---|---|
| Domains | Attention, CUTLASS_Kernels, DeepSpeed4Science |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A predicated tile iterator that performs atomic accumulation operations for gradient aggregation in backward passes requiring concurrent updates from multiple threadblocks.
Description
PredicatedTileIteratorAtomic implements atomic-add epilogue stores, enabling safe concurrent accumulation when multiple threadblocks contribute to the same output locations. This is essential for gradient computation in attention backward passes, where bias gradients (dBias) are accumulated across multiple query/key positions. The iterator issues GPU atomic reduction instructions via inline PTX assembly (red.relaxed.global.add for float, red.relaxed.global.add.noftz.f16x2 for packed half precision); these reductions guarantee that each update is applied atomically while keeping overhead low. It supports both RowMajor and affine rank-N layouts for flexible tensor organizations, with PredicatedTileIteratorAffineRankNAtomic handling higher-dimensional indexing patterns.
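The per-type dispatch described above can be illustrated with a host-side sketch (not the DeepSpeed implementation; all names here are hypothetical). A primary template plus an enable_if specialization selects an atomic path per element type, and a relaxed compare-exchange loop on std::atomic stands in for the red.relaxed.global.add reduction, which has the same "add atomically, ordering not observed" semantics:

```cpp
#include <atomic>
#include <type_traits>

// Sketch of the dispatch pattern: primary template is undefined, so
// unsupported element types fail to compile; each specialization would
// emit the matching PTX reduction on the GPU.
template <class T, class Enable = void>
struct atomic_store_sketch;

// float path: std::atomic's relaxed CAS loop models the relaxed
// atomic-add reduction (atomicity guaranteed, no ordering observed).
template <class T>
struct atomic_store_sketch<T, std::enable_if_t<std::is_same_v<T, float>>> {
    atomic_store_sketch(T value, std::atomic<T>* ptr, bool pred_guard) {
        if (!pred_guard) return;  // predicate masks out-of-bounds lanes
        T old = ptr->load(std::memory_order_relaxed);
        while (!ptr->compare_exchange_weak(old, old + value,
                                           std::memory_order_relaxed)) {
            // 'old' is refreshed on failure; retry until the add lands.
        }
    }
};

// Demo: four "threadblocks" each contribute 1.5f to one dbias element;
// a fifth, predicated-off store must leave the value unchanged.
inline float accumulate_demo() {
    std::atomic<float> dbias{0.0f};
    for (int block = 0; block < 4; ++block)
        atomic_store_sketch<float>(1.5f, &dbias, /*pred_guard=*/true);
    atomic_store_sketch<float>(100.0f, &dbias, /*pred_guard=*/false);
    return dbias.load();
}
```

The guard check mirrors how the real iterator predicates stores at tensor boundaries: a masked lane performs no memory traffic at all.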
Usage
This iterator is used exclusively for gradient accumulation in the backward pass, particularly for computing bias gradients (dBias), where contributions from multiple attention blocks must be atomically summed to avoid race conditions.
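The race this avoids can be demonstrated on the host (an analogy, not DeepSpeed code): several workers stand in for threadblocks, each atomically adding its local contribution to one shared accumulator. With a plain `+=` the concurrent updates could interleave and drop contributions; with fetch_add every update is retained:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// num_blocks workers (modeling threadblocks) each perform per_block
// atomic increments on a shared dBias accumulator. The relaxed order
// is sufficient: only atomicity of each add matters for a reduction.
inline int accumulate_dbias(int num_blocks, int per_block) {
    std::atomic<int> dbias{0};
    std::vector<std::thread> blocks;
    for (int b = 0; b < num_blocks; ++b) {
        blocks.emplace_back([&dbias, per_block] {
            for (int i = 0; i < per_block; ++i)
                dbias.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& t : blocks) t.join();
    return dbias.load();  // every contribution retained
}
```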
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/deepspeed4science/evoformer_attn/iterators/predicated_tile_iterator_atomic.h
Signature
```cpp
// Atomic store helper
template <class AccessType, class Enable = void>
struct atomic_store {};

template <class AccessType>
struct atomic_store<AccessType, /* enable_if half_t */> {
    CUTLASS_DEVICE
    atomic_store(AccessType const& D, void* ptr, bool pred_guard);
};

// Atomic tile iterator for standard layouts
template <typename ThreadMap_,
          typename Element_,
          bool ScatterD = false,
          typename PermuteDLayout = layout::NoPermute>
class PredicatedTileIteratorAtomic {
public:
    using Fragment = Array<Element, kCount>;
    using AccessType = AlignedArray<Element, kElementsPerAccess>;

    CUTLASS_DEVICE void store(Fragment const& fragment);
};

// Atomic tile iterator for affine rank-N layouts
template <typename ThreadMap_, typename Element_, int Rank>
class PredicatedTileIteratorAffineRankNAtomic {
public:
    using Layout = layout::AffineRankN<Rank>;
    using Fragment = Array<Element, kCount>;

    CUTLASS_DEVICE void store(Fragment const& fragment);
};
```
Import
```cpp
#include "csrc/deepspeed4science/evoformer_attn/iterators/predicated_tile_iterator_atomic.h"
```
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| **Inputs** | | |
| fragment | Fragment | Tile data to atomically accumulate |
| ptr | Element* | Global memory pointer for atomic operations |
| pred_guard | bool | Predicate for bounds checking |
| **Outputs** | | |
| global memory | Element* | Atomically updated values |
| **Configuration** | | |
| Element | half_t / float | Supported atomic element types (FP16/FP32) |
| ScatterD | bool | Enable scatter store pattern |
Usage Examples
```cpp
// Configure the atomic iterator for bias-gradient accumulation
using BiasGradIterator = cutlass::epilogue::threadblock::PredicatedTileIteratorAtomic<
    OutputTileThreadMap,
    cutlass::half_t,              // FP16 gradients
    false,                        // no scatter
    cutlass::layout::NoPermute>;

// Each threadblock adds its contribution
BiasGradIterator iterator(params, dbias_ptr, extent, thread_id);

typename BiasGradIterator::Fragment bias_grad_fragment;
compute_bias_gradient(bias_grad_fragment);  // local computation

// Atomic store to global memory (uses atomic add operations)
iterator.store(bias_grad_fragment);

// Result: dbias[i] = sum over all blocks of local_dbias[i]
```
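The invariant stated in the final comment can be checked with a small host-side reduction (illustrative only; kElementsPerAccess and the fragment width are assumed values, not taken from the source). Each fragment models one threadblock's store through the atomic iterator, and summing them element-wise reproduces what the atomic adds compute on the GPU:

```cpp
#include <array>
#include <vector>

// Assumed fragment width for illustration.
constexpr int kElementsPerAccess = 4;
using FragmentSketch = std::array<float, kElementsPerAccess>;

// Element-wise sum over per-block fragments: dbias[i] ends up equal to
// the sum over all blocks of local_dbias[i], matching the atomic-add
// accumulation performed by iterator.store() on the device.
inline std::vector<float> reduce_fragments(
    const std::vector<FragmentSketch>& per_block) {
    std::vector<float> dbias(kElementsPerAccess, 0.0f);
    for (const FragmentSketch& frag : per_block)   // one store per block
        for (int i = 0; i < kElementsPerAccess; ++i)
            dbias[i] += frag[i];                   // atomic add on the GPU
    return dbias;
}
```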