
Implementation:Deepspeedai DeepSpeed Evoformer Tile Iterator Atomic

From Leeroopedia


Knowledge Sources
Domains Attention, CUTLASS_Kernels, DeepSpeed4Science
Last Updated 2026-02-09 00:00 GMT

Overview

A predicated tile iterator that performs atomic accumulation for gradient aggregation in backward passes where multiple threadblocks must concurrently update the same output locations.

Description

PredicatedTileIteratorAtomic implements atomic addition for epilogue stores, enabling safe concurrent accumulation when multiple threadblocks contribute to the same output locations. This is essential for gradient computation in attention backward passes, where bias gradients (dBias) accumulate contributions from multiple query/key positions. The iterator emits GPU atomic reduction instructions via inline PTX assembly (red.relaxed.global.add for float, red.relaxed.global.add.noftz.f16x2 for packed half precision) to ensure memory consistency while maintaining reasonable performance. It supports both RowMajor and affine rank-N layouts for flexible tensor organizations, with PredicatedTileIteratorAffineRankNAtomic handling higher-dimensional tensor indexing patterns.

Usage

This iterator is used exclusively for gradient accumulation in the backward pass, in particular for computing bias gradients, where contributions from multiple attention blocks must be atomically summed to avoid race conditions.

Code Reference

Source Location

Signature

// Atomic store helper
template <class AccessType, class Enable = void>
struct atomic_store {};

template <class AccessType>
struct atomic_store<AccessType, /* enable_if half_t */> {
    CUTLASS_DEVICE
    atomic_store(AccessType const& D, void* ptr, bool pred_guard);
};

// Atomic tile iterator for standard layouts
template <typename ThreadMap_,
          typename Element_,
          bool ScatterD = false,
          typename PermuteDLayout = layout::NoPermute>
class PredicatedTileIteratorAtomic {
    using Fragment = Array<Element, kCount>;
    using AccessType = AlignedArray<Element, kElementsPerAccess>;

    CUTLASS_DEVICE void store(Fragment const& fragment);
};

// Atomic tile iterator for affine rank-N layouts
template <typename ThreadMap_, typename Element_, int Rank>
class PredicatedTileIteratorAffineRankNAtomic {
    using Layout = layout::AffineRankN<Rank>;
    using Fragment = Array<Element, kCount>;

    CUTLASS_DEVICE void store(Fragment const& fragment);
};

Import

#include "csrc/deepspeed4science/evoformer_attn/iterators/predicated_tile_iterator_atomic.h"

I/O Contract

Parameter       Type            Description

Inputs
  fragment      Fragment        Tile data to atomically accumulate
  ptr           Element*        Global memory destination for atomic operations
  pred_guard    bool            Predicate guarding out-of-bounds accesses

Outputs
  global memory Element*        Atomically updated output values

Configuration
  Element       half_t / float  Supported atomic element types (FP16/FP32)
  ScatterD      bool            Enable scatter store pattern

Usage Examples

// Configure atomic iterator for bias gradient accumulation
using BiasGradIterator = cutlass::epilogue::threadblock::PredicatedTileIteratorAtomic<
    OutputTileThreadMap,
    cutlass::half_t,          // FP16 gradients
    false,                    // No scatter
    cutlass::layout::NoPermute
>;

// Accumulate bias gradients atomically
BiasGradIterator iterator(params, dbias_ptr, extent, thread_id);

// Each threadblock adds its contribution
typename BiasGradIterator::Fragment bias_grad_fragment;
compute_bias_gradient(bias_grad_fragment);  // Local computation

// Atomic store to global memory
iterator.store(bias_grad_fragment);  // Uses atomic add operations

// Result: dbias[i] = sum over all blocks of local_dbias[i]
