Implementation: deepspeedai/DeepSpeed Evoformer Tile Iterator Atomic
| Knowledge Sources | |
|---|---|
| Domains | Attention, CUTLASS_Kernels, DeepSpeed4Science |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A predicated tile iterator that performs atomic accumulation operations for gradient aggregation in backward passes requiring concurrent updates from multiple threadblocks.
Description
PredicatedTileIteratorAtomic implements atomic-add epilogue stores, enabling safe concurrent accumulation when multiple threadblocks contribute to the same output locations. This is essential for gradient computation in attention backward passes, where bias gradients (dBias) are accumulated across multiple query/key positions. The iterator issues GPU atomic reduction instructions via inline PTX assembly (red.relaxed.global.add for float, red.relaxed.global.add.noftz.f16x2 for packed half precision); these reductions guarantee that each update is applied atomically while keeping overhead low. It supports both RowMajor and affine rank-N layouts for flexible tensor organizations, with PredicatedTileIteratorAffineRankNAtomic handling higher-dimensional indexing patterns.
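The per-type dispatch described above can be illustrated with a host-side sketch (not the DeepSpeed implementation; all names here are hypothetical). A primary template plus an enable_if specialization selects an atomic path per element type, and a relaxed compare-exchange loop on std::atomic stands in for the red.relaxed.global.add reduction, which has the same "add atomically, ordering not observed" semantics:

```cpp
#include <atomic>
#include <type_traits>

// Sketch of the dispatch pattern: primary template is undefined, so
// unsupported element types fail to compile; each specialization would
// emit the matching PTX reduction on the GPU.
template <class T, class Enable = void>
struct atomic_store_sketch;

// float path: std::atomic's relaxed CAS loop models the relaxed
// atomic-add reduction (atomicity guaranteed, no ordering observed).
template <class T>
struct atomic_store_sketch<T, std::enable_if_t<std::is_same_v<T, float>>> {
    atomic_store_sketch(T value, std::atomic<T>* ptr, bool pred_guard) {
        if (!pred_guard) return;  // predicate masks out-of-bounds lanes
        T old = ptr->load(std::memory_order_relaxed);
        while (!ptr->compare_exchange_weak(old, old + value,
                                           std::memory_order_relaxed)) {
            // 'old' is refreshed on failure; retry until the add lands.
        }
    }
};

// Demo: four "threadblocks" each contribute 1.5f to one dbias element;
// a fifth, predicated-off store must leave the value unchanged.
inline float accumulate_demo() {
    std::atomic<float> dbias{0.0f};
    for (int block = 0; block < 4; ++block)
        atomic_store_sketch<float>(1.5f, &dbias, /*pred_guard=*/true);
    atomic_store_sketch<float>(100.0f, &dbias, /*pred_guard=*/false);
    return dbias.load();
}
```

The guard check mirrors how the real iterator predicates stores at tensor boundaries: a masked lane performs no memory traffic at all.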
Usage
This iterator is used exclusively for gradient accumulation in the backward pass, particularly for computing bias gradients (dBias), where contributions from multiple attention blocks must be atomically summed to avoid race conditions.
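The race this avoids can be demonstrated on the host (an analogy, not DeepSpeed code): several workers stand in for threadblocks, each atomically adding its local contribution to one shared accumulator. With a plain `+=` the concurrent updates could interleave and drop contributions; with fetch_add every update is retained:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// num_blocks workers (modeling threadblocks) each perform per_block
// atomic increments on a shared dBias accumulator. The relaxed order
// is sufficient: only atomicity of each add matters for a reduction.
inline int accumulate_dbias(int num_blocks, int per_block) {
    std::atomic<int> dbias{0};
    std::vector<std::thread> blocks;
    for (int b = 0; b < num_blocks; ++b) {
        blocks.emplace_back([&dbias, per_block] {
            for (int i = 0; i < per_block; ++i)
                dbias.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& t : blocks) t.join();
    return dbias.load();  // every contribution retained
}
```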
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/deepspeed4science/evoformer_attn/iterators/predicated_tile_iterator_atomic.h
Signature
```cpp
// Atomic store helper
template <class AccessType, class Enable = void>
struct atomic_store {};

template <class AccessType>
struct atomic_store<AccessType, /* enable_if half_t */> {
    CUTLASS_DEVICE
    atomic_store(AccessType const& D, void* ptr, bool pred_guard);
};

// Atomic tile iterator for standard layouts
template <typename ThreadMap_,
          typename Element_,
          bool ScatterD = false,
          typename PermuteDLayout = layout::NoPermute>
class PredicatedTileIteratorAtomic {
public:
    using Fragment = Array<Element, kCount>;
    using AccessType = AlignedArray<Element, kElementsPerAccess>;

    CUTLASS_DEVICE void store(Fragment const& fragment);
};

// Atomic tile iterator for affine rank-N layouts
template <typename ThreadMap_, typename Element_, int Rank>
class PredicatedTileIteratorAffineRankNAtomic {
public:
    using Layout = layout::AffineRankN<Rank>;
    using Fragment = Array<Element, kCount>;

    CUTLASS_DEVICE void store(Fragment const& fragment);
};
```
Import
```cpp
#include "csrc/deepspeed4science/evoformer_attn/iterators/predicated_tile_iterator_atomic.h"
```
I/O Contract
| Parameter | Type | Description |
|---|---|---|
| **Inputs** | | |
| fragment | Fragment | Tile data to atomically accumulate |
| ptr | Element* | Global memory pointer for atomic operations |
| pred_guard | bool | Predicate for bounds checking |
| **Outputs** | | |
| global memory | Element* | Atomically updated values |
| **Configuration** | | |
| Element | half_t / float | Supported atomic element types (FP16/FP32) |
| ScatterD | bool | Enable scatter store pattern |
Usage Examples
```cpp
// Configure the atomic iterator for bias-gradient accumulation
using BiasGradIterator = cutlass::epilogue::threadblock::PredicatedTileIteratorAtomic<
    OutputTileThreadMap,
    cutlass::half_t,              // FP16 gradients
    false,                        // no scatter
    cutlass::layout::NoPermute>;

// Each threadblock adds its contribution
BiasGradIterator iterator(params, dbias_ptr, extent, thread_id);

typename BiasGradIterator::Fragment bias_grad_fragment;
compute_bias_gradient(bias_grad_fragment);  // local computation

// Atomic store to global memory (uses atomic add operations)
iterator.store(bias_grad_fragment);

// Result: dbias[i] = sum over all blocks of local_dbias[i]
```
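The invariant stated in the final comment can be checked with a small host-side reduction (illustrative only; kElementsPerAccess and the fragment width are assumed values, not taken from the source). Each fragment models one threadblock's store through the atomic iterator, and summing them element-wise reproduces what the atomic adds compute on the GPU:

```cpp
#include <array>
#include <vector>

// Assumed fragment width for illustration.
constexpr int kElementsPerAccess = 4;
using FragmentSketch = std::array<float, kElementsPerAccess>;

// Element-wise sum over per-block fragments: dbias[i] ends up equal to
// the sum over all blocks of local_dbias[i], matching the atomic-add
// accumulation performed by iterator.store() on the device.
inline std::vector<float> reduce_fragments(
    const std::vector<FragmentSketch>& per_block) {
    std::vector<float> dbias(kElementsPerAccess, 0.0f);
    for (const FragmentSketch& frag : per_block)   // one store per block
        for (int i = 0; i < kElementsPerAccess; ++i)
            dbias[i] += frag[i];                   // atomic add on the GPU
    return dbias;
}
```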