Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:FMInference FlexLLMGen DeepSpeed Reduction Utils

From Leeroopedia


Knowledge Sources
Domains CUDA, Parallel Computing, GPU Optimization
Last Updated 2026-02-09 12:00 GMT

Overview

A CUDA header-only library implementing composable warp-level and block-level parallel reduction primitives for Add, Max, and Min operations on float, __half, and __half2 data types.

Description

This file implements the reduce namespace containing GPU device-side reduction utilities organized in a layered architecture:

  • Element-level operations: Template-specialized element<ROpType>() functions implementing pairwise Add, Max, and Min for float, __half, and __half2 types. Uses CUDA architecture-specific intrinsics (__hmax, __hmin) on Ampere (sm_80+) and fallback comparisons on older architectures.
  • Initialization primitives: init<ROpType>() returns identity values for each operation (0 for Add, +INF for Min, -INF for Max) across float, __half, and __half2 types.
  • Warp reductions: _warp<Ops...>() performs butterfly reductions using shfl_xor intrinsics within a warp or warp partition, supporting 1 to 4 simultaneous independent reductions in a single pass.
  • Block reductions: _block<total_warps, Ops...>() extends warp reductions to full thread blocks using shared memory for inter-warp communication, with a two-phase approach: intra-warp reduction followed by cross-warp reduction.
  • Partitioned block reductions: partitioned_block<Ops..., num_threads>() enables sub-block reductions where warps are grouped into independent partitions, each performing its own coherent reduction.

A key design feature is the support for multi-operand reductions: up to 4 independent reduction operations (with potentially different operation types) can be performed simultaneously in a single pass through the data, amortizing synchronization overhead.

Usage

These reduction primitives are used internally by other DeepSpeed CUDA kernels, particularly the quantization utilities (for computing per-group min/max statistics) and softmax kernels (for computing row-wise max and sum).

Code Reference

Source Location

Signature

namespace reduce {

enum class ROpType { Add, Max, Min };

// Element-level reduction
template <ROpType Op, typename T>
DS_D_INLINE T element(const T lhs, const T rhs);

// Initialization
template <ROpType OType, typename T = float>
DS_D_INLINE T init();

// Block-level reduction (1-4 operands)
template <ROpType Op, int warp_bound = max_warps>
DS_D_INLINE void block(cg::thread_block& tb,
                        cg::thread_block_tile<hw_warp_size>& warp, float& val);

// Partitioned block reduction (sub-block scope)
template <ROpType Op, int num_threads>
DS_D_INLINE void partitioned_block(cg::thread_block& tb,
                                    cg::thread_block_tile<hw_warp_size>& warp, float& val);
}

Import

#include "reduction_utils.h"

I/O Contract

Inputs

Name Type Required Description
tb cg::thread_block Yes CUDA cooperative groups thread block handle.
warp cg::thread_block_tile<hw_warp_size> Yes CUDA cooperative groups warp tile handle.
val (val1, val2, ...) float& Yes Reference(s) to the value(s) to reduce. Modified in-place with the reduction result.
Op (template) ROpType Yes The reduction operation type: Add, Max, or Min.
num_threads (template) int Conditional Number of threads per partition (for partitioned_block). Must be a power of 2.

Outputs

Name Type Description
val (val1, val2, ...) float& The input reference(s) are modified in-place to contain the reduction result. After a block reduction, all threads in the block (or partition) hold the same reduced value.

Usage Examples

// Inside a CUDA kernel: perform simultaneous Max and Min block reductions
using rop = reduce::ROpType;

float max_val = reduce::init<rop::Max>();
float min_val = reduce::init<rop::Min>();

// Serial accumulation loop
for (int i = 0; i < num_elements; i++) {
    max_val = reduce::element<rop::Max>(max_val, data[i]);
    min_val = reduce::element<rop::Min>(min_val, data[i]);
}

// Parallel reduction across the thread block
cg::thread_block tb = cg::this_thread_block();
cg::thread_block_tile<hw_warp_size> warp = cg::tiled_partition<hw_warp_size>(tb);
reduce::block<rop::Max, rop::Min>(tb, warp, max_val, min_val);
// max_val and min_val now contain the block-wide max and min

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment