Principle:FMInference FlexLLMGen CUDA Parallel Reduction

Knowledge Sources	FMInference_FlexLLMGen
Domains	CUDA, Parallel Computing, GPU Architecture
Last Updated	2026-02-09 12:00 GMT

Overview

Parallel reduction on GPUs computes aggregate values (sum, max, min) across threads within warps and thread blocks using a hierarchical approach of warp shuffles and shared memory exchanges.

Description

Parallel reduction is a fundamental GPU computing pattern that combines values held by individual threads into a single aggregate result. On NVIDIA GPUs, this is implemented as a two-level hierarchy:

Warp-level reduction exploits the lockstep execution of 32 threads in a warp. Using the __shfl_xor_sync intrinsic (exposed as warp.shfl_xor() via cooperative groups), threads exchange register values directly without shared memory. A butterfly reduction pattern with log2(32) = 5 iterations produces the warp-wide result:

For i = 1, 2, 4, 8, 16: each thread XORs its lane index with i and combines the received value using the reduction operator.

Block-level reduction extends warp reduction to thread blocks containing multiple warps. The algorithm has three phases:

Each warp performs an intra-warp reduction.
Lane 0 of each warp writes its partial result to a shared memory buffer.
After a __syncthreads() barrier, warp 0 loads all partial results and performs a final warp reduction.
The result is broadcast back to all threads via shared memory.

Partitioned block reduction divides a thread block into independent reduction groups. When partition size is less than or equal to a warp, only a sub-warp shuffle reduction is needed. For larger partitions, a modified block reduction is used with offset calculations to isolate each partition's shared memory region.

Multi-operand reduction performs multiple independent reductions (potentially with different operators) in the same pass. For example, computing both max and min simultaneously saves one synchronization barrier and one shared memory round-trip compared to performing them sequentially.

Usage

Apply this principle when implementing any GPU kernel that needs to aggregate per-thread values into a shared result, such as computing softmax normalization constants, quantization range parameters, or loss function reductions.

Theoretical Basis

The butterfly reduction pattern using XOR-based lane exchanges ensures that after k iterations, all threads whose lane indices differ only in the lowest k bits have the same partial result. After log2(W) iterations (W = warp size), all threads hold the complete reduction.

Work complexity is O(N) for N input elements with O(log W) steps per warp and O(log B) steps across B warps, giving total time complexity O(N/P + log P) for P threads.

Shared memory banking: The block-level reduction stores one float per warp in shared memory. With 32 banks and up to 32 warps, each warp maps to a distinct bank, avoiding bank conflicts during the cross-warp phase.

Identity elements are critical for correctness: 0 for addition, +infinity for min, -infinity for max. Threads beyond the active count are initialized to identity values so they do not affect the reduction result.

Half-precision support uses architecture-conditional compilation: Ampere and newer GPUs (sm_80+) provide native __hmax and __hmin intrinsics, while older architectures fall back to comparison-based implementations.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment