Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Deepspeedai DeepSpeed CPU Communication Backends

From Leeroopedia


Knowledge Sources
Domains Distributed_Communication, Shared_Memory, CPU_Computing
Last Updated 2026-02-09 00:00 GMT

Overview

High-performance CPU-based distributed communication primitives using shared memory and vendor-optimized collective libraries for intra-node and inter-node communication.

Description

CPU Communication Backends provide optimized distributed communication primitives for CPU-only training, CPU offloading scenarios, and hybrid CPU/GPU training where CPU-side reductions are needed. These backends complement GPU-side communication libraries (NCCL, RCCL) by handling cases where data resides in CPU memory.

The subsystem includes two primary communication strategies:

  • Shared Memory (SHM) Allreduce: A high-performance intra-node allreduce that bypasses the network stack entirely. Processes on the same node exchange data through shared memory segments, using SIMD-vectorized reduction kernels to sum gradient buffers in-place. This avoids the overhead of serialization, network protocol processing, and kernel-to-user space copies that socket-based communication would require.
  • CCL Backend: Integration with Intel's oneAPI Collective Communications Library (oneCCL), which provides optimized multi-node communication for Intel hardware. CCL supports AllReduce, AllGather, ReduceScatter, and other collectives, automatically selecting the best algorithm (ring, recursive doubling, tree) based on message size and topology.

Architecture-specific implementations maximize throughput on different CPU platforms:

  • x86-64 (AVX-512): Uses 512-bit vector instructions for the reduction kernel, processing 16 float32 values per instruction
  • RISC-V (RVV): Uses RISC-V Vector Extension intrinsics for the reduction kernel, supporting scalable vector lengths

Usage

For CPU-only training or CPU offloading, set the communication backend to "ccl" in the DeepSpeed configuration when using Intel hardware, or rely on the automatic shared-memory allreduce for intra-node gradient reduction. The SHM backend is activated automatically when processes on the same node need to perform CPU-side allreduce operations.

Theoretical Basis

Shared-memory allreduce exploits the fact that processes on the same physical node share access to the same physical memory through OS-managed shared memory segments. This eliminates all network stack overhead:

  • Socket/TCP allreduce: Application buffer -> kernel buffer -> TCP/IP stack -> loopback -> TCP/IP stack -> kernel buffer -> application buffer (multiple copies, protocol overhead)
  • SHM allreduce: Application buffer -> shared memory segment -> SIMD reduce -> application buffer (single copy, zero protocol overhead)

Bandwidth comparison:

  • Loopback TCP: ~10-20 GB/s (limited by kernel overhead)
  • Shared memory + SIMD: ~50-100 GB/s (limited by memory bandwidth)

Reduction algorithm: The SHM allreduce uses a flat reduce pattern where one process reads all peers' contributions from shared memory and performs element-wise reduction using SIMD:

// Abstract SHM allreduce pattern
void shm_allreduce(float* data, int count, int num_ranks) {
    // Phase 1: Each rank copies its data to its shared memory slot
    memcpy(shm_slots[my_rank], data, count * sizeof(float));
    barrier();  // Ensure all ranks have written

    // Phase 2: SIMD-vectorized reduction across all ranks
    #pragma omp parallel for
    for (int i = 0; i < count; i += SIMD_WIDTH) {
        auto sum = simd_load(shm_slots[0] + i);
        for (int r = 1; r < num_ranks; r++) {
            sum = simd_add(sum, simd_load(shm_slots[r] + i));
        }
        simd_store(data + i, sum);
    }
    barrier();  // Ensure reduction is complete before next use
}

CCL algorithm selection: oneCCL automatically selects the optimal collective algorithm based on message size and cluster topology. For small messages, recursive doubling or recursive halving minimizes latency. For large messages, ring-based algorithms maximize bandwidth utilization.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment