Principle:Deepspeedai DeepSpeed CPU Communication Backends
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Communication, Shared_Memory, CPU_Computing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
High-performance CPU-based distributed communication primitives using shared memory and vendor-optimized collective libraries for intra-node and inter-node communication.
Description
CPU Communication Backends provide optimized distributed communication primitives for CPU-only training, CPU offloading scenarios, and hybrid CPU/GPU training where CPU-side reductions are needed. These backends complement GPU-side communication libraries (NCCL, RCCL) by handling cases where data resides in CPU memory.
The subsystem includes two primary communication strategies:
- Shared Memory (SHM) Allreduce: A high-performance intra-node allreduce that bypasses the network stack entirely. Processes on the same node exchange data through shared memory segments, using SIMD-vectorized reduction kernels to sum gradient buffers in-place. This avoids the overhead of serialization, network protocol processing, and kernel-to-user space copies that socket-based communication would require.
- CCL Backend: Integration with Intel's oneAPI Collective Communications Library (oneCCL), which provides optimized multi-node communication for Intel hardware. CCL supports AllReduce, AllGather, ReduceScatter, and other collectives, automatically selecting the best algorithm (ring, recursive doubling, tree) based on message size and topology.
Architecture-specific implementations maximize throughput on different CPU platforms:
- x86-64 (AVX-512): Uses 512-bit vector instructions for the reduction kernel, processing 16 float32 values per instruction
- RISC-V (RVV): Uses RISC-V Vector Extension intrinsics for the reduction kernel, supporting scalable vector lengths
Usage
For CPU-only training or CPU offloading, set the communication backend to "ccl" in the DeepSpeed configuration when using Intel hardware, or rely on the automatic shared-memory allreduce for intra-node gradient reduction. The SHM backend is activated automatically when processes on the same node need to perform CPU-side allreduce operations.
Theoretical Basis
Shared-memory allreduce exploits the fact that processes on the same physical node share access to the same physical memory through OS-managed shared memory segments. This eliminates all network stack overhead:
- Socket/TCP allreduce: Application buffer -> kernel buffer -> TCP/IP stack -> loopback -> TCP/IP stack -> kernel buffer -> application buffer (multiple copies, protocol overhead)
- SHM allreduce: Application buffer -> shared memory segment -> SIMD reduce -> application buffer (single copy, zero protocol overhead)
Bandwidth comparison:
- Loopback TCP: ~10-20 GB/s (limited by kernel overhead)
- Shared memory + SIMD: ~50-100 GB/s (limited by memory bandwidth)
Reduction algorithm: The SHM allreduce uses a flat reduce pattern where one process reads all peers' contributions from shared memory and performs element-wise reduction using SIMD:
// Abstract SHM allreduce pattern
void shm_allreduce(float* data, int count, int num_ranks) {
// Phase 1: Each rank copies its data to its shared memory slot
memcpy(shm_slots[my_rank], data, count * sizeof(float));
barrier(); // Ensure all ranks have written
// Phase 2: SIMD-vectorized reduction across all ranks
#pragma omp parallel for
for (int i = 0; i < count; i += SIMD_WIDTH) {
auto sum = simd_load(shm_slots[0] + i);
for (int r = 1; r < num_ranks; r++) {
sum = simd_add(sum, simd_load(shm_slots[r] + i));
}
simd_store(data + i, sum);
}
barrier(); // Ensure reduction is complete before next use
}
CCL algorithm selection: oneCCL automatically selects the optimal collective algorithm based on message size and cluster topology. For small messages, recursive doubling or recursive halving minimizes latency. For large messages, ring-based algorithms maximize bandwidth utilization.
Related Pages
Implemented By
- Implementation:Deepspeedai_DeepSpeed_CCL_Backend — Intel oneCCL integration for multi-node CPU communication
- Implementation:Deepspeedai_DeepSpeed_SHM_Allreduce — SIMD-vectorized shared-memory allreduce for intra-node reduction
- Implementation:Deepspeedai_DeepSpeed_RISCV_SHM — RISC-V Vector Extension implementation of the SHM reduction kernel
- Implementation:Deepspeedai_DeepSpeed_SHM_Interface — Shared memory segment creation, attachment, and lifecycle management