Principle:Deepspeedai DeepSpeed CPU SIMD Optimizers
| Knowledge Sources | |
|---|---|
| Domains | Optimization, SIMD_Computing, CPU_Offloading |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
High-throughput CPU and XPU optimizer kernels that use SIMD vectorization to accelerate parameter updates during ZeRO-Offload and ZeRO-Infinity training.
Description
CPU SIMD Optimizers provide hardware-optimized implementations of popular deep learning optimizers (Adam, AdamW, Adagrad, Lion, LAMB) that run on the CPU or Intel XPU rather than the GPU. These are essential for ZeRO-Offload (Stage 2) and ZeRO-Infinity (Stage 3), where optimizer states and parameter updates are offloaded to CPU memory to free GPU memory for larger models.
Naive CPU optimizer implementations are orders of magnitude slower than their GPU counterparts due to the lack of massive parallelism. DeepSpeed bridges this gap by:
- SIMD vectorization: Using AVX-256 and AVX-512 intrinsics to process 8 or 16 float32 values per instruction cycle
- Portable SIMD abstraction: A template-based abstraction layer that maps optimizer math to the widest available SIMD ISA at compile time
- Multi-threaded execution: OpenMP parallelization across CPU cores for large parameter tensors
- SYCL implementations: For Intel XPU offloading, SYCL kernels provide equivalent vectorized updates on Data Center GPU Max hardware
- Fused operations: Multi-tensor apply patterns that process multiple parameter groups in a single pass, reducing memory bandwidth overhead
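The fused multi-tensor pattern can be sketched as follows. The names here (`TensorChunk`, `multi_tensor_adam`) are illustrative, not DeepSpeed's internal API, and the inner loop is shown in scalar form where the real kernel would vectorize it:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative multi-tensor apply: instead of one kernel launch per parameter
// tensor, all tensors in a group are walked in a single pass, amortizing loop
// and dispatch overhead and improving memory-bandwidth utilization.
struct TensorChunk {
    float* p;        // parameters
    const float* g;  // gradients
    float* m;        // first moment (Adam state)
    float* v;        // second moment (Adam state)
    int size;
};

void multi_tensor_adam(std::vector<TensorChunk>& chunks, float lr,
                       float beta1, float beta2, float eps) {
    for (auto& c : chunks) {                // one pass over all parameter groups
        for (int i = 0; i < c.size; ++i) {  // scalar here; the real kernel
            c.m[i] = beta1 * c.m[i] + (1.0f - beta1) * c.g[i];      // uses SIMD
            c.v[i] = beta2 * c.v[i] + (1.0f - beta2) * c.g[i] * c.g[i];
            c.p[i] -= lr * c.m[i] / (std::sqrt(c.v[i]) + eps);
        }
    }
}
```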
The CPU optimizer kernels are JIT-compiled via DeepSpeed's OpBuilder system and exposed to Python through pybind11 bindings.
Usage
Enable CPU offloading in the DeepSpeed configuration by setting zero_optimization.offload_optimizer.device to "cpu" or "nvme". DeepSpeed will automatically route optimizer step() calls to the appropriate CPU SIMD kernel. For Intel XPU, ensure the XPU accelerator is active and the corresponding op builder is available.
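A minimal configuration fragment enabling optimizer offload might look like this (the keys follow the standard DeepSpeed JSON config; stage and pin_memory values are illustrative):

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```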
Theoretical Basis
SIMD (Single Instruction, Multiple Data) processing enables a single CPU instruction to operate on a vector of values simultaneously. For optimizer updates, where the same arithmetic is applied element-wise to millions of parameters, SIMD provides near-linear speedup proportional to the vector width:
- AVX-256: 8 float32 operations per instruction (256 bits / 32 bits)
- AVX-512: 16 float32 operations per instruction (512 bits / 32 bits)
Vectorized Adam update: for each parameter element i, the simplified Adam update (bias-correction terms omitted for brevity) computes:
- m_i = beta1 * m_i + (1 - beta1) * g_i
- v_i = beta2 * v_i + (1 - beta2) * g_i^2
- p_i = p_i - lr * m_i / (sqrt(v_i) + eps)
With AVX-512, all three steps process 16 elements per vector instruction, giving up to ~16x throughput over scalar code when the update is compute-bound; in practice memory bandwidth often caps the realized gain.
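As a concrete reference, the scalar form of these three steps can be written directly; a W-wide SIMD version must perform exactly this arithmetic on W elements per iteration:

```cpp
#include <cassert>
#include <cmath>

// Scalar Adam step over a flat array (bias correction omitted, matching the
// simplified equations above). Serves as a correctness reference for a
// vectorized implementation.
void adam_step_scalar(float* p, const float* g, float* m, float* v,
                      float lr, float beta1, float beta2, float eps, int n) {
    for (int i = 0; i < n; ++i) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];
        v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];
        p[i] -= lr * m[i] / (std::sqrt(v[i]) + eps);
    }
}
```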
SIMD abstraction layer: The portable abstraction maps operations to the best available ISA:
```cpp
// Abstract SIMD pattern for the Adam update. The simd_* helpers are
// placeholders that the abstraction layer maps to AVX-256 or AVX-512
// intrinsics at compile time; remainder handling for sizes not divisible
// by the vector width is omitted here.
template <typename SIMD_WIDTH>
void adam_step(float* params, float* grads, float* m, float* v,
               float lr, float beta1, float beta2, float eps, int size) {
#pragma omp parallel for
    for (int i = 0; i < size; i += SIMD_WIDTH::value) {
        auto g = simd_load(grads + i);
        auto m_val = simd_load(m + i);
        auto v_val = simd_load(v + i);
        auto p_val = simd_load(params + i);
        // m = beta1 * m + (1 - beta1) * g
        m_val = simd_fmadd(simd_set(beta1), m_val,
                           simd_mul(simd_set(1 - beta1), g));
        // v = beta2 * v + (1 - beta2) * g^2
        v_val = simd_fmadd(simd_set(beta2), v_val,
                           simd_mul(simd_set(1 - beta2), simd_mul(g, g)));
        // p = p - lr * m / (sqrt(v) + eps)
        p_val = simd_sub(p_val,
                         simd_mul(simd_set(lr),
                                  simd_div(m_val,
                                           simd_add(simd_sqrt(v_val),
                                                    simd_set(eps)))));
        simd_store(params + i, p_val);
        simd_store(m + i, m_val);
        simd_store(v + i, v_val);
    }
}
```
Throughput model: With N CPU cores and SIMD width W, the theoretical compute throughput is N * W * clock_frequency operations per second, capped in practice by memory bandwidth. CPU Adam achieves 40-80 GB/s effective bandwidth on modern server CPUs, sufficient to keep pace with GPU computation during ZeRO-Offload.
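Both bounds in this model can be made concrete with a back-of-envelope calculation. The byte count below assumes float32 state with one read each of p, g, m, v and one write each of p, m, v per element; the core count, clock, and bandwidth figures are illustrative, not measurements:

```cpp
#include <cassert>
#include <cmath>

// Compute-side bound: N cores * W SIMD lanes * clock frequency.
double compute_bound_ops(double cores, double lanes, double clock_hz) {
    return cores * lanes * clock_hz;
}

// Memory-side bound: the Adam update touches 4 float32 reads (p, g, m, v)
// and 3 float32 writes (p, m, v) per element = 28 bytes of traffic, so the
// achievable element rate is bandwidth / 28.
double bandwidth_bound_elems(double bytes_per_sec) {
    const double bytes_per_elem = 7 * 4;  // 7 float32 accesses per element
    return bytes_per_sec / bytes_per_elem;
}
```

For example, 32 cores with AVX-512 (16 lanes) at 2 GHz give a ~1.0e12 ops/s compute ceiling, while 60 GB/s of memory bandwidth limits Adam to roughly 2.1e9 elements/s, which is why the update is typically bandwidth-bound.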
Related Pages
Implemented By
- Implementation:Deepspeedai_DeepSpeed_CPU_Adam_Impl — AVX-vectorized Adam/AdamW optimizer kernel
- Implementation:Deepspeedai_DeepSpeed_CPU_Adagrad — AVX-vectorized Adagrad optimizer kernel
- Implementation:Deepspeedai_DeepSpeed_CPU_Adagrad_Header — Adagrad C++ header and class interface
- Implementation:Deepspeedai_DeepSpeed_CPU_Adam_Header — Adam/AdamW C++ header and class interface
- Implementation:Deepspeedai_DeepSpeed_CPU_Lion_Header — Lion optimizer C++ header and class interface
- Implementation:Deepspeedai_DeepSpeed_CPU_Lion_Impl — AVX-vectorized Lion optimizer kernel
- Implementation:Deepspeedai_DeepSpeed_Fused_LAMB_Frontend — Fused LAMB optimizer Python frontend
- Implementation:Deepspeedai_DeepSpeed_XPU_Adagrad — SYCL-based Adagrad for Intel XPU
- Implementation:Deepspeedai_DeepSpeed_XPU_Adam_SYCL — SYCL-based Adam for Intel XPU
- Implementation:Deepspeedai_DeepSpeed_XPU_Multi_Tensor_Apply — Multi-tensor apply kernel for XPU
- Implementation:Deepspeedai_DeepSpeed_XPU_Adam_Header — XPU Adam C++ header interface
- Implementation:Deepspeedai_DeepSpeed_SIMD_Abstraction — Portable SIMD abstraction layer for AVX-256/AVX-512