Environment:Deepspeedai DeepSpeed CPU Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, CPU_Optimization, SIMD |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
CPU compute environment for DeepSpeed's SIMD-optimized operators, communication backends, and CPU offloading operations.
Description
This environment provides the CPU compute context required by DeepSpeed's native C++ CPU operators. These include SIMD-optimized implementations of the Adam, AdamW, Adagrad, and Lion optimizers, shared-memory (SHM) based allreduce for inter-process communication, and the OneCCL communication backend. The CPU environment requires a modern x86-64 processor with AVX2 or AVX-512 instruction set support, a C++14-compatible compiler for JIT compilation, and OpenMP for threading.
The SIMD abstraction layer (`csrc/includes/simd.h`) selects the widest available instruction set at compile time: AVX-512 when available, otherwise AVX2. On ARM architectures, NEON intrinsics are used instead.
Usage
Use this environment when running DeepSpeed CPU-offloaded training (ZeRO-Offload), CPU-based distributed training with OneCCL, or when using DeepSpeed's fused CPU optimizers for parameter updates during offloading.
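As a concrete starting point, a minimal ZeRO-Offload configuration might look like the following sketch; the batch size and learning rate are illustrative values, not recommendations. With `offload_optimizer` set to `cpu`, DeepSpeed routes parameter updates through its fused CPU optimizer ops.

```json
{
  "train_batch_size": 8,
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 1e-4 }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}
```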
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Primary platform; Windows has limited support |
| CPU | x86-64 with AVX2 or AVX-512 | AVX-512 preferred for best performance; ARM with NEON also supported |
| Compiler | GCC >= 7.0 or compatible C++14 compiler | Required for JIT compilation of CPU ops |
| Shared Memory | /dev/shm >= 512MB | Required for SHM-based allreduce; Docker needs `--shm-size` |
| Threading | OpenMP support | Used for parallel SIMD operations across CPU cores |
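The AVX2/AVX-512 requirement above can be verified before installing anything by reading the CPU feature flags from `/proc/cpuinfo`. A minimal sketch (Linux only; the function name is illustrative, not a DeepSpeed API):

```python
def cpu_simd_flags(cpuinfo_path="/proc/cpuinfo"):
    """Report whether the host CPU advertises AVX2 / AVX-512 support."""
    flags = set()
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                # On x86-64 Linux the feature list appears on the "flags" line.
                if line.startswith("flags"):
                    flags.update(line.split(":", 1)[1].split())
                    break
    except OSError:
        pass  # not Linux, or /proc is unavailable
    return {
        "avx2": "avx2" in flags,
        "avx512": "avx512f" in flags,  # avx512f = AVX-512 Foundation
    }
```

On ARM, `/proc/cpuinfo` uses a `Features` line instead, so both entries report `False`; NEON support is detected separately at compile time.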
Dependencies
System Packages
- `gcc` / `g++` (C++14 support)
- `libomp-dev` (OpenMP for threading)
- `ninja` (optional, for faster JIT compilation)
Python Packages
- `torch` (CPU build sufficient)
- `deepspeed`
Optional Packages
- `oneccl_bind_pt` (Intel OneCCL bindings for PyTorch) - for CCL backend
- `intel_extension_for_pytorch` (IPEX) - for enhanced Intel CPU performance
Environment Variables
The following environment variables affect CPU operations:
- `DS_ACCELERATOR=cpu`: Force CPU accelerator backend
- `OMP_NUM_THREADS`: Control OpenMP thread count for SIMD operations
- `CCL_WORKER_COUNT`: Number of OneCCL worker threads
- `KMP_AFFINITY`: Intel thread affinity settings
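These variables need to be set before `torch`/`deepspeed` are imported, or in the launcher's environment. A minimal sketch; using half the cores for OpenMP is an illustrative heuristic, not a DeepSpeed default:

```python
import multiprocessing
import os

# Force the CPU accelerator backend before deepspeed is imported.
os.environ["DS_ACCELERATOR"] = "cpu"
# Illustrative heuristic: give OpenMP half the logical cores.
os.environ["OMP_NUM_THREADS"] = str(max(1, multiprocessing.cpu_count() // 2))
# One OneCCL worker thread per process (only used with the CCL backend).
os.environ["CCL_WORKER_COUNT"] = "1"
```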
Quick Install
```shell
# Ensure compiler is available
sudo apt-get install gcc g++ libomp-dev

# Install DeepSpeed (CPU ops are JIT compiled)
pip install deepspeed

# Verify CPU op support
ds_report
```
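A programmatic alternative to `ds_report` is to ask the op builder whether the fused CPU Adam op can be compiled on this machine. A sketch assuming `deepspeed` (and `torch`) are importable:

```python
# CPUAdamBuilder.is_compatible() returns True when the local toolchain
# can JIT-compile the fused CPU Adam op.
try:
    from deepspeed.ops.op_builder import CPUAdamBuilder
    cpu_adam_ok = CPUAdamBuilder().is_compatible()
except ImportError:
    cpu_adam_ok = None  # deepspeed (or torch) not installed here
print(cpu_adam_ok)
```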
Code Evidence
SIMD abstraction from `csrc/includes/simd.h`:
```cpp
#if defined(__AVX512__)
#define SIMD_WIDTH 16
#define SIMD_LOAD(x) _mm512_load_ps(x)
#define SIMD_STORE(x, y) _mm512_store_ps(x, y)
#elif defined(__AVX256__)
#define SIMD_WIDTH 8
#define SIMD_LOAD(x) _mm256_load_ps(x)
#define SIMD_STORE(x, y) _mm256_store_ps(x, y)
#endif
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `cpu_adam not found` | CPU Adam op not compiled | Ensure gcc/g++ is installed; run `ds_report` to check |
| `AVX instruction not supported` | CPU lacks required SIMD instructions | Requires x86-64 with AVX2 minimum |
| `/dev/shm too small` | Insufficient shared memory for SHM allreduce | Use `--shm-size='1gb'` in Docker |
| `OneCCL not found` | oneccl_bind_pt not installed | `pip install oneccl_bind_pt` for Intel CCL support |
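For the `/dev/shm too small` case, the SHM capacity can be checked up front. A minimal sketch; the function name and the 512 MB floor come from the requirements table above, not from DeepSpeed itself:

```python
import os
import shutil

def shm_large_enough(min_bytes=512 * 1024 * 1024, path="/dev/shm"):
    """Return True if the tmpfs backing SHM allreduce meets the size floor."""
    if not os.path.isdir(path):
        return False  # non-Linux, or an unusual container setup
    # Total size of the tmpfs mount; in Docker this reflects --shm-size.
    return shutil.disk_usage(path).total >= min_bytes
```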
Related Pages
- Implementation:Deepspeedai_DeepSpeed_CPU_Adam_Impl
- Implementation:Deepspeedai_DeepSpeed_CPU_Adagrad
- Implementation:Deepspeedai_DeepSpeed_CPU_Adagrad_Header
- Implementation:Deepspeedai_DeepSpeed_CPU_Adam_Header
- Implementation:Deepspeedai_DeepSpeed_CPU_Lion_Header
- Implementation:Deepspeedai_DeepSpeed_CPU_Lion_Impl
- Implementation:Deepspeedai_DeepSpeed_SIMD_Abstraction
- Implementation:Deepspeedai_DeepSpeed_CCL_Backend
- Implementation:Deepspeedai_DeepSpeed_SHM_Allreduce
- Implementation:Deepspeedai_DeepSpeed_SHM_Interface