Environment:Deepspeedai DeepSpeed CPU Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, CPU_Optimization, SIMD |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
CPU compute environment for DeepSpeed's SIMD-optimized operators, communication backends, and CPU offloading operations.
Description
This environment provides the CPU compute context required by DeepSpeed's native C++ CPU operators. These include SIMD-optimized implementations of the Adam, AdamW, Adagrad, and Lion optimizers, shared-memory (SHM) based allreduce for inter-process communication, and the OneCCL communication backend. The CPU environment requires a modern x86-64 processor with AVX2 or AVX-512 instruction set support, a C++14-compatible compiler for JIT compilation, and OpenMP for threading.
The SIMD abstraction layer (`csrc/includes/simd.h`) selects the widest available instruction set at compile time: AVX-512 when available, otherwise AVX2. On ARM architectures, NEON intrinsics are used instead.
Usage
Use this environment when running DeepSpeed CPU-offloaded training (ZeRO-Offload), CPU-based distributed training with OneCCL, or when using DeepSpeed's fused CPU optimizers for parameter updates during offloading.
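As a concrete starting point, a minimal ZeRO-Offload configuration might look like the following sketch; the batch size and learning rate are illustrative values, not recommendations. With `offload_optimizer` set to `cpu`, DeepSpeed routes parameter updates through its fused CPU optimizer ops.

```json
{
  "train_batch_size": 8,
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 1e-4 }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}
```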
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Primary platform; Windows has limited support |
| CPU | x86-64 with AVX2 or AVX-512 | AVX-512 preferred for best performance; ARM with NEON also supported |
| Compiler | GCC >= 7.0 or compatible C++14 compiler | Required for JIT compilation of CPU ops |
| Shared Memory | /dev/shm >= 512MB | Required for SHM-based allreduce; Docker needs `--shm-size` |
| Threading | OpenMP support | Used for parallel SIMD operations across CPU cores |
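The AVX2/AVX-512 requirement above can be verified before installing anything by reading the CPU feature flags from `/proc/cpuinfo`. A minimal sketch (Linux only; the function name is illustrative, not a DeepSpeed API):

```python
def cpu_simd_flags(cpuinfo_path="/proc/cpuinfo"):
    """Report whether the host CPU advertises AVX2 / AVX-512 support."""
    flags = set()
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                # On x86-64 Linux the feature list appears on the "flags" line.
                if line.startswith("flags"):
                    flags.update(line.split(":", 1)[1].split())
                    break
    except OSError:
        pass  # not Linux, or /proc is unavailable
    return {
        "avx2": "avx2" in flags,
        "avx512": "avx512f" in flags,  # avx512f = AVX-512 Foundation
    }
```

On ARM, `/proc/cpuinfo` uses a `Features` line instead, so both entries report `False`; NEON support is detected separately at compile time.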
Dependencies
System Packages
- `gcc` / `g++` (C++14 support)
- `libomp-dev` (OpenMP for threading)
- `ninja` (optional, for faster JIT compilation)
Python Packages
- `torch` (CPU build sufficient)
- `deepspeed`
Optional Packages
- `oneccl_bind_pt` (Intel OneCCL bindings for PyTorch) - for CCL backend
- `intel_extension_for_pytorch` (IPEX) - for enhanced Intel CPU performance
Environment Variables
The following environment variables affect CPU operations:
- `DS_ACCELERATOR=cpu`: Force CPU accelerator backend
- `OMP_NUM_THREADS`: Control OpenMP thread count for SIMD operations
- `CCL_WORKER_COUNT`: Number of OneCCL worker threads
- `KMP_AFFINITY`: Intel thread affinity settings
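These variables need to be set before `torch`/`deepspeed` are imported, or in the launcher's environment. A minimal sketch; using half the cores for OpenMP is an illustrative heuristic, not a DeepSpeed default:

```python
import multiprocessing
import os

# Force the CPU accelerator backend before deepspeed is imported.
os.environ["DS_ACCELERATOR"] = "cpu"
# Illustrative heuristic: give OpenMP half the logical cores.
os.environ["OMP_NUM_THREADS"] = str(max(1, multiprocessing.cpu_count() // 2))
# One OneCCL worker thread per process (only used with the CCL backend).
os.environ["CCL_WORKER_COUNT"] = "1"
```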
Quick Install
```shell
# Ensure compiler is available
sudo apt-get install gcc g++ libomp-dev

# Install DeepSpeed (CPU ops are JIT compiled)
pip install deepspeed

# Verify CPU op support
ds_report
```
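A programmatic alternative to `ds_report` is to ask the op builder whether the fused CPU Adam op can be compiled on this machine. A sketch assuming `deepspeed` (and `torch`) are importable:

```python
# CPUAdamBuilder.is_compatible() returns True when the local toolchain
# can JIT-compile the fused CPU Adam op.
try:
    from deepspeed.ops.op_builder import CPUAdamBuilder
    cpu_adam_ok = CPUAdamBuilder().is_compatible()
except ImportError:
    cpu_adam_ok = None  # deepspeed (or torch) not installed here
print(cpu_adam_ok)
```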
Code Evidence
SIMD abstraction from `csrc/includes/simd.h`:
```cpp
#if defined(__AVX512__)
#define SIMD_WIDTH 16
#define SIMD_LOAD(x) _mm512_load_ps(x)
#define SIMD_STORE(x, y) _mm512_store_ps(x, y)
#elif defined(__AVX256__)
#define SIMD_WIDTH 8
#define SIMD_LOAD(x) _mm256_load_ps(x)
#define SIMD_STORE(x, y) _mm256_store_ps(x, y)
#endif
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `cpu_adam not found` | CPU Adam op not compiled | Ensure gcc/g++ is installed; run `ds_report` to check |
| `AVX instruction not supported` | CPU lacks required SIMD instructions | Requires x86-64 with AVX2 minimum |
| `/dev/shm too small` | Insufficient shared memory for SHM allreduce | Use `--shm-size='1gb'` in Docker |
| `OneCCL not found` | oneccl_bind_pt not installed | `pip install oneccl_bind_pt` for Intel CCL support |
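For the `/dev/shm too small` case, the SHM capacity can be checked up front. A minimal sketch; the function name and the 512 MB floor come from the requirements table above, not from DeepSpeed itself:

```python
import os
import shutil

def shm_large_enough(min_bytes=512 * 1024 * 1024, path="/dev/shm"):
    """Return True if the tmpfs backing SHM allreduce meets the size floor."""
    if not os.path.isdir(path):
        return False  # non-Linux, or an unusual container setup
    # Total size of the tmpfs mount; in Docker this reflects --shm-size.
    return shutil.disk_usage(path).total >= min_bytes
```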
Related Pages
- Implementation:Deepspeedai_DeepSpeed_CPU_Adam_Impl
- Implementation:Deepspeedai_DeepSpeed_CPU_Adagrad
- Implementation:Deepspeedai_DeepSpeed_CPU_Adagrad_Header
- Implementation:Deepspeedai_DeepSpeed_CPU_Adam_Header
- Implementation:Deepspeedai_DeepSpeed_CPU_Lion_Header
- Implementation:Deepspeedai_DeepSpeed_CPU_Lion_Impl
- Implementation:Deepspeedai_DeepSpeed_SIMD_Abstraction
- Implementation:Deepspeedai_DeepSpeed_CCL_Backend
- Implementation:Deepspeedai_DeepSpeed_SHM_Allreduce
- Implementation:Deepspeedai_DeepSpeed_SHM_Interface