
Implementation:Deepspeedai DeepSpeed CPU Adam Impl

From Leeroopedia


Knowledge Sources
Domains Optimization, Deep Learning, CPU Computing
Last Updated 2026-02-09 00:00 GMT

Overview

CPU-optimized implementation of the Adam and AdamW optimizers with SIMD acceleration for efficient neural-network training on CPU hardware.

Description

This file provides the C++ implementation of the Adam optimizer with PyTorch bindings, featuring AVX2/AVX512 SIMD acceleration for high-performance training on CPU. It implements both Adam (with L2 regularization) and AdamW (decoupled weight decay) modes, supporting multiple precision types including FP32, FP16, and BFloat16. The implementation includes hierarchical step functions (Step_1, Step_4, Step_8) that progressively handle larger batches with SIMD operations, falling back to scalar operations for remaining elements. A unique rollback feature allows reverting optimizer steps for advanced training scenarios.
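The update rule that the SIMD kernels vectorize can be written as a scalar Python sketch. This is illustrative only, not the DeepSpeed kernel itself: the kernel fuses the bias-correction factors into the step size and denominator, and the exact placement of the decoupled weight-decay term may differ slightly from the ordering shown here.

```python
import math

def adam_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, adamw_mode=True,
              bias_correction=True):
    """One scalar Adam/AdamW update (illustrative sketch, not the SIMD kernel)."""
    if weight_decay > 0 and not adamw_mode:
        g = g + weight_decay * p              # Adam mode: L2 term folded into the gradient
    m = beta1 * m + (1 - beta1) * g           # first moment (exp_avg)
    v = beta2 * v + (1 - beta2) * g * g       # second moment (exp_avg_sq)
    if bias_correction:
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
    else:
        m_hat, v_hat = m, v
    if weight_decay > 0 and adamw_mode:
        p = p * (1 - lr * weight_decay)       # AdamW mode: decoupled weight decay
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

Note the only difference between the two modes: Adam adds `weight_decay * p` to the gradient before the moment updates, while AdamW shrinks the parameter directly and leaves the moments untouched.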

Usage

Use this optimizer when training neural networks on CPU-only systems or when CPU offloading is required for memory efficiency in large model training scenarios.

Code Reference

Source Location

Signature

int create_adam_optimizer(int optimizer_id,
                          float alpha,
                          float betta1,
                          float betta2,
                          float eps,
                          float weight_decay,
                          bool adamw_mode,
                          bool should_log);

int ds_adam_step(int optimizer_id,
                 size_t step,
                 float lr,
                 float beta1,
                 float beta2,
                 float epsilon,
                 float weight_decay,
                 bool bias_correction,
                 torch::Tensor& params,
                 torch::Tensor& grads,
                 torch::Tensor& exp_avg,
                 torch::Tensor& exp_avg_sq);

int ds_adam_rollback(int optimizer_id,
                     size_t step,
                     float lr,
                     float beta1,
                     float beta2,
                     float epsilon,
                     float weight_decay,
                     bool bias_correction,
                     torch::Tensor& params,
                     torch::Tensor& grads,
                     torch::Tensor& exp_avg,
                     torch::Tensor& exp_avg_sq);

int destroy_adam_optimizer(int optimizer_id);
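To make the rollback API concrete: a bias-corrected Adam update is algebraically invertible as long as the gradient used in the step is still available, because each moment update is an affine function of its previous value. The scalar sketch below (an illustration of the idea, not DeepSpeed's actual kernel, and omitting weight decay) pairs a forward step with its inverse.

```python
import math

def step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Forward Adam update (no weight decay, bias-corrected).
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    update = lr * (m / (1 - b1 ** t)) / (math.sqrt(v / (1 - b2 ** t)) + eps)
    return p - update, m, v

def rollback(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Invert the step above: first re-add the parameter update (computable
    # from the post-step moments m, v), then undo the moment updates.
    update = lr * (m / (1 - b1 ** t)) / (math.sqrt(v / (1 - b2 ** t)) + eps)
    p = p + update
    m = (m - (1 - b1) * g) / b1
    v = (v - (1 - b2) * g * g) / b2
    return p, m, v
```

A step followed by a rollback recovers the original parameter and moment values up to floating-point round-off.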

Import

#include "cpu_adam.h"

I/O Contract

create_adam_optimizer Parameters

| Parameter | Type | Description |
|---|---|---|
| optimizer_id | int | Unique identifier for the optimizer instance |
| alpha | float | Learning rate (default: 1e-3) |
| betta1 | float | Exponential decay rate for first moment (default: 0.9) |
| betta2 | float | Exponential decay rate for second moment (default: 0.999) |
| eps | float | Small constant for numerical stability (default: 1e-8) |
| weight_decay | float | Weight decay coefficient (default: 0) |
| adamw_mode | bool | Use AdamW (decoupled weight decay) if true, Adam if false |
| should_log | bool | Enable logging of optimizer creation |

ds_adam_step Parameters

| Parameter | Type | Description |
|---|---|---|
| optimizer_id | int | Optimizer instance identifier |
| step | size_t | Current training step number |
| lr | float | Current learning rate |
| beta1 | float | First moment decay rate |
| beta2 | float | Second moment decay rate |
| epsilon | float | Numerical stability constant |
| weight_decay | float | Weight decay coefficient |
| bias_correction | bool | Apply bias correction |
| params | torch::Tensor& | Model parameters (in/out) |
| grads | torch::Tensor& | Gradients (in) |
| exp_avg | torch::Tensor& | First moment estimates (in/out) |
| exp_avg_sq | torch::Tensor& | Second moment estimates (in/out) |

Returns

| Function | Return Type | Description |
|---|---|---|
| create_adam_optimizer | int | 0 on success |
| ds_adam_step | int | 0 on success |
| ds_adam_rollback | int | 0 on success, -1 on error |
| destroy_adam_optimizer | int | 0 on success |

Usage Examples

import torch
from deepspeed.ops.op_builder import CPUAdamBuilder

# Load (JIT-compiling if needed) the C++ extension exposing the bindings
ds_opt_adam = CPUAdamBuilder().load()

# Create optimizer instance; arguments follow the C++ signature:
# (optimizer_id, alpha, betta1, betta2, eps, weight_decay, adamw_mode, should_log)
optimizer_id = 0
ds_opt_adam.create_adam(optimizer_id, 0.001, 0.9, 0.999, 1e-8, 0.01, True, True)

# Prepare tensors
params = torch.randn(1000, dtype=torch.float32)
grads = torch.randn(1000, dtype=torch.float32)
exp_avg = torch.zeros(1000, dtype=torch.float32)
exp_avg_sq = torch.zeros(1000, dtype=torch.float32)

# Perform optimizer step; arguments follow the C++ signature:
# (optimizer_id, step, lr, beta1, beta2, epsilon, weight_decay, bias_correction, ...)
ds_opt_adam.adam_update(optimizer_id, 1, 0.001, 0.9, 0.999, 1e-8, 0.01, True,
                        params, grads, exp_avg, exp_avg_sq)

# Cleanup
ds_opt_adam.destroy_adam(optimizer_id)
