Implementation: deepspeedai/DeepSpeed CPU Lion Impl
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep Learning, CPU Computing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
CPU implementation of the Lion (EvoLved Sign Momentum) optimizer with SIMD acceleration for memory-efficient neural network training on CPU hardware.
Description
This file provides the C++ implementation of the Lion optimizer with PyTorch bindings, featuring AVX2/AVX512 SIMD acceleration. Lion uses sign-based momentum updates, computing the sign of an interpolation between the current gradient and momentum, then applying it with the learning rate. This approach requires only one momentum buffer (vs. two for Adam), reducing memory usage by ~50% while maintaining competitive performance. The implementation uses hierarchical step functions with SIMD operations and portable sign manipulation via std::copysignf for the scalar fallback path.
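The scalar math described above can be sketched in pure Python. This is a minimal reference of the Lion update rule as described here, not the SIMD code path; the decoupled placement of weight decay follows the standard Lion formulation and is an assumption about this file's exact ordering.

```python
import math

def lion_step(param, grad, exp_avg, lr, beta1, beta2, weight_decay):
    """One Lion update for a single scalar parameter (reference sketch)."""
    # Applied update: sign of an interpolation between momentum and gradient
    c = beta1 * exp_avg + (1.0 - beta1) * grad
    update = math.copysign(1.0, c) if c != 0.0 else 0.0
    # Decoupled weight decay, then the sign update scaled by the learning rate
    param = param * (1.0 - lr * weight_decay) - lr * update
    # Momentum buffer: an EMA of the gradient using the second coefficient
    exp_avg = beta2 * exp_avg + (1.0 - beta2) * grad
    return param, exp_avg
```

Note that only `exp_avg` is carried between steps; Adam would additionally carry a second-moment buffer, which is the source of Lion's memory saving.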
Usage
Use this optimizer when training neural networks on CPU systems where memory efficiency is important, or when seeking an alternative to Adam with simpler hyperparameter tuning.
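The memory saving can be made concrete with a quick back-of-the-envelope check. The sketch below assumes fp32 optimizer state and counts only the per-parameter state buffers (Adam keeps `exp_avg` and `exp_avg_sq`; Lion keeps `exp_avg` alone):

```python
N = 1_000_000_000  # example: 1B parameters
bytes_per_float32 = 4

adam_state = 2 * N * bytes_per_float32   # exp_avg + exp_avg_sq
lion_state = 1 * N * bytes_per_float32   # exp_avg only

print(f"Adam optimizer state: {adam_state / 2**30:.1f} GiB")
print(f"Lion optimizer state: {lion_state / 2**30:.1f} GiB")
print(f"Savings: {1 - lion_state / adam_state:.0%}")  # 50%
```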
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/lion/cpu_lion_impl.cpp
Signature
int create_lion_optimizer(int optimizer_id,
                          float alpha,
                          float betta1,
                          float betta2,
                          float weight_decay,
                          bool should_log);

int ds_lion_step(int optimizer_id,
                 size_t step,
                 float lr,
                 float beta1,
                 float beta2,
                 float weight_decay,
                 torch::Tensor& params,
                 torch::Tensor& grads,
                 torch::Tensor& exp_avg);

int destroy_lion_optimizer(int optimizer_id);
Import
#include "cpu_lion.h"
I/O Contract
create_lion_optimizer Parameters
| Parameter | Type | Description |
|---|---|---|
| optimizer_id | int | Unique identifier for the optimizer instance |
| alpha | float | Learning rate (default: 1e-3) |
| betta1 | float | Interpolation coefficient for update direction (default: 0.9) |
| betta2 | float | Momentum decay rate for EMA (default: 0.999) |
| weight_decay | float | Weight decay coefficient (default: 0) |
| should_log | bool | Enable logging of optimizer creation |
ds_lion_step Parameters
| Parameter | Type | Description |
|---|---|---|
| optimizer_id | int | Optimizer instance identifier |
| step | size_t | Current training step number |
| lr | float | Current learning rate |
| beta1 | float | Interpolation coefficient for update |
| beta2 | float | Momentum decay rate |
| weight_decay | float | Weight decay coefficient |
| params | torch::Tensor& | Model parameters (in/out) |
| grads | torch::Tensor& | Gradients (in) |
| exp_avg | torch::Tensor& | Momentum buffer (in/out) |
Returns
| Function | Return Type | Description |
|---|---|---|
| create_lion_optimizer | int | 0 on success |
| ds_lion_step | int | 0 on success |
| destroy_lion_optimizer | int | 0 on success |
Usage Examples
import torch
from deepspeed.ops.op_builder import CPULionBuilder

# The raw C++ bindings are loaded through DeepSpeed's op builder; the
# function names below follow the pybind registrations for this extension.
lion_ops = CPULionBuilder().load()

# Create a Lion optimizer instance
optimizer_id = 0
lion_ops.create_lion(optimizer_id,
                     0.001,   # alpha (learning rate)
                     0.9,     # betta1
                     0.999,   # betta2
                     0.01,    # weight_decay
                     True)    # should_log

# Prepare tensors (params, grads, and exp_avg must share shape and dtype)
params = torch.randn(1000, dtype=torch.float32)
grads = torch.randn(1000, dtype=torch.float32)
exp_avg = torch.zeros(1000, dtype=torch.float32)

# Perform one optimizer step
lion_ops.lion_update(optimizer_id, 1, 0.001, 0.9, 0.999, 0.01,
                     params, grads, exp_avg)

# Cleanup
lion_ops.destroy_lion(optimizer_id)
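As an end-to-end sanity check of the update rule itself, the same sign-based math can be run in pure Python on a toy quadratic. This exercises the Lion algorithm as described in this page, not the C++ binding; hyperparameters here are illustrative.

```python
import math

# Minimize f(x) = (x - 3)^2 with sign-based momentum updates.
x, m = 0.0, 0.0
lr, beta1, beta2 = 0.01, 0.9, 0.99
for _ in range(1000):
    g = 2.0 * (x - 3.0)                    # gradient of f at x
    c = beta1 * m + (1.0 - beta1) * g      # interpolation of momentum and gradient
    x -= lr * math.copysign(1.0, c) if c != 0.0 else 0.0
    m = beta2 * m + (1.0 - beta2) * g      # momentum EMA

print(x)  # settles into a small oscillation around the minimum at x = 3
```

Because every step has fixed magnitude `lr`, Lion does not converge to a point but oscillates around the minimum with an amplitude set by `lr` and the betas; in real training this is handled by learning-rate decay.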