Implementation: deepspeedai/DeepSpeed XPU Adam (SYCL)
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep Learning, XPU Computing, SYCL |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
SYCL-based Adam optimizer kernel for Intel XPU devices with multi-tensor processing capabilities.
Description
This file implements the Adam optimizer in SYCL (compiled with Intel's oneAPI DPC++ toolchain) for Intel XPU acceleration, adapted from NVIDIA's apex fused Adam implementation. Its core is AdamFunctor, a device-callable functor that processes parameters in vectorized chunks with an instruction-level-parallelism factor of 4 (ILP=4) to increase throughput. The implementation supports both classic Adam (L2 regularization, mode 0) and AdamW (decoupled weight decay, mode 1), with optional bias correction and gradient scaling for mixed-precision training. It integrates with the multi-tensor apply framework so that updates for many tensors are batched into a small number of kernel launches.
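The per-element arithmetic the functor performs can be summarized in a pure-Python reference. This is a sketch following the apex-style formulation the file is adapted from; `adam_step` is an illustrative name, not part of the DeepSpeed API:

```python
import math

def adam_step(param, grad, m, v, *, lr, beta1, beta2, eps,
              step, mode, weight_decay, bias_correction=True):
    """Reference update for one element.
    mode 0: classic Adam, L2 penalty folded into the gradient.
    mode 1: AdamW, decoupled weight decay applied to the parameter."""
    bc1 = 1.0 - beta1 ** step if bias_correction else 1.0
    bc2 = 1.0 - beta2 ** step if bias_correction else 1.0
    if mode == 0:                                 # L2 regularization
        grad = grad + weight_decay * param
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment EMA
    update = (m / bc1) / (math.sqrt(v / bc2) + eps)
    if mode == 1:                                 # decoupled weight decay
        update += weight_decay * param
    return param - lr * update, m, v
```

The SYCL kernel applies exactly this recurrence, but to ILP=4 elements per work-item across chunked tensor regions.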
Usage
Use this optimizer when training neural networks on Intel XPU devices, particularly when many parameter tensors must be updated each step: fusing their updates into batched kernel launches improves device utilization compared with one launch per tensor.
Code Reference
Source Location
- Repository: DeepSpeed
- File: csrc/xpu/adam/multi_tensor_adam.dp.cpp
Signature
typedef enum : int {
    ADAM_MODE_0 = 0,  // L2 regularization mode (Adam)
    ADAM_MODE_1 = 1   // Decoupled weight decay mode (AdamW)
} adamMode_t;

template <typename T>
struct AdamFunctor {
    void operator()(int chunk_size,
                    volatile int* noop_gmem,
                    TensorListMetadata<4>& tl,
                    const float beta1,
                    const float beta2,
                    const float beta1_correction,
                    const float beta2_correction,
                    const float epsilon,
                    const float lr,
                    adamMode_t mode,
                    const float decay);
};

void multi_tensor_adam_cuda(int chunk_size,
                            at::Tensor noop_flag,
                            std::vector<std::vector<at::Tensor>> tensor_lists,
                            const float lr,
                            const float beta1,
                            const float beta2,
                            const float epsilon,
                            const int step,
                            const int mode,
                            const int bias_correction,
                            const float weight_decay);
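Note that the host entry point receives the raw `step` and a `bias_correction` flag, while the device functor receives precomputed `beta1_correction`/`beta2_correction`. A sketch of how the wrapper presumably derives them, following the apex convention (the helper name is illustrative):

```python
def bias_corrections(beta1, beta2, step, bias_correction):
    """With bias correction enabled, the functor divides the moment
    estimates by 1 - beta^t; otherwise it divides by 1.0 (a no-op)."""
    if bias_correction:
        return 1.0 - beta1 ** step, 1.0 - beta2 ** step
    return 1.0, 1.0
```

Computing the corrections once on the host keeps the per-element device code free of the `pow` call.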
Import
#include <ATen/ATen.h>
#include <sycl/sycl.hpp>
#include "multi_tensor_apply.dp.hpp"
#include "type_shim.h"
I/O Contract
multi_tensor_adam_cuda Parameters (the `_cuda` suffix is retained from the original CUDA implementation even though this kernel targets XPU)
| Parameter | Type | Description |
|---|---|---|
| chunk_size | int | Size of chunks for processing (typically 65536) |
| noop_flag | at::Tensor | Single-element device tensor; the kernel skips the update when it is non-zero (e.g. after a gradient-overflow check) |
| tensor_lists | std::vector<std::vector<at::Tensor>> | List of 4 tensor groups: [grads, params, m, v] |
| lr | float | Learning rate |
| beta1 | float | Exponential decay rate for first moment |
| beta2 | float | Exponential decay rate for second moment |
| epsilon | float | Small constant for numerical stability |
| step | int | Current training step number |
| mode | int | Optimizer mode (0: Adam, 1: AdamW) |
| bias_correction | int | Enable bias correction (0 or 1) |
| weight_decay | float | Weight decay coefficient |
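The `chunk_size` parameter controls how each tensor is partitioned for the multi-tensor apply launch: every (tensor, chunk) pair becomes one work unit. A sketch of the partitioning (`chunk_ranges` is an illustrative helper, not part of the API):

```python
def chunk_ranges(numel, chunk_size):
    """Half-open ranges [start, end) covering a tensor of `numel`
    elements; the final chunk may be shorter than chunk_size."""
    return [(start, min(start + chunk_size, numel))
            for start in range(0, numel, chunk_size)]
```

A larger chunk size means fewer work units per tensor but less opportunity to balance load across compute units; 65536 is a common default.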
Tensor List Structure
| Index | Tensor Group | Description |
|---|---|---|
| 0 | Gradients | List of gradient tensors |
| 1 | Parameters | List of parameter tensors (updated in-place) |
| 2 | First Moments | List of first moment estimate tensors |
| 3 | Second Moments | List of second moment estimate tensors |
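The order of the four groups is fixed and the groups must be aligned element-wise, since the kernel indexes them by position. A small helper that assembles the expected layout (illustrative, not part of the DeepSpeed API):

```python
def build_tensor_lists(grads, params, exp_avgs, exp_avg_sqs):
    """Assemble the 4-group layout expected by multi_tensor_adam_cuda.
    grads[i], params[i], exp_avgs[i], exp_avg_sqs[i] must all refer to
    the same parameter."""
    assert len(grads) == len(params) == len(exp_avgs) == len(exp_avg_sqs)
    return [list(grads), list(params), list(exp_avgs), list(exp_avg_sqs)]
```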
Supported Data Types
| Type | Description |
|---|---|
| float | FP32 (single precision) |
| c10::Half | FP16 (half precision) |
| c10::BFloat16 | BFloat16 (brain floating point) |
Usage Examples
The binding is loaded through DeepSpeed's op builder; the loader path below follows current DeepSpeed conventions but may differ between versions, and the extension function takes positional arguments:

import torch
from deepspeed.ops.op_builder import FusedAdamBuilder

# JIT-compile and load the fused Adam extension (SYCL kernel on XPU builds)
fused_adam = FusedAdamBuilder().load()

# Prepare multiple parameter groups on XPU
params_list = [torch.randn(1024, device='xpu') for _ in range(4)]
grads_list = [torch.randn_like(p) for p in params_list]
m_list = [torch.zeros_like(p) for p in params_list]
v_list = [torch.zeros_like(p) for p in params_list]

# Tensor lists in the order the kernel expects: [grads, params, m, v]
tensor_lists = [grads_list, params_list, m_list, v_list]

# Noop flag (non-zero skips the update, e.g. after gradient overflow)
noop_flag = torch.zeros(1, dtype=torch.int32, device='xpu')

# Execute one multi-tensor Adam step
fused_adam.multi_tensor_adam(
    65536,        # chunk_size
    noop_flag,
    tensor_lists,
    0.001,        # lr
    0.9,          # beta1
    0.999,        # beta2
    1e-8,         # epsilon
    1,            # step
    1,            # mode (1: AdamW)
    1,            # bias_correction
    0.01)         # weight_decay