
Implementation:Deepspeedai DeepSpeed XPU Adam SYCL

From Leeroopedia


Knowledge Sources
Domains Optimization, Deep Learning, XPU Computing, SYCL
Last Updated 2026-02-09 00:00 GMT

Overview

SYCL-based Adam optimizer kernel for Intel XPU devices with multi-tensor processing capabilities.

Description

This file implements the Adam optimizer in SYCL (compiled with Intel's Data Parallel C++ toolchain) for Intel XPU acceleration, adapted from NVIDIA's apex fused Adam implementation. It centers on AdamFunctor, a device-callable operator that processes parameters in vectorized chunks (ILP = 4) for high throughput. The implementation supports both classic Adam (L2 regularization, mode 0) and AdamW (decoupled weight decay, mode 1), with bias correction and gradient scaling for mixed-precision training. It integrates with the multi-tensor apply framework so that updates across many tensor groups are batched into few kernel launches.
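The difference between the two modes can be sketched per element in plain Python (illustrative only; `adam_step` is a hypothetical name, and the real kernel applies the same arithmetic vectorized over chunks). In mode 0 the weight-decay term is folded into the gradient before the moment updates; in mode 1 it is added to the final update, decoupled from the moments:

```python
import math

def adam_step(p, g, m, v, *, lr, beta1, beta2, eps, decay, step, mode):
    """Single-element sketch of the kernel's update rule (pure Python)."""
    if mode == 0:          # ADAM_MODE_0: L2 regularization folded into grad
        g = g + decay * p
    m = beta1 * m + (1 - beta1) * g        # first moment
    v = beta2 * v + (1 - beta2) * g * g    # second moment
    m_hat = m / (1 - beta1 ** step)        # bias correction (bias_correction=1)
    v_hat = v / (1 - beta2 ** step)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if mode == 1:          # ADAM_MODE_1: decoupled weight decay (AdamW)
        update += decay * p
    return p - lr * update, m, v
```

With `decay = 0` the two modes coincide; with nonzero decay they produce different parameter trajectories, which is the practical distinction between Adam and AdamW.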

Usage

Use this optimizer when training neural networks on Intel XPU devices, particularly when many parameter tensors must be updated each step: batching them through the multi-tensor path maximizes device utilization.

Code Reference

Source Location

Signature

typedef enum : int {
    ADAM_MODE_0 = 0,  // L2 regularization mode
    ADAM_MODE_1 = 1   // Decoupled weight decay mode (AdamW)
} adamMode_t;

template <typename T>
struct AdamFunctor {
    void operator()(int chunk_size,
                    volatile int* noop_gmem,
                    TensorListMetadata<4>& tl,
                    const float beta1,
                    const float beta2,
                    const float beta1_correction,
                    const float beta2_correction,
                    const float epsilon,
                    const float lr,
                    adamMode_t mode,
                    const float decay);
};

void multi_tensor_adam_cuda(int chunk_size,
                            at::Tensor noop_flag,
                            std::vector<std::vector<at::Tensor>> tensor_lists,
                            const float lr,
                            const float beta1,
                            const float beta2,
                            const float epsilon,
                            const int step,
                            const int mode,
                            const int bias_correction,
                            const float weight_decay);
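Note that AdamFunctor receives precomputed correction factors while the public entry point takes `step` and `bias_correction`; the host-side conversion between the two can be sketched as follows (a hypothetical helper mirroring the factor names in the signature above, not the actual source):

```python
def bias_corrections(beta1, beta2, step, bias_correction=1):
    """Compute the beta1_correction / beta2_correction factors passed to
    AdamFunctor. With bias_correction=0 both factors degenerate to 1.0."""
    if bias_correction:
        return 1.0 - beta1 ** step, 1.0 - beta2 ** step
    return 1.0, 1.0
```

Early in training the factors are small (so the moment estimates are scaled up), and they approach 1.0 as `step` grows, making the correction a no-op asymptotically.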

Import

#include <ATen/ATen.h>
#include <sycl/sycl.hpp>
#include "multi_tensor_apply.dp.hpp"
#include "type_shim.h"

I/O Contract

multi_tensor_adam_cuda Parameters

Parameter Type Description
chunk_size int Size of chunks for processing (typically 65536)
noop_flag at::Tensor Flag to skip processing if non-zero
tensor_lists std::vector<std::vector<at::Tensor>> List of 4 tensor groups: [grads, params, m, v]
lr float Learning rate
beta1 float Exponential decay rate for first moment
beta2 float Exponential decay rate for second moment
epsilon float Small constant for numerical stability
step int Current training step number
mode int Optimizer mode (0: Adam, 1: AdamW)
bias_correction int Enable bias correction (0 or 1)
weight_decay float Weight decay coefficient
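The role of `chunk_size` can be illustrated with a small scheduling sketch (hypothetical helper; the real multi_tensor_apply framework packs this mapping into TensorListMetadata and dispatches one work-group per chunk):

```python
def plan_chunks(numels, chunk_size):
    """Split each tensor (given by element count) into fixed-size chunks,
    returning (tensor_index, offset, length) triples in dispatch order."""
    chunks = []
    for tensor_idx, n in enumerate(numels):
        for offset in range(0, n, chunk_size):
            chunks.append((tensor_idx, offset, min(chunk_size, n - offset)))
    return chunks
```

With the typical `chunk_size` of 65536, a 100,000-element tensor yields one full chunk plus a 34,464-element remainder, and small tensors become single short chunks, so work is balanced regardless of how unevenly sized the parameter tensors are.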

Tensor List Structure

Index Tensor Group Description
0 Gradients List of gradient tensors
1 Parameters List of parameter tensors (updated in-place)
2 First Moments List of first moment estimate tensors
3 Second Moments List of second moment estimate tensors
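Assembling the depth-4 `tensor_lists` argument in this exact order can be sketched with a small validator (hypothetical helper using plain Python lists in place of tensors; position i of every group must refer to the same parameter):

```python
def build_tensor_lists(grads, params, exp_avgs, exp_avg_sqs):
    """Pack the four tensor groups in the order the kernel expects:
    [grads, params, m, v]. All groups must have the same length."""
    groups = [grads, params, exp_avgs, exp_avg_sqs]
    if any(len(g) != len(params) for g in groups):
        raise ValueError("all four groups must hold the same number of tensors")
    return groups
```

Only groups 1 through 3 (params, m, v) are mutated in place by the kernel; the gradients at index 0 are read-only inputs.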

Supported Data Types

Type Description
float FP32 (single precision)
c10::Half FP16 (half precision)
c10::BFloat16 BFloat16 (brain floating point)

Usage Examples

import torch
import deepspeed

# Prepare multiple parameter groups on XPU
params_list = [torch.randn(1024, device='xpu') for _ in range(4)]
grads_list = [torch.randn_like(p) for p in params_list]
m_list = [torch.zeros_like(p) for p in params_list]
v_list = [torch.zeros_like(p) for p in params_list]

# Create tensor lists: [grads, params, m, v]
tensor_lists = [grads_list, params_list, m_list, v_list]

# Noop flag (usually zero)
noop_flag = torch.zeros(1, dtype=torch.int32, device='xpu')

# Execute multi-tensor Adam step (illustrative call path; in practice the
# compiled binding is typically obtained via DeepSpeed's op builder, and the
# exact module path may vary by DeepSpeed version)
deepspeed.ops.adam.multi_tensor_adam(
    chunk_size=65536,
    noop_flag=noop_flag,
    tensor_lists=tensor_lists,
    lr=0.001,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-8,
    step=1,
    mode=1,  # AdamW mode
    bias_correction=1,
    weight_decay=0.01
)
