Implementation:Deepspeedai DeepSpeed XPU Accelerator

Knowledge Sources	DeepSpeed
Domains	Accelerator, Intel GPU Backend
Last Updated	2026-02-09 00:00 GMT

Overview

Intel XPU (GPU) accelerator backend enabling DeepSpeed training on Intel discrete and integrated GPUs.

Description

The XPU_Accelerator class implements the DeepSpeedAccelerator interface for Intel XPU (GPU) devices. It wraps the torch.xpu APIs, with optional intel_extension_for_pytorch (IPEX) integration. The communication backend is ccl when oneccl_bindings_for_pytorch is importable and xccl otherwise; the compile backend is inductor. The class carries several XPU-specific workarounds. use_host_timers returns True for IPEX versions below 2.6 because of XPU event issues. default_stream returns the current stream, since torch.xpu does not support CUDA-style default stream synchronization. pin_memory uses tensor.pin_memory when align_bytes=1 and, when align_bytes=0, allocates an aligned page-locked buffer through the handle returned by AsyncIOBuilder().load().aio_handle(...); the address ranges of these aligned tensors are recorded so that is_pinned checks can recognize them. Both FP16 and BF16 are reported as supported. Device visibility is controlled through the ZE_AFFINITY_MASK environment variable. build_extension returns DpcppBuildExtension, taken from IPEX or from torch.
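The aligned-tensor bookkeeping mentioned above can be illustrated with a minimal sketch. It assumes each tracked entry stores a start and end address and that a tensor counts as pinned when its address falls inside one of those ranges; the class name AlignedRangeTracker and its methods are hypothetical, and plain integers stand in for values that would come from tensor.data_ptr().

```python
# Hypothetical stand-in for the aligned-tensor bookkeeping; not DeepSpeed code.
# Addresses are plain integers here rather than values from tensor.data_ptr().
class AlignedRangeTracker:
    def __init__(self):
        # Each entry is [start_address, end_address] of one aligned buffer.
        self.aligned_tensors = []

    def track(self, begin, end):
        """Record the address range of a newly allocated aligned buffer."""
        self.aligned_tensors.append([begin, end])

    def is_pinned(self, address):
        """Return True if the address lies inside any tracked range (inclusive)."""
        return any(begin <= address <= end for begin, end in self.aligned_tensors)


tracker = AlignedRangeTracker()
tracker.track(0x1000, 0x1FFF)

inside = tracker.is_pinned(0x1800)   # address within the tracked buffer
outside = tracker.is_pinned(0x2000)  # address just past the tracked buffer
print(inside, outside)
```

A range check of this kind is needed because a buffer allocated outside the regular pinning path would not necessarily be reported as pinned by the tensor itself.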

Usage

Use when training on Intel GPU hardware (Arc, Data Center Max series). Requires torch.xpu support, optionally with intel_extension_for_pytorch. Set DS_ACCELERATOR=xpu to explicitly select this backend.
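As a rough sketch of how the two environment variables named on this page might be combined when launching a worker process, the helper below builds an environment that selects the XPU backend and restricts visible devices. The function name xpu_launch_env is hypothetical, and the comma-separated device list is an assumption about the ZE_AFFINITY_MASK format used in this example.

```python
import os


def xpu_launch_env(device_ids, base_env=None):
    """Return a copy of the environment that selects the XPU backend and
    limits visibility to the given device indices (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["DS_ACCELERATOR"] = "xpu"  # explicit backend selection
    # Assumed format: comma-separated device indices, e.g. "0,1".
    env["ZE_AFFINITY_MASK"] = ",".join(str(i) for i in device_ids)
    return env


env = xpu_launch_env([0, 1], base_env={})
print(env)
```

The returned dictionary could be passed as the env argument of subprocess.run so that the variables are in place before the child process imports torch or DeepSpeed.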

Code Reference

Source Location

Repository: DeepSpeed
File: accelerator/xpu_accelerator.py

Signature

class XPU_Accelerator(DeepSpeedAccelerator):
    def __init__(self):
        self._name = 'xpu'
        if oneccl_imported_p:
            self._communication_backend_name = 'ccl'
        else:
            self._communication_backend_name = 'xccl'
        self._compile_backend = "inductor"
        self.aligned_tensors = []
        self.class_dict = None

    def is_synchronized_device(self):
        return False

    def use_host_timers(self):
        # WA for XPU event issues in IPEX < 2.6
        if ipex.__version__ < '2.6':
            return True
        return self.is_synchronized_device()

    def default_stream(self, device_index=None):
        # torch.xpu doesn't support CUDA-style default stream sync
        return torch.xpu.current_stream(device_index)

    def pin_memory(self, tensor, align_bytes=1):
        if align_bytes == 1:
            return tensor.pin_memory(device=self.current_device_name())
        elif align_bytes == 0:
            # Use AsyncIOBuilder for aligned allocation
            self.aio_handle = AsyncIOBuilder().load().aio_handle(...)
            aligned_t = self.aio_handle.new_cpu_locked_tensor(...)
            self.aligned_tensors.append([aligned_t.data_ptr(), ...])
            return aligned_t

    def is_bf16_supported(self):
        return True

    def is_fp16_supported(self):
        return True

    def build_extension(self):
        from intel_extension_for_pytorch.xpu.cpp_extension import DpcppBuildExtension
        return DpcppBuildExtension

    def visible_devices_envs(self):
        return ['ZE_AFFINITY_MASK']
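
The version check in use_host_timers compares strings, and string comparison is lexicographic, so a hypothetical version such as '2.10' would sort below '2.6'. The sketch below shows one possible numeric comparison using only the standard library. The helper names version_tuple and needs_host_timers are hypothetical, the version strings are illustrative, and this is not how the DeepSpeed source performs the check.

```python
import re


def version_tuple(version):
    """Extract (major, minor) as integers from a string like '2.5.10+xpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2)))


def needs_host_timers(ipex_version, threshold=(2, 6)):
    """Return True when the version is numerically below the threshold."""
    return version_tuple(ipex_version) < threshold


lexicographic = "2.10" < "2.6"               # string comparison, character by character
numeric_new = needs_host_timers("2.10.0+xpu")  # numeric comparison of (2, 10) vs (2, 6)
numeric_old = needs_host_timers("2.5.10+xpu")  # numeric comparison of (2, 5) vs (2, 6)
print(lexicographic, numeric_new, numeric_old)
```

The two approaches agree for single-digit minor versions and diverge only once a component reaches two digits.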

Import

from deepspeed.accelerator.xpu_accelerator import XPU_Accelerator

I/O Contract

Inputs

Name	Type	Required	Description
device_index	int	Optional	XPU device index
seed	int	Required	Random seed for XPU RNG
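align_bytes	int	Optional	Alignment mode for pin_memory: 1 (default) uses tensor.pin_memory; 0 allocates an aligned page-locked buffer through the AsyncIO handle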

Outputs

Name	Type	Description
device	torch.device	XPU device object
device_count	int	Number of XPU devices
memory_bytes	int	XPU memory in bytes
communication_backend	str	'ccl' or 'xccl'
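
The 'ccl' or 'xccl' value follows from whether the oneCCL bindings are present. A minimal sketch of that decision is shown below; it probes for the module with importlib.util.find_spec instead of importing it, which is an assumption made for this example and not necessarily how the accelerator module detects the bindings. The function name pick_comm_backend is hypothetical.

```python
import importlib.util


def pick_comm_backend(module_name="oneccl_bindings_for_pytorch"):
    """Return 'ccl' if the named top-level module can be found, else 'xccl'."""
    found = importlib.util.find_spec(module_name) is not None
    return "ccl" if found else "xccl"


backend = pick_comm_backend()
print(backend)
```

Probing with find_spec avoids the side effects of importing the bindings just to decide on a backend name.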

Usage Examples

# Select the XPU accelerator before DeepSpeed resolves its backend
import os
import torch  # needed below for torch.randn
os.environ['DS_ACCELERATOR'] = 'xpu'

from deepspeed.accelerator import get_accelerator
accelerator = get_accelerator()

print(f"Device: {accelerator.device_name()}")  # 'xpu'
print(f"Backend: {accelerator.communication_backend_name()}")  # 'ccl' or 'xccl'

# Device management
print(f"Device count: {accelerator.device_count()}")
accelerator.set_device(0)
print(f"Current device: {accelerator.current_device_name()}")

# Precision support - both available
print(f"FP16: {accelerator.is_fp16_supported()}")  # True
print(f"BF16: {accelerator.is_bf16_supported()}")  # True
print(f"Triton: {accelerator.is_triton_supported()}")  # False

# Memory operations
total = accelerator.total_memory(0)
allocated = accelerator.memory_allocated(0)
available = accelerator.available_memory(0)

# Pin memory with alignment options
tensor = torch.randn(1000)
pinned_tensor = accelerator.pin_memory(tensor, align_bytes=1)
aligned_tensor = accelerator.pin_memory(tensor, align_bytes=0)

# Note: default_stream delegates to current_stream (workaround)
default = accelerator.default_stream()
current = accelerator.current_stream()
# Compare by equality; each call may return a distinct Python wrapper object
print(f"Streams same: {default == current}")  # expected True

# Synchronization
accelerator.synchronize()

Related Pages

Abstract Accelerator - Base interface
Real Accelerator - Accelerator selection
CUDA Accelerator - NVIDIA alternative
