Implementation:Deepspeedai DeepSpeed XPU Accelerator
| Knowledge Sources | |
|---|---|
| Domains | Accelerator, Intel GPU Backend |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Intel XPU (GPU) accelerator backend enabling DeepSpeed training on Intel discrete and integrated GPUs.
Description
The XPU_Accelerator class implements the DeepSpeedAccelerator interface for Intel XPU (GPU) devices. It wraps the torch.xpu APIs, optionally integrating intel_extension_for_pytorch (IPEX), and uses inductor as the compile backend. The communication backend is ccl when oneccl_bindings_for_pytorch can be imported and xccl otherwise. The class contains several XPU-specific workarounds: host timers are used for IPEX versions below 2.6 because of event issues; default_stream returns current_stream because torch.xpu does not support CUDA-style default-stream synchronization; and pin_memory performs an aligned allocation through AsyncIOBuilder's aio_handle when align_bytes=0, recording each aligned tensor so that is_pinned checks can recognize it. Both FP16 and BF16 are supported. Device visibility is controlled through ZE_AFFINITY_MASK, and build_extension uses DpcppBuildExtension from IPEX or torch.
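The aligned-tensor tracking mentioned above can be pictured as a list of address ranges that is consulted when a pinned-memory check runs. The following is a minimal pure-Python sketch of that idea, not the DeepSpeed implementation: the class name AlignedRangeTracker and its methods are hypothetical, and plain integers stand in for the values a tensor's data_ptr() would return.

```python
class AlignedRangeTracker:
    """Hypothetical helper that remembers address ranges of aligned buffers."""

    def __init__(self):
        # Each entry is [address of first element, address of last element].
        self.ranges = []

    def track(self, begin, end):
        """Record one aligned buffer by its first and last element address."""
        self.ranges.append([begin, end])

    def contains(self, ptr):
        """Return True if ptr falls inside any recorded buffer."""
        return any(begin <= ptr <= end for begin, end in self.ranges)


tracker = AlignedRangeTracker()
tracker.track(0x1000, 0x1FFC)      # pretend buffer spanning these addresses

print(tracker.contains(0x1800))    # address inside the tracked buffer
print(tracker.contains(0x3000))    # address outside every tracked buffer
```

A range check of this kind lets a view or slice of an aligned buffer be recognized as pinned even though it is a different tensor object from the one originally returned.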
Usage
Use when training on Intel GPU hardware (Arc, Data Center Max series). Requires torch.xpu support, optionally with intel_extension_for_pytorch. Set DS_ACCELERATOR=xpu to explicitly select this backend.
Code Reference
Source Location
- Repository: DeepSpeed
- File: accelerator/xpu_accelerator.py
Signature
class XPU_Accelerator(DeepSpeedAccelerator):

    def __init__(self):
        self._name = 'xpu'
        if oneccl_imported_p:
            self._communication_backend_name = 'ccl'
        else:
            self._communication_backend_name = 'xccl'
        self._compile_backend = "inductor"
        self.aligned_tensors = []
        self.class_dict = None

    def is_synchronized_device(self):
        return False

    def use_host_timers(self):
        # WA for XPU event issues in IPEX < 2.6
        if ipex.__version__ < '2.6':
            return True
        return self.is_synchronized_device()

    def default_stream(self, device_index=None):
        # torch.xpu doesn't support CUDA-style default stream sync
        return torch.xpu.current_stream(device_index)

    def pin_memory(self, tensor, align_bytes=1):
        if align_bytes == 1:
            return tensor.pin_memory(device=self.current_device_name())
        elif align_bytes == 0:
            # Use AsyncIOBuilder for aligned allocation
            self.aio_handle = AsyncIOBuilder().load().aio_handle(...)
            aligned_t = self.aio_handle.new_cpu_locked_tensor(...)
            self.aligned_tensors.append([aligned_t.data_ptr(), ...])
            return aligned_t

    def is_bf16_supported(self):
        return True

    def is_fp16_supported(self):
        return True

    def build_extension(self):
        from intel_extension_for_pytorch.xpu.cpp_extension import DpcppBuildExtension
        return DpcppBuildExtension

    def visible_devices_envs(self):
        return ['ZE_AFFINITY_MASK']
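One caveat about the use_host_timers excerpt: comparing version strings with < is lexicographic, so a string such as '2.10.0' sorts before '2.6'. The sketch below shows a numeric comparison as one possible alternative. It is not taken from the DeepSpeed source; the helper names version_tuple and needs_host_timers are hypothetical, and the version strings are made-up inputs.

```python
import re


def version_tuple(version):
    """Extract the leading numeric components, e.g. '2.5.10+xpu' -> (2, 5, 10)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_host_timers(ipex_version, threshold="2.6"):
    """Hypothetical check: True when ipex_version is numerically below threshold."""
    return version_tuple(ipex_version) < version_tuple(threshold)


# Lexicographic string comparison misorders multi-digit components.
print("2.10.0" < "2.6")               # string comparison treats 2.10.0 as older
print(needs_host_timers("2.10.0"))    # numeric comparison does not
print(needs_host_timers("2.5.10+xpu"))
```

Parsing only the leading numeric part keeps local build suffixes (the part after a +) from affecting the result.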
Import
from deepspeed.accelerator.xpu_accelerator import XPU_Accelerator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| device_index | int | Optional | XPU device index |
| seed | int | Required | Random seed for XPU RNG |
| align_bytes | int | Optional | pin_memory mode: 1 (default) pins via tensor.pin_memory; 0 allocates an aligned buffer via AsyncIOBuilder |
Outputs
| Name | Type | Description |
|---|---|---|
| device | torch.device | XPU device object |
| device_count | int | Number of XPU devices |
| memory_bytes | int | XPU memory in bytes |
| communication_backend | str | 'ccl' or 'xccl' |
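The 'ccl' or 'xccl' value depends on whether the oneCCL bindings can be imported. A minimal sketch of that kind of probe follows, assuming an import attempt is the deciding factor as the oneccl_imported_p flag in the signature suggests. The function pick_communication_backend is hypothetical, and the demonstration uses stand-in module names so it runs without Intel software installed.

```python
import importlib


def pick_communication_backend(bindings_module="oneccl_bindings_for_pytorch"):
    """Hypothetical probe: 'ccl' if the bindings module imports, else 'xccl'."""
    try:
        importlib.import_module(bindings_module)
    except ImportError:
        # Covers ModuleNotFoundError as well, since it subclasses ImportError.
        return "xccl"
    return "ccl"


# Stand-ins: 'json' always imports, the second name is assumed not to exist.
print(pick_communication_backend("json"))
print(pick_communication_backend("no_such_bindings_module_for_demo"))
```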
Usage Examples
# Set XPU accelerator (must happen before the accelerator is first queried)
import os
os.environ['DS_ACCELERATOR'] = 'xpu'
import torch
from deepspeed.accelerator import get_accelerator
accelerator = get_accelerator()
print(f"Device: {accelerator.device_name()}") # 'xpu'
print(f"Backend: {accelerator.communication_backend_name()}") # 'ccl' or 'xccl'
# Device management
print(f"Device count: {accelerator.device_count()}")
accelerator.set_device(0)
print(f"Current device: {accelerator.current_device_name()}")
# Precision support - both available
print(f"FP16: {accelerator.is_fp16_supported()}") # True
print(f"BF16: {accelerator.is_bf16_supported()}") # True
print(f"Triton: {accelerator.is_triton_supported()}") # False
# Memory operations
total = accelerator.total_memory(0)
allocated = accelerator.memory_allocated(0)
available = accelerator.available_memory(0)
# Pin memory with alignment options
tensor = torch.randn(1000)
pinned_tensor = accelerator.pin_memory(tensor, align_bytes=1)
# align_bytes=0 goes through AsyncIOBuilder, so the async I/O op must be loadable
aligned_tensor = accelerator.pin_memory(tensor, align_bytes=0)
# Note: default_stream returns current_stream
default = accelerator.default_stream()
current = accelerator.current_stream()
# Compare with ==, not `is`: each call may return a new Stream wrapper object
print(f"Streams same: {default == current}") # True (workaround)
# Synchronization
accelerator.synchronize()
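Because visible_devices_envs returns ['ZE_AFFINITY_MASK'], a launcher can restrict a worker process to specific GPUs by writing a comma-separated device list into that variable. The sketch below illustrates the idea on a plain dictionary. The function with_visible_devices is a hypothetical helper, not a DeepSpeed API, and the device indices are example values.

```python
def with_visible_devices(env, device_ids, env_names=("ZE_AFFINITY_MASK",)):
    """Return a copy of env with each visibility variable set to the device list."""
    updated = dict(env)                              # leave the caller's dict untouched
    mask = ",".join(str(index) for index in device_ids)
    for name in env_names:
        updated[name] = mask
    return updated


base_env = {"PATH": "/usr/bin"}
child_env = with_visible_devices(base_env, [0, 2])

print(child_env["ZE_AFFINITY_MASK"])       # mask exposing devices 0 and 2
print("ZE_AFFINITY_MASK" in base_env)      # original environment is unchanged
```

The resulting dictionary could be passed as the env argument when spawning a worker, so the mask applies only to that child process.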
Related Pages
- Abstract Accelerator - Base interface
- Real Accelerator - Accelerator selection
- CUDA Accelerator - NVIDIA alternative