Implementation:Deepspeedai DeepSpeed XPU Accelerator

From Leeroopedia


Knowledge Sources
Domains: Accelerator, Intel GPU Backend
Last Updated: 2026-02-09 00:00 GMT

Overview

Intel XPU (GPU) accelerator backend enabling DeepSpeed training on Intel discrete and integrated GPUs.

Description

The XPU_Accelerator class implements the DeepSpeedAccelerator interface for Intel XPU (GPU) devices. It wraps the torch.xpu APIs, with optional intel_extension_for_pytorch (IPEX) integration. The communication backend is ccl when oneccl_bindings_for_pytorch is imported and xccl otherwise; the compile backend is inductor.

The class contains several workarounds for XPU-specific behavior. Host timers are used for IPEX versions below 2.6 because of XPU event issues. default_stream returns the current stream, since torch.xpu does not support CUDA-style default-stream synchronization. pin_memory performs aligned allocation through an AsyncIOBuilder aio_handle when align_bytes=0 and records the resulting tensors so that is_pinned checks can recognize them.

Both FP16 and BF16 are supported. Device visibility is controlled through ZE_AFFINITY_MASK. The build extension is DpcppBuildExtension, taken from IPEX or torch.
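
The bookkeeping behind the aligned pin_memory path can be illustrated without any Intel hardware. The sketch below is not the DeepSpeed implementation: it assumes each tracked entry is a [start, end] address pair (the signature on this page elides everything after the first field), and the class name AlignedRegistry and its methods are hypothetical. Plain integers stand in for tensor.data_ptr() values.

```python
class AlignedRegistry:
    """Hypothetical stand-in for the aligned_tensors list kept by the accelerator."""

    def __init__(self):
        # Each entry is an assumed [start_address, end_address] pair (inclusive).
        self.aligned_tensors = []

    def track(self, start, num_bytes):
        """Record the address range covered by an aligned, page-locked buffer."""
        if num_bytes <= 0:
            raise ValueError("num_bytes must be positive")
        self.aligned_tensors.append([start, start + num_bytes - 1])

    def is_tracked(self, address):
        """Return True if the address falls inside any recorded range."""
        return any(begin <= address <= end for begin, end in self.aligned_tensors)


registry = AlignedRegistry()
registry.track(start=0x1000, num_bytes=4096)  # covers 0x1000..0x1FFF
print(registry.is_tracked(0x1800))  # address inside the tracked range
print(registry.is_tracked(0x2000))  # first address past the range
```

A range lookup of this kind lets a tensor that starts inside an aligned buffer be recognized as pinned, at the cost of a linear scan over the tracked buffers.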

Usage

Use this backend when training on Intel GPU hardware (Arc, Data Center Max series). It requires a PyTorch build with torch.xpu support; intel_extension_for_pytorch is optional. Set DS_ACCELERATOR=xpu to select this backend explicitly.
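
As a minimal sketch of the environment setup, the helper below builds the variables a launcher might export before starting a worker process. The function name xpu_env is hypothetical. DS_ACCELERATOR and ZE_AFFINITY_MASK are the variables named on this page, and the sketch assumes the comma-separated device-index form of ZE_AFFINITY_MASK.

```python
import os


def xpu_env(device_ids, base_env=None):
    """Return a copy of base_env with XPU selection variables set (hypothetical helper)."""
    ids = list(device_ids)
    if not ids:
        raise ValueError("at least one device index is required")
    if any((not isinstance(i, int)) or i < 0 for i in ids):
        raise ValueError("device indices must be non-negative integers")
    # Copy so the caller's mapping (or os.environ) is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["DS_ACCELERATOR"] = "xpu"  # select the XPU backend explicitly
    env["ZE_AFFINITY_MASK"] = ",".join(str(i) for i in ids)  # limit visible devices
    return env


env = xpu_env([0, 2], base_env={})
print(env)
```

The returned dictionary can be passed as the env argument of subprocess.Popen, which keeps the parent process environment untouched.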

Code Reference

Source Location

Signature

class XPU_Accelerator(DeepSpeedAccelerator):
    def __init__(self):
        self._name = 'xpu'
        if oneccl_imported_p:
            self._communication_backend_name = 'ccl'
        else:
            self._communication_backend_name = 'xccl'
        self._compile_backend = "inductor"
        self.aligned_tensors = []
        self.class_dict = None

    def is_synchronized_device(self):
        return False

    def use_host_timers(self):
        # WA for XPU event issues in IPEX < 2.6
        if ipex.__version__ < '2.6':
            return True
        return self.is_synchronized_device()

    def default_stream(self, device_index=None):
        # torch.xpu doesn't support CUDA-style default stream sync
        return torch.xpu.current_stream(device_index)

    def pin_memory(self, tensor, align_bytes=1):
        if align_bytes == 1:
            return tensor.pin_memory(device=self.current_device_name())
        elif align_bytes == 0:
            # Use AsyncIOBuilder for aligned allocation
            self.aio_handle = AsyncIOBuilder().load().aio_handle(...)
            aligned_t = self.aio_handle.new_cpu_locked_tensor(...)
            self.aligned_tensors.append([aligned_t.data_ptr(), ...])
            return aligned_t

    def is_bf16_supported(self):
        return True

    def is_fp16_supported(self):
        return True

    def build_extension(self):
        from intel_extension_for_pytorch.xpu.cpp_extension import DpcppBuildExtension
        return DpcppBuildExtension

    def visible_devices_envs(self):
        return ['ZE_AFFINITY_MASK']
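
The use_host_timers check above compares version strings directly. String comparison is lexicographic, so it would order a hypothetical "2.10" before "2.6". The sketch below shows a numeric comparison under stated assumptions: the helper names version_tuple and needs_host_timers are hypothetical and not part of DeepSpeed, and version strings are assumed to start with dotted integers, optionally followed by a suffix such as "+xpu".

```python
import re


def version_tuple(version):
    """Extract the leading numeric components, e.g. '2.5.10+xpu' -> (2, 5, 10)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version prefix in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_host_timers(ipex_version, threshold="2.6"):
    """Return True when ipex_version is numerically below the threshold."""
    return version_tuple(ipex_version) < version_tuple(threshold)


print("2.10.0" < "2.6")                  # lexicographic string comparison
print(needs_host_timers("2.10.0"))       # numeric comparison
print(needs_host_timers("2.5.10+xpu"))   # suffix is ignored
```

Tuple comparison treats (2, 6, 0) as not less than (2, 6), so a 2.6.0 release falls on the same side of the threshold as 2.6.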

Import

from deepspeed.accelerator.xpu_accelerator import XPU_Accelerator

I/O Contract

Inputs

| Name | Type | Required | Description |
| device_index | int | Optional | XPU device index |
| seed | int | Required | Random seed for XPU RNG |
| align_bytes | int | Optional | Alignment for pin_memory (1 or 0) |

Outputs

| Name | Type | Description |
| device | torch.device | XPU device object |
| device_count | int | Number of XPU devices |
| memory_bytes | int | XPU memory in bytes |
| communication_backend | str | 'ccl' or 'xccl' |
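
The communication_backend value follows the rule shown in the signature: 'ccl' when the oneCCL bindings are available, 'xccl' otherwise. The sketch below reproduces that rule in isolation. The function names module_available and pick_backend are hypothetical, and probing with importlib.util.find_spec is an assumption made for illustration; it only checks that the module can be located, not that importing it succeeds.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_backend(oneccl_available):
    """Mirror the rule in the signature: 'ccl' with oneCCL bindings, else 'xccl'."""
    return "ccl" if oneccl_available else "xccl"


backend = pick_backend(module_available("oneccl_bindings_for_pytorch"))
print(backend)
```

Keeping the probe separate from the decision makes the selection rule easy to test on machines that have no Intel software stack installed.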

Usage Examples

# Set XPU accelerator
import os
import torch  # needed for torch.randn below
os.environ['DS_ACCELERATOR'] = 'xpu'

from deepspeed.accelerator import get_accelerator
accelerator = get_accelerator()

print(f"Device: {accelerator.device_name()}")  # 'xpu'
print(f"Backend: {accelerator.communication_backend_name()}")  # 'ccl' or 'xccl'

# Device management
print(f"Device count: {accelerator.device_count()}")
accelerator.set_device(0)
print(f"Current device: {accelerator.current_device_name()}")

# Precision support - both available
print(f"FP16: {accelerator.is_fp16_supported()}")  # True
print(f"BF16: {accelerator.is_bf16_supported()}")  # True
print(f"Triton: {accelerator.is_triton_supported()}")  # False

# Memory operations
total = accelerator.total_memory(0)
allocated = accelerator.memory_allocated(0)
available = accelerator.available_memory(0)

# Pin memory with alignment options
tensor = torch.randn(1000)
pinned_tensor = accelerator.pin_memory(tensor, align_bytes=1)   # regular pinned allocation
aligned_tensor = accelerator.pin_memory(tensor, align_bytes=0)  # aligned allocation via AsyncIOBuilder

# Note: default_stream returns the current stream on XPU
default = accelerator.default_stream()
current = accelerator.current_stream()
# Compare with ==, since each call may return a distinct Python wrapper object
print(f"Streams same: {default == current}")  # True (workaround)

# Synchronization
accelerator.synchronize()

Related Pages
