Principle:Deepspeedai DeepSpeed Accelerator Abstraction

Knowledge Sources	DeepSpeed DeepSpeed Accelerator Abstraction
Domains	Hardware_Abstraction, Device_Management
Last Updated	2026-02-09 00:00 GMT

Overview

Accelerator Abstraction is a polymorphic layer that gives DeepSpeed a uniform interface to heterogeneous hardware accelerators, so the same training and inference code runs on different compute backends without modification.

Description

Accelerator Abstraction is the design pattern that decouples DeepSpeed's training and inference logic from any specific hardware vendor. At its core is the DeepSpeedAccelerator abstract base class (ABC), which defines a comprehensive interface covering device management, memory allocation, synchronization, random number generation, communication backends, and operator compilation. Each supported hardware platform provides a concrete implementation of this ABC.

The abstraction covers the following categories of operations:

Device management: Device selection, count, current device, device name, and capability queries
Memory management: Allocation, deallocation, caching, pinned (page-locked) memory, and memory statistics
Synchronization: Stream creation, event management, device synchronization, and stream synchronization
Communication: Backend selection (NCCL, Gloo, CCL, HCCL) and distributed process group initialization
Random state: RNG seeding, state save/restore, and generator management for reproducibility
Operator compilation: Compiler flags, include paths, and JIT compilation support for device-specific kernels

The first call to the get_accelerator() factory function detects the available hardware, creates the appropriate concrete accelerator instance, and caches it. Later calls return the same instance. All subsequent DeepSpeed operations use this instance rather than calling vendor-specific APIs directly.
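
The detect-then-cache behavior can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DeepSpeed's implementation: the probe table, detect_backend, and get_backend_name are hypothetical names, and the lambdas stand in for real availability checks such as torch.cuda.is_available().

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical probe table entry: (backend name, availability check).
# Probes are tried in list order, so the list encodes backend priority.
Probe = Tuple[str, Callable[[], bool]]

_cached_name: Optional[str] = None  # process-wide cache for the detected backend


def detect_backend(probes: List[Probe], fallback: str = "cpu") -> str:
    """Return the first backend whose probe succeeds, else the fallback."""
    for name, is_available in probes:
        if is_available():
            return name
    return fallback


def get_backend_name(probes: List[Probe]) -> str:
    """Detect once, then return the cached result on every later call."""
    global _cached_name
    if _cached_name is None:
        _cached_name = detect_backend(probes)
    return _cached_name


# The lambdas are stand-ins for real hardware checks.
first = get_backend_name([("cuda", lambda: False), ("xpu", lambda: True)])
# A later call ignores its probes because the choice is already cached.
second = get_backend_name([("cuda", lambda: True)])
print(first, second)
```

Caching the result is what keeps the choice stable for the lifetime of the process: the second call returns the original answer even though its probes would now select a different backend.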

Usage

Call deepspeed.accelerator.get_accelerator() to obtain the active accelerator backend. All device operations (memory allocation, synchronization, stream management) should go through this object rather than calling torch.cuda or vendor-specific APIs directly. This ensures portability across hardware platforms.
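
As a rough illustration of this usage rule, the sketch below writes a helper purely against the accelerator interface. StubAccelerator and describe are hypothetical, and the stub's return values are invented for the demo. The method names follow those mentioned on this page, while device_count() and the device_index parameter are assumptions about the interface. With DeepSpeed installed, the object returned by get_accelerator() would be passed in place of the stub.

```python
class StubAccelerator:
    """Hypothetical stand-in exposing a few accelerator-style methods."""

    def device_name(self, device_index=None):
        # Mirrors the "name" or "name:index" convention used for device strings.
        return "fake" if device_index is None else f"fake:{device_index}"

    def device_count(self):
        return 1

    def communication_backend_name(self):
        return "gloo"

    def synchronize(self, device_index=None):
        pass  # nothing to wait for on this stub


def describe(accel) -> dict:
    """Collect device facts using only the accelerator interface."""
    accel.synchronize()
    return {
        "device": accel.device_name(0),
        "count": accel.device_count(),
        "backend": accel.communication_backend_name(),
    }


info = describe(StubAccelerator())
print(info)
```

Because describe never names a vendor module, the same function works unchanged with any object that implements these methods.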

Theoretical Basis

Strategy pattern / polymorphic dispatch: The accelerator abstraction implements the classic Strategy pattern, in which a family of algorithms (here, hardware-specific operations) is encapsulated behind a common interface and made interchangeable. The concrete strategy is selected once, based on hardware detection, and reused for the rest of the process.

Abstraction layers:

Level 0 (ABC): DeepSpeedAccelerator defines the contract -- every accelerator must implement device_name(), current_device(), mem_get_info(), synchronize(), etc.
Level 1 (Concrete): Each backend (CUDA_Accelerator, XPU_Accelerator, NPU_Accelerator, etc.) maps the abstract interface to vendor-specific APIs (torch.cuda, torch.xpu, torch_npu, etc.)
Level 2 (Factory): Real_Accelerator / get_accelerator() performs runtime detection and caches the singleton instance.

Key design invariants:

Exactly one accelerator is active per process
The accelerator is determined once at startup and does not change
All vendor-specific API calls are routed through the accelerator interface
Operator builders query the accelerator for compiler flags and include paths
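
One way these invariants could be enforced is sketched below. Everything here is hypothetical and simplified: ToyAccelerator, its compiler_flags() and include_paths() methods, build_command, and the toy set_accelerator/get_accelerator pair are illustrative names, not DeepSpeed's actual functions or signatures.

```python
class ToyAccelerator:
    """Hypothetical backend carrying the build settings op builders need."""

    def __init__(self, name, compiler_flags, include_paths):
        self.name = name
        self._flags = list(compiler_flags)
        self._includes = list(include_paths)

    def compiler_flags(self):
        return list(self._flags)

    def include_paths(self):
        return list(self._includes)


_active = None  # the single accelerator for this process


def set_accelerator(accel):
    """Install the accelerator once and refuse any later change."""
    global _active
    if _active is not None:
        raise RuntimeError("accelerator already set for this process")
    _active = accel


def get_accelerator():
    """Return the active accelerator, failing loudly if none was installed."""
    if _active is None:
        raise RuntimeError("no accelerator has been set")
    return _active


def build_command(source):
    """Toy op builder: all build settings come from the active accelerator."""
    accel = get_accelerator()
    includes = [f"-I{path}" for path in accel.include_paths()]
    return ["cc", *accel.compiler_flags(), *includes, source]


set_accelerator(ToyAccelerator("toy", ["-O3"], ["/opt/toy/include"]))
cmd = build_command("kernel.c")
print(cmd)
```

Raising on a second set_accelerator call is a deliberately strict choice for the sketch; it makes the "determined once" invariant visible as an error instead of a silent swap.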

Pseudo-code:

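# Abstract accelerator pattern (simplified sketch, not the full DeepSpeed interface)
from abc import ABC, abstractmethod
from typing import Tuple

import torch

class DeepSpeedAccelerator(ABC):
    @abstractmethod
    def device_name(self) -> str: ...
    @abstractmethod
    def current_device(self) -> int: ...
    @abstractmethod
    def mem_get_info(self) -> Tuple[int, int]: ...
    @abstractmethod
    def synchronize(self, device=None): ...
    @abstractmethod
    def communication_backend_name(self) -> str: ...

# A concrete backend must implement every abstract method to be instantiable
class CUDA_Accelerator(DeepSpeedAccelerator):
    def device_name(self): return "cuda"
    def current_device(self): return torch.cuda.current_device()
    def mem_get_info(self): return torch.cuda.mem_get_info()  # (free, total) in bytes
    def synchronize(self, device=None): torch.cuda.synchronize(device)
    def communication_backend_name(self): return "nccl"

# Factory selects the right backend on first use and caches the singleton
_accelerator = None

def get_accelerator():
    global _accelerator
    if _accelerator is None:
        if torch.cuda.is_available():
            _accelerator = CUDA_Accelerator()
        elif hasattr(torch, 'xpu') and torch.xpu.is_available():
            _accelerator = XPU_Accelerator()  # defined analogously to CUDA_Accelerator
        # ... other backends ...
    return _accelerator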

Related Pages

Implemented By

Implementation:Deepspeedai_DeepSpeed_Abstract_Accelerator — Abstract base class defining the accelerator interface
Implementation:Deepspeedai_DeepSpeed_CPU_Accelerator — CPU fallback accelerator for non-GPU environments
Implementation:Deepspeedai_DeepSpeed_CUDA_Accelerator — NVIDIA CUDA GPU backend
Implementation:Deepspeedai_DeepSpeed_HPU_Accelerator — Intel Habana Gaudi (HPU) backend
Implementation:Deepspeedai_DeepSpeed_MLU_Accelerator — Cambricon MLU backend
Implementation:Deepspeedai_DeepSpeed_Real_Accelerator — Runtime accelerator detection and singleton factory
Implementation:Deepspeedai_DeepSpeed_SDAA_Accelerator — Teco SDAA backend
Implementation:Deepspeedai_DeepSpeed_XPU_Accelerator — Intel XPU (Data Center GPU Max) backend
Implementation:Deepspeedai_DeepSpeed_MPS_Accelerator — Apple Metal Performance Shaders backend
Implementation:Deepspeedai_DeepSpeed_NPU_Accelerator — Huawei Ascend NPU backend
