Principle:Deepspeedai DeepSpeed Accelerator Abstraction
| Knowledge Sources | |
|---|---|
| Domains | Hardware_Abstraction, Device_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A polymorphic abstraction layer that provides a uniform interface for heterogeneous hardware accelerators, enabling DeepSpeed to run on diverse compute backends without code changes.
Description
Accelerator Abstraction is the design pattern that decouples DeepSpeed's training and inference logic from any specific hardware vendor. At its core is the DeepSpeedAccelerator abstract base class (ABC), which defines a comprehensive interface covering device management, memory allocation, synchronization, random number generation, communication backends, and operator compilation. Each supported hardware platform provides a concrete implementation of this ABC.
The abstraction covers the following categories of operations:
- Device management: Device selection, count, current device, device name, and capability queries
- Memory management: Allocation, deallocation, caching, pinned (page-locked) memory, and memory statistics
- Synchronization: Stream creation, event management, device synchronization, and stream synchronization
- Communication: Backend selection (NCCL, Gloo, CCL, HCCL) and distributed process group initialization
- Random state: RNG seeding, state save/restore, and generator management for reproducibility
- Operator compilation: Compiler flags, include paths, and JIT compilation support for device-specific kernels
When DeepSpeed initializes, the get_accelerator() factory function detects the available hardware and returns the appropriate concrete accelerator instance. All subsequent DeepSpeed operations use this instance rather than calling vendor-specific APIs directly.
Usage
Call deepspeed.accelerator.get_accelerator() to obtain the active accelerator backend. All device operations (memory allocation, synchronization, stream management) should go through this object rather than calling torch.cuda or vendor-specific APIs directly. This ensures portability across hardware platforms.
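As a minimal sketch of what "going through this object" looks like, the helper below is written only against method names that appear in the interface (device_name(), current_device(), synchronize(), communication_backend_name()). The StubAccelerator class and the describe_device() helper are hypothetical and exist only so the example runs without DeepSpeed or a GPU; in a real program the object would come from deepspeed.accelerator.get_accelerator().

```python
# Sketch: a device-agnostic helper written against the accelerator interface.
# In a real program `accel` would be deepspeed.accelerator.get_accelerator();
# StubAccelerator is a hypothetical stand-in so the example runs anywhere.


class StubAccelerator:
    """Hypothetical stand-in exposing a few method names from the interface."""

    def __init__(self):
        self.sync_calls = 0  # counts synchronize() calls for illustration

    def device_name(self):
        return "stub"

    def current_device(self):
        return 0

    def synchronize(self, device=None):
        self.sync_calls += 1

    def communication_backend_name(self):
        return "gloo"


def describe_device(accel):
    """Build a device string and report the comm backend without naming a vendor."""
    accel.synchronize()  # wait for outstanding work before querying
    device = f"{accel.device_name()}:{accel.current_device()}"
    return {"device": device, "backend": accel.communication_backend_name()}


accel = StubAccelerator()
info = describe_device(accel)
print(info)
```

Because describe_device() never mentions torch.cuda or any other vendor module, the same function should work unchanged with whichever backend the factory returns.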
Theoretical Basis
Strategy pattern / polymorphic dispatch: The accelerator abstraction implements the classic Strategy pattern, in which a family of algorithms (here, hardware-specific operations) is encapsulated behind a common interface and made interchangeable. The concrete strategy is selected at initialization time based on hardware detection.
Abstraction layers:
- Level 0 (ABC): DeepSpeedAccelerator defines the contract -- every accelerator must implement device_name(), current_device(), mem_get_info(), synchronize(), etc.
- Level 1 (Concrete): Each backend (CUDA_Accelerator, XPU_Accelerator, NPU_Accelerator, etc.) maps the abstract interface to vendor-specific APIs (torch.cuda, torch.xpu, torch_npu, etc.)
- Level 2 (Factory): Real_Accelerator / get_accelerator() performs runtime detection and caches the singleton instance.
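The Level 2 factory can be sketched as an ordered list of detection probes plus a cached instance. This is an illustration of the detect-once, cache-the-singleton idea, not DeepSpeed's actual detection logic: the backend classes, the probe lambdas, and the _BACKENDS and _active names are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Optional, Tuple, Type


class Accelerator(ABC):
    @abstractmethod
    def device_name(self) -> str: ...


class FastAccelerator(Accelerator):  # hypothetical GPU-like backend
    def device_name(self) -> str:
        return "fast"


class CPUAccelerator(Accelerator):  # hypothetical fallback backend
    def device_name(self) -> str:
        return "cpu"


# Ordered (probe, backend) pairs; the first probe returning True wins.
# The probes are hypothetical; a real one would query the vendor runtime.
_BACKENDS: List[Tuple[Callable[[], bool], Type[Accelerator]]] = [
    (lambda: False, FastAccelerator),  # pretend no fast device is present
    (lambda: True, CPUAccelerator),    # CPU fallback is always available
]

_active: Optional[Accelerator] = None  # process-wide cached instance


def get_accelerator() -> Accelerator:
    """Detect once, then return the same instance on every later call."""
    global _active
    if _active is None:
        for probe, backend in _BACKENDS:
            if probe():
                _active = backend()
                break
        else:
            raise RuntimeError("no accelerator backend available")
    return _active


first = get_accelerator()
second = get_accelerator()
print(first.device_name(), first is second)
```

Caching the result in a module-level variable means detection runs at most once per process, so every caller sees the same backend object.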
Key design invariants:
- Exactly one accelerator is active per process
- The accelerator is determined once at startup and does not change
- All vendor-specific API calls are routed through the accelerator interface
- Operator builders query the accelerator for compiler flags and include paths
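The last invariant can be illustrated with a toy operator builder that assembles a compile command purely from what the accelerator reports. Everything here is an assumption made for illustration: the compiler_flags() and include_paths() method names, the flag values, the include directory, and the build_command() helper are hypothetical and are not taken from DeepSpeed's operator builder code.

```python
from typing import List


class ToyGpuLike:
    """Hypothetical backend; method names and flag values are illustrative."""

    def compiler_flags(self) -> List[str]:
        return ["-O3", "--use_fast_math"]

    def include_paths(self) -> List[str]:
        return ["/opt/toy_gpu/include"]


class ToyCpu:
    """Hypothetical CPU backend with different flags and no extra includes."""

    def compiler_flags(self) -> List[str]:
        return ["-O3", "-march=native"]

    def include_paths(self) -> List[str]:
        return []


def build_command(accel, source: str, compiler: str = "c++") -> List[str]:
    """Assemble a compile command using only what the accelerator reports."""
    includes = [f"-I{path}" for path in accel.include_paths()]
    return [compiler, *accel.compiler_flags(), *includes, "-c", source]


gpu_cmd = build_command(ToyGpuLike(), "kernel.cpp")
cpu_cmd = build_command(ToyCpu(), "kernel.cpp")
print(" ".join(gpu_cmd))
print(" ".join(cpu_cmd))
```

The builder contains no per-vendor branches; supporting another platform in this sketch would mean adding a backend class, not editing build_command().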
Pseudo-code:
# Abstract accelerator pattern
from abc import ABC, abstractmethod
from typing import Tuple

import torch


class DeepSpeedAccelerator(ABC):
    @abstractmethod
    def device_name(self) -> str: ...

    @abstractmethod
    def current_device(self) -> int: ...

    @abstractmethod
    def mem_get_info(self) -> Tuple[int, int]: ...

    @abstractmethod
    def synchronize(self, device=None): ...

    @abstractmethod
    def communication_backend_name(self) -> str: ...


class CUDA_Accelerator(DeepSpeedAccelerator):
    def device_name(self): return "cuda"
    def current_device(self): return torch.cuda.current_device()
    # Implemented so the class satisfies the ABC; returns (free, total) in bytes
    def mem_get_info(self): return torch.cuda.mem_get_info()
    def synchronize(self, device=None): torch.cuda.synchronize(device)
    def communication_backend_name(self): return "nccl"


# Factory selects the right backend at startup
def get_accelerator():
    if torch.cuda.is_available():
        return CUDA_Accelerator()
    elif hasattr(torch, 'xpu') and torch.xpu.is_available():
        return XPU_Accelerator()  # defined analogously on top of torch.xpu
    # ... other backends ...
Related Pages
Implemented By
- Implementation:Deepspeedai_DeepSpeed_Abstract_Accelerator — Abstract base class defining the accelerator interface
- Implementation:Deepspeedai_DeepSpeed_CPU_Accelerator — CPU fallback accelerator for non-GPU environments
- Implementation:Deepspeedai_DeepSpeed_CUDA_Accelerator — NVIDIA CUDA GPU backend
- Implementation:Deepspeedai_DeepSpeed_HPU_Accelerator — Intel Habana Gaudi (HPU) backend
- Implementation:Deepspeedai_DeepSpeed_MLU_Accelerator — Cambricon MLU backend
- Implementation:Deepspeedai_DeepSpeed_Real_Accelerator — Runtime accelerator detection and singleton factory
- Implementation:Deepspeedai_DeepSpeed_SDAA_Accelerator — Teco SDAA backend
- Implementation:Deepspeedai_DeepSpeed_XPU_Accelerator — Intel XPU (Data Center GPU Max) backend
- Implementation:Deepspeedai_DeepSpeed_MPS_Accelerator — Apple Metal Performance Shaders backend
- Implementation:Deepspeedai_DeepSpeed_NPU_Accelerator — Huawei Ascend NPU backend