Implementation:Deepspeedai DeepSpeed HPU Accelerator
| Knowledge Sources | |
|---|---|
| Domains | Accelerator, Habana Backend |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Intel Habana Gaudi (HPU) accelerator backend enabling DeepSpeed training on Habana AI processors.
Description
The HPU_Accelerator class implements the DeepSpeedAccelerator interface for Intel Habana Gaudi AI accelerators. It wraps the habana_frameworks.torch.hpu APIs, uses hccl (Habana Collective Communication Library) as the communication backend, and uses hpu_backend as the torch.compile backend. On initialization it applies HPU-specific workarounds through environment variables (PT_HPU_LAZY_ACC_PAR_MODE=0, PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES=0) and enables deterministic algorithms via torch.use_deterministic_algorithms(True). Although is_synchronized_device returns False, both resolves_data_dependency and handles_memory_backpressure return True, which signals that the HPU stack handles data dependencies and memory backpressure itself rather than leaving them to the caller, as a typical GPU backend does. FP16 support is checked at runtime via habana_frameworks.torch.utils.experimental._is_fp16_supported(), while BF16 is always reported as supported. HPU graphs are created via hpu.HPUGraph().
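The environment-variable workarounds can be pictured with a small sketch. It is not the DeepSpeed source: the helper name `apply_workarounds` is hypothetical, and the choice to leave values that are already set untouched is an assumption that this page does not confirm.

```python
import os

# The two variables and values named in the description above.
HPU_WORKAROUNDS = {
    "PT_HPU_LAZY_ACC_PAR_MODE": "0",
    "PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES": "0",
}


def apply_workarounds(env=None):
    """Hypothetical helper: set each workaround unless it is already present.

    Assumption: values already in the environment win. The real class may
    behave differently.
    """
    env = os.environ if env is None else env
    for key, value in HPU_WORKAROUNDS.items():
        # setdefault only writes the value when the key is missing.
        env.setdefault(key, value)
    return env
```

Passing a plain dict instead of `os.environ` makes the behavior easy to inspect without touching the real process environment.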
Usage
Use when training on Intel Habana Gaudi or Gaudi2 accelerators. Requires habana_frameworks.torch.hpu to be installed. Set DS_ACCELERATOR=hpu to explicitly select this backend.
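A defensive way to opt in is to check that the Habana package is importable before setting the variable. The sketch below uses only the standard library. `habana_available` and `select_hpu` are hypothetical helper names, and probing the top-level `habana_frameworks` package is an assumption about how the Habana software stack is installed.

```python
import importlib.util
import os


def habana_available():
    """Return True if the top-level habana_frameworks package can be found."""
    return importlib.util.find_spec("habana_frameworks") is not None


def select_hpu(env=None, available=None):
    """Set DS_ACCELERATOR=hpu only when the Habana package is present.

    `available` can be passed explicitly (useful for testing); otherwise it
    is probed. Returns True if the variable was set, False otherwise.
    """
    env = os.environ if env is None else env
    if available is None:
        available = habana_available()
    if available:
        env["DS_ACCELERATOR"] = "hpu"
    return available
```

Such a helper would need to run before DeepSpeed selects its accelerator, that is, before the first `get_accelerator()` call.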
Code Reference
Source Location
- Repository: DeepSpeed
- File: accelerator/hpu_accelerator.py
Signature
class HPU_Accelerator(DeepSpeedAccelerator):
    def __init__(self):
        self._name = 'hpu'
        self._communication_backend_name = 'hccl'
        self._compile_backend = "hpu_backend"
        self.apply_hpu_workarounds()
        import habana_frameworks.torch.hpu as hpu
        self.hpu = hpu
        torch.use_deterministic_algorithms(True)

    def apply_hpu_workarounds(self):
        # Sets PT_HPU_LAZY_ACC_PAR_MODE=0
        # Sets PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES=0
        ...

    def is_synchronized_device(self):
        return False

    def resolves_data_dependency(self):
        return True

    def handles_memory_backpressure(self):
        return True

    def device_name(self, device_index=None):
        return 'hpu'

    def is_fp16_supported(self):
        import habana_frameworks.torch.utils.experimental as htexp
        return htexp._is_fp16_supported()

    def is_bf16_supported(self):
        return True

    def create_graph(self):
        return self.hpu.HPUGraph()
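To illustrate how the three capability flags might be consumed, here is a sketch using a stand-in object rather than a real accelerator. `StubAccelerator` and `plan_sync` are hypothetical, and the reading of each flag (a `True` value means the caller can skip the corresponding explicit step) is an assumption based on the method names, not a statement of DeepSpeed's internal logic.

```python
from dataclasses import dataclass


@dataclass
class StubAccelerator:
    """Stand-in exposing the three capability flags from the signature above."""
    synchronized: bool
    resolves_deps: bool
    handles_backpressure: bool

    def is_synchronized_device(self):
        return self.synchronized

    def resolves_data_dependency(self):
        return self.resolves_deps

    def handles_memory_backpressure(self):
        return self.handles_backpressure


def plan_sync(acc):
    """Hypothetical: list the explicit steps a caller would still perform."""
    steps = []
    if not acc.resolves_data_dependency():
        steps.append("wait on producer stream")
    if not acc.handles_memory_backpressure():
        steps.append("throttle in-flight allocations")
    return steps


hpu_like = StubAccelerator(False, True, True)   # flag values listed for HPU above
other = StubAccelerator(False, False, False)    # a device offering no such guarantees
print(plan_sync(hpu_like))
print(plan_sync(other))
```

With the HPU-like flag values, no explicit steps remain; with all flags `False`, both do.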
Import
from deepspeed.accelerator.hpu_accelerator import HPU_Accelerator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
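| device_index | int | Optional | HPU device index, accepted by methods such as device_name and set_device |
| seed | int | Required | Random seed for the HPU RNG; required only when calling a seeding method such as manual_seed |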
Outputs
| Name | Type | Description |
|---|---|---|
| device | torch.device | HPU device object |
| device_count | int | Number of HPU devices |
| memory_stats | dict | HPU memory statistics |
| communication_backend | str | Always 'hccl' |
Usage Examples
# Select the HPU accelerator (set before get_accelerator() is first called)
import os
os.environ['DS_ACCELERATOR'] = 'hpu'
from deepspeed.accelerator import get_accelerator
accelerator = get_accelerator()
print(f"Device: {accelerator.device_name()}")  # 'hpu'
print(f"Backend: {accelerator.communication_backend_name()}")  # 'hccl'
# Check precision support
print(f"BF16: {accelerator.is_bf16_supported()}")  # True
print(f"FP16: {accelerator.is_fp16_supported()}")  # Depends on HPU generation
# Memory management
total = accelerator.total_memory()
allocated = accelerator.memory_allocated()
print(f"Memory: {allocated}/{total} bytes")
# HPU graphs (model and input_data are placeholders for your module and input tensor)
graph = accelerator.create_graph()
with accelerator.capture_to_graph(graph):
    output = model(input_data)
accelerator.replay_graph(graph)
# Device management
print(f"Device count: {accelerator.device_count()}")
accelerator.set_device(0)
accelerator.synchronize()
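Since FP16 availability varies while BF16 is always reported as supported, a caller may want to derive a precision choice from the two checks. The following is a minimal sketch: `pick_precision` is a hypothetical helper, the preference order (BF16 first, then FP16, then FP32) is an assumption, and the stub class stands in for the object returned by `get_accelerator()` so the snippet runs without Habana hardware.

```python
class StubAccelerator:
    """Stand-in that mimics the two precision checks of an accelerator."""

    def __init__(self, bf16, fp16):
        self._bf16 = bf16
        self._fp16 = fp16

    def is_bf16_supported(self):
        return self._bf16

    def is_fp16_supported(self):
        return self._fp16


def pick_precision(accelerator):
    """Hypothetical helper: prefer bf16, fall back to fp16, then fp32."""
    if accelerator.is_bf16_supported():
        return "bf16"
    if accelerator.is_fp16_supported():
        return "fp16"
    return "fp32"


# An HPU-like device: BF16 reported as supported, FP16 not available.
print(pick_precision(StubAccelerator(bf16=True, fp16=False)))
```

The same function would accept the real accelerator object, since it relies only on `is_bf16_supported()` and `is_fp16_supported()`.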
Related Pages
- Abstract Accelerator - Base interface
- Real Accelerator - Accelerator selection
- CUDA Accelerator - NVIDIA alternative