Implementation:Deepspeedai DeepSpeed HPU Accelerator

Knowledge Sources	DeepSpeed
Domains	Accelerator, Habana Backend
Last Updated	2026-02-09 00:00 GMT

Overview

Intel Habana Gaudi (HPU) accelerator backend enabling DeepSpeed training on Habana AI processors.

Description

The HPU_Accelerator class implements the DeepSpeedAccelerator interface for Intel Habana Gaudi AI accelerators. It wraps the habana_frameworks.torch.hpu APIs, uses hccl (Habana Collective Communication Library) as the communication backend, and uses hpu_backend as the torch.compile backend. On initialization, it applies HPU-specific workarounds through environment variables (PT_HPU_LAZY_ACC_PAR_MODE=0, PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES=0) and enables deterministic algorithms with torch.use_deterministic_algorithms(True). Although is_synchronized_device returns False, both resolves_data_dependency and handles_memory_backpressure return True, which indicates that the HPU is treated as an asynchronous device that nonetheless resolves data dependencies and handles memory backpressure itself, unlike a typical GPU backend. FP16 support is checked at runtime via habana_frameworks.torch.utils.experimental._is_fp16_supported(), while BF16 is always reported as supported. HPU graphs are supported through hpu.HPUGraph().
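
The environment-variable workaround described above can be sketched as follows. This is a minimal illustration, not the DeepSpeed source: the helper name apply_workaround_defaults is hypothetical, and leaving a value the user has already set untouched is an assumption about the intended behavior.

```python
import os

# Variable names and values taken from the description above.
HPU_WORKAROUND_DEFAULTS = {
    "PT_HPU_LAZY_ACC_PAR_MODE": "0",
    "PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES": "0",
}


def apply_workaround_defaults(env=None):
    """Set each workaround variable unless it is already present.

    `env` defaults to os.environ; passing a plain dict lets the helper be
    exercised without touching the real process environment.
    """
    env = os.environ if env is None else env
    for key, value in HPU_WORKAROUND_DEFAULTS.items():
        # Assumption: a value already chosen by the user takes precedence.
        env.setdefault(key, value)
    return env
```

Calling apply_workaround_defaults({}) returns a dict with both variables set to "0", while a key that is already present keeps its existing value.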

Usage

Use this backend when training on Intel Habana Gaudi or Gaudi2 accelerators. It requires habana_frameworks.torch.hpu to be installed. Set DS_ACCELERATOR=hpu to select this backend explicitly.
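
A launcher script may want to confirm that the Habana package is present before requesting this backend. The sketch below is one way to do that using only the standard library. The helper names habana_stack_available and request_hpu_backend are hypothetical, and checking only the top-level habana_frameworks package is a simplification: it does not prove that habana_frameworks.torch.hpu imports cleanly.

```python
import importlib.util
import os


def habana_stack_available():
    """Return True if the top-level habana_frameworks package can be located."""
    return importlib.util.find_spec("habana_frameworks") is not None


def request_hpu_backend(env=None):
    """Set DS_ACCELERATOR=hpu only when the Habana package is locatable.

    Returns True if the variable was set, False otherwise.
    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    if not habana_stack_available():
        return False
    env["DS_ACCELERATOR"] = "hpu"
    return True
```

On a machine without the Habana software, request_hpu_backend() returns False and leaves the environment unchanged, so DeepSpeed can fall back to its usual accelerator selection.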

Code Reference

Source Location

Repository: DeepSpeed
File: accelerator/hpu_accelerator.py

Signature

import torch
from deepspeed.accelerator.abstract_accelerator import DeepSpeedAccelerator


class HPU_Accelerator(DeepSpeedAccelerator):
    def __init__(self):
        self._name = 'hpu'
        self._communication_backend_name = 'hccl'
        self._compile_backend = "hpu_backend"
        self.apply_hpu_workarounds()
        import habana_frameworks.torch.hpu as hpu
        self.hpu = hpu
        torch.use_deterministic_algorithms(True)

    def apply_hpu_workarounds(self):
        # Abridged: the method sets these environment variables.
        #   PT_HPU_LAZY_ACC_PAR_MODE=0
        #   PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES=0
        ...

    def is_synchronized_device(self):
        return False

    def resolves_data_dependency(self):
        return True

    def handles_memory_backpressure(self):
        return True

    def device_name(self, device_index=None):
        return 'hpu'

    def is_fp16_supported(self):
        import habana_frameworks.torch.utils.experimental as htexp
        return htexp._is_fp16_supported()

    def is_bf16_supported(self):
        return True

    def create_graph(self):
        return self.hpu.HPUGraph()
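
The three device-behavior flags in the listing above can be read off any accelerator object in one place. The sketch below uses a stub class so that it runs without Habana software. StubHPUFlags and device_flags are hypothetical names introduced for illustration, and the stub only mirrors the return values shown in the signature.

```python
class StubHPUFlags:
    """Stand-in that mirrors the return values listed in the signature.

    This is NOT the real HPU_Accelerator; it exists only so the sketch
    can run on a machine without Habana software.
    """

    def is_synchronized_device(self):
        return False

    def resolves_data_dependency(self):
        return True

    def handles_memory_backpressure(self):
        return True


def device_flags(accelerator):
    """Collect the three behavior flags from any accelerator-like object."""
    return {
        "synchronized": accelerator.is_synchronized_device(),
        "resolves_data_dependency": accelerator.resolves_data_dependency(),
        "handles_memory_backpressure": accelerator.handles_memory_backpressure(),
    }


flags = device_flags(StubHPUFlags())
```

On a real system the same device_flags helper could be passed the object returned by get_accelerator(), since it relies only on the three methods shown in the signature.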

Import

from deepspeed.accelerator.hpu_accelerator import HPU_Accelerator

I/O Contract

Inputs

Name	Type	Required	Description
device_index	int	Optional	HPU device index
seed	int	Required	Random seed for HPU RNG

Outputs

Name	Type	Description
device	torch.device	HPU device object
device_count	int	Number of HPU devices
memory_stats	dict	HPU memory statistics
communication_backend	str	Always 'hccl'

Usage Examples

# Select the HPU accelerator before the first get_accelerator() call
import os
os.environ['DS_ACCELERATOR'] = 'hpu'

from deepspeed.accelerator import get_accelerator
accelerator = get_accelerator()

print(f"Device: {accelerator.device_name()}")  # 'hpu'
print(f"Backend: {accelerator.communication_backend_name()}")  # 'hccl'

# Check precision support
print(f"BF16: {accelerator.is_bf16_supported()}")  # True
print(f"FP16: {accelerator.is_fp16_supported()}")  # Depends on HPU generation

# Memory management
total = accelerator.total_memory()
allocated = accelerator.memory_allocated()
print(f"Memory: {allocated}/{total} bytes")

# HPU graphs (model and input_data are placeholders for your own module and input tensor)
graph = accelerator.create_graph()
with accelerator.capture_to_graph(graph):
    output = model(input_data)
accelerator.replay_graph(graph)

# Device management
print(f"Device count: {accelerator.device_count()}")
accelerator.set_device(0)
accelerator.synchronize()

Related Pages

Abstract Accelerator - Base interface
Real Accelerator - Accelerator selection
CUDA Accelerator - NVIDIA alternative
