Implementation:Deepspeedai DeepSpeed MLU Accelerator
| Knowledge Sources | |
|---|---|
| Domains | Accelerator, Cambricon Backend |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Cambricon MLU (Machine Learning Unit) accelerator backend enabling DeepSpeed training on Cambricon hardware.
Description
The MLU_Accelerator class implements the DeepSpeedAccelerator interface for Cambricon MLU AI accelerators. It wraps the torch.mlu APIs provided by the torch_mlu extension, uses cncl (the Cambricon Communications Library, Cambricon's counterpart to NCCL) as the communication backend, and uses inductor as the compile backend. All standard device, memory, RNG, and stream/event operations delegate to their torch.mlu equivalents. The implementation supports MLU graphs via torch.mlu.MLUGraph() and CNPX profiling ranges, and it reports Triton JIT compilation as supported (is_triton_supported returns True). Device visibility is controlled through MLU_VISIBLE_DEVICES, and export_envs lists NEUWARE_HOME, CNCL, LD_LIBRARY, and PATH as the environment variable names to export. Op builders are loaded lazily from op_builder.mlu: on first use, inspect.getmembers scans the module for builder classes and the result is cached in class_dict.
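The lazy op-builder scan described above can be illustrated without any Cambricon software installed. The following is a minimal sketch, not the DeepSpeed source: the in-memory module, the `LazyBuilderRegistry` class, and the builder class names are all hypothetical stand-ins, and the `endswith("Builder")` filter is an assumption about how builder classes are recognized.

```python
import inspect
import types


class OpBuilder:
    """Hypothetical stand-in for a builder base class."""
    NAME = None


class FusedAdamBuilder(OpBuilder):
    NAME = "fused_adam"


class CPUAdamBuilder(OpBuilder):
    NAME = "cpu_adam"


def make_fake_builder_module():
    """Build an in-memory module that mimics a package exporting builder classes."""
    module = types.ModuleType("fake_op_builder")
    module.OpBuilder = OpBuilder
    module.FusedAdamBuilder = FusedAdamBuilder
    module.CPUAdamBuilder = CPUAdamBuilder
    module.helper_value = 42  # not a class, so the scan skips it
    return module


class LazyBuilderRegistry:
    """Hypothetical registry: scans the module only on the first lookup."""

    def __init__(self, module_factory):
        self._module_factory = module_factory
        self.class_dict = None  # stays None until a builder is requested

    def _lazy_load(self):
        if self.class_dict is not None:
            return  # already scanned, reuse the cached mapping
        module = self._module_factory()
        # Assumption: builder classes are identified by a "Builder" name suffix.
        self.class_dict = {
            name: cls
            for name, cls in inspect.getmembers(module, inspect.isclass)
            if name.endswith("Builder")
        }

    def get_op_builder(self, class_name):
        self._lazy_load()
        return self.class_dict.get(class_name)


registry = LazyBuilderRegistry(make_fake_builder_module)
builder_cls = registry.get_op_builder("FusedAdamBuilder")
```

Deferring the scan keeps construction of the accelerator object cheap and avoids importing builder modules that a given run may never need.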
Usage
Use when training on Cambricon MLU accelerators. Requires torch_mlu to be installed. Set DS_ACCELERATOR=mlu to explicitly select this backend.
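Because the backend requires torch_mlu, it can be useful to check for the extension before selecting the backend. The sketch below is one possible approach using only the standard library; the helper names `mlu_backend_available` and `select_mlu_if_available` are hypothetical and not part of DeepSpeed.

```python
import importlib.util
import os


def mlu_backend_available():
    """Return True if the torch_mlu extension can be found on the import path."""
    return importlib.util.find_spec("torch_mlu") is not None


def select_mlu_if_available(environ=None):
    """Set DS_ACCELERATOR=mlu in the given mapping only when torch_mlu is present."""
    environ = os.environ if environ is None else environ
    if mlu_backend_available():
        environ["DS_ACCELERATOR"] = "mlu"
        return True
    return False


# Passing a plain dict keeps the demonstration free of side effects on os.environ.
demo_env = {}
selected = select_mlu_if_available(demo_env)
```

In a real script the variable would need to be set in os.environ before the first call to get_accelerator().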
Code Reference
Source Location
- Repository: DeepSpeed
- File: accelerator/mlu_accelerator.py
Signature
class MLU_Accelerator(DeepSpeedAccelerator):

    def __init__(self):
        self._name = 'mlu'
        self._communication_backend_name = 'cncl'
        self._compile_backend = "inductor"
        self.class_dict = None

    def is_synchronized_device(self):
        return False

    def device_name(self, device_index=None):
        if device_index is None:
            return 'mlu'
        return f'mlu:{device_index}'

    def device(self, device_index=None):
        return torch.mlu.device(device_index)

    def synchronize(self, device_index=None):
        return torch.mlu.synchronize(device_index)

    def is_bf16_supported(self):
        return torch.mlu.is_bf16_supported()

    def is_triton_supported(self):
        return True

    def create_graph(self):
        return torch.mlu.MLUGraph()

    def visible_devices_envs(self):
        return ['MLU_VISIBLE_DEVICES']

    def export_envs(self):
        return ['NEUWARE_HOME', 'CNCL', 'LD_LIBRARY', 'PATH']
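The entries returned by export_envs read like name prefixes rather than complete variable names (for example, LD_LIBRARY would cover LD_LIBRARY_PATH). The sketch below shows how a launcher might use such a list, assuming prefix matching; that matching rule, the helper name `collect_exported_env`, the sample paths, and the `CNCL_LOG_LEVEL` variable are illustrative assumptions, not confirmed behavior.

```python
def collect_exported_env(environ, export_prefixes):
    """Return the variables whose names start with any of the given prefixes."""
    return {
        key: value
        for key, value in environ.items()
        if any(key.startswith(prefix) for prefix in export_prefixes)
    }


# Same list as export_envs() in the signature above.
export_envs = ['NEUWARE_HOME', 'CNCL', 'LD_LIBRARY', 'PATH']

# Illustrative environment; the values and CNCL_LOG_LEVEL are hypothetical.
sample_env = {
    "NEUWARE_HOME": "/usr/local/neuware",
    "CNCL_LOG_LEVEL": "INFO",
    "LD_LIBRARY_PATH": "/usr/local/neuware/lib64",
    "PATH": "/usr/bin",
    "HOME": "/root",
}

exported = collect_exported_env(sample_env, export_envs)
```

Under this assumption HOME is left out because it matches none of the prefixes, while any variable beginning with PATH would be included.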
Import
from deepspeed.accelerator.mlu_accelerator import MLU_Accelerator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| device_index | int | Optional | MLU device index |
| seed | int | Required | Random seed for MLU RNG |
| graph | MLUGraph | Required | Graph to capture/replay |
Outputs
| Name | Type | Description |
|---|---|---|
| device | torch.device | MLU device object |
| device_count | int | Number of MLU devices |
| memory_bytes | int | MLU memory in bytes |
| communication_backend | str | Always 'cncl' |
Usage Examples
# Select the MLU accelerator (set DS_ACCELERATOR before the first get_accelerator() call)
import os
os.environ['DS_ACCELERATOR'] = 'mlu'
from deepspeed.accelerator import get_accelerator
accelerator = get_accelerator()
print(f"Device: {accelerator.device_name()}") # 'mlu'
print(f"Backend: {accelerator.communication_backend_name()}") # 'cncl'
# Device management
print(f"Device count: {accelerator.device_count()}")
accelerator.set_device(0)
print(f"Current device: {accelerator.current_device_name()}")
# Precision support
print(f"BF16: {accelerator.is_bf16_supported()}")
print(f"FP16: {accelerator.is_fp16_supported()}")
print(f"Triton: {accelerator.is_triton_supported()}") # True
# Memory operations
total = accelerator.total_memory(0)
allocated = accelerator.memory_allocated(0)
available = accelerator.available_memory(0)
print(f"Memory: {allocated}/{total} bytes")
# MLU graphs (model and input_tensor are assumed to be defined elsewhere, on the MLU device)
graph = accelerator.create_graph()
with accelerator.capture_to_graph(graph):
    output = model(input_tensor)
accelerator.replay_graph(graph)
# Profiling
accelerator.range_push("forward_pass")
output = model(input_tensor)
accelerator.range_pop()
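The memory calls used above return raw byte counts, which are easier to read once converted. The following sketch formats them in GiB and can be tried without MLU hardware by using a stub; `StubAccelerator` and `memory_summary` are hypothetical names, and the stub's rule that available memory equals total minus allocated is a simplifying assumption rather than what a real device reports.

```python
class StubAccelerator:
    """Hypothetical stand-in exposing the three memory calls used in the examples."""

    def __init__(self, total, allocated):
        self._total = total
        self._allocated = allocated

    def total_memory(self, device_index=None):
        return self._total

    def memory_allocated(self, device_index=None):
        return self._allocated

    def available_memory(self, device_index=None):
        # Simplifying assumption for the stub only.
        return self._total - self._allocated


def memory_summary(accelerator, device_index=0):
    """Format allocated, available, and total memory for one device in GiB."""
    total = accelerator.total_memory(device_index)
    allocated = accelerator.memory_allocated(device_index)
    available = accelerator.available_memory(device_index)
    gib = 1024 ** 3
    return (f"device {device_index}: allocated {allocated / gib:.2f} GiB, "
            f"available {available / gib:.2f} GiB, total {total / gib:.2f} GiB")


stub = StubAccelerator(total=8 * 1024 ** 3, allocated=2 * 1024 ** 3)
summary = memory_summary(stub)
```

On a machine with MLU hardware, the object returned by get_accelerator() could be passed in place of the stub, since the helper relies only on those three methods.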
Related Pages
- Abstract Accelerator - Base interface
- Real Accelerator - Accelerator selection
- NPU Accelerator - Huawei alternative