Implementation:Deepspeedai DeepSpeed MLU Accelerator

Knowledge Sources	DeepSpeed
Domains	Accelerator, Cambricon Backend
Last Updated	2026-02-09 00:00 GMT

Overview

Cambricon MLU (Machine Learning Unit) accelerator backend enabling DeepSpeed training on Cambricon hardware.

Description

The MLU_Accelerator class implements the DeepSpeedAccelerator interface for Cambricon MLU AI accelerators. It wraps the torch.mlu APIs provided by the torch_mlu extension, uses cncl (Cambricon's collective communication library, the MLU counterpart to NCCL) as the communication backend, and uses inductor as the compile backend. All standard device, memory, RNG, and stream/event operations delegate to their torch.mlu equivalents.

The implementation supports MLU graphs via torch.mlu.MLUGraph() and CNPX profiling ranges, and it reports Triton JIT compilation as supported (is_triton_supported returns True). Device visibility is controlled through MLU_VISIBLE_DEVICES, and export_envs lists NEUWARE_HOME, CNCL, LD_LIBRARY, and PATH as the environment variables to forward. Op builders are loaded lazily from op_builder.mlu: on first use, inspect.getmembers scans the module for builder classes and the result is cached in class_dict.
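
The lazy builder lookup described above can be sketched as follows. This is a minimal illustration rather than the DeepSpeed source: the stand-in module, the OpBuilder base class, the LazyBuilderRegistry name, and the "ends with Builder" filter are all assumptions made so the example runs without torch_mlu installed.

```python
import inspect
import types


class OpBuilder:
    """Hypothetical stand-in base class; DeepSpeed's real builders differ."""


class FusedAdamBuilder(OpBuilder):
    pass


class CPUAdamBuilder(OpBuilder):
    pass


def make_fake_builder_module():
    # Hypothetical module standing in for op_builder.mlu.
    module = types.ModuleType("fake_op_builder_mlu")
    module.OpBuilder = OpBuilder
    module.FusedAdamBuilder = FusedAdamBuilder
    module.CPUAdamBuilder = CPUAdamBuilder
    module.some_constant = 42  # non-class members are skipped by the scan
    return module


class LazyBuilderRegistry:
    def __init__(self, module_factory):
        self._module_factory = module_factory
        self.class_dict = None  # stays None until the first lookup

    def _lazy_call_builder(self):
        # Scan the module only once, on first use.
        if self.class_dict is not None:
            return
        module = self._module_factory()
        self.class_dict = {
            name: cls
            for name, cls in inspect.getmembers(module, inspect.isclass)
            # Assumed filter: keep concrete builders, drop the base class.
            if name.endswith("Builder") and name != "OpBuilder"
        }

    def get_op_builder(self, class_name):
        self._lazy_call_builder()
        return self.class_dict.get(class_name)


registry = LazyBuilderRegistry(make_fake_builder_module)
print(registry.class_dict)  # None: nothing has been scanned yet
print(registry.get_op_builder("FusedAdamBuilder").__name__)
print(sorted(registry.class_dict))
```

Deferring the scan keeps the cost of importing the builder module out of accelerator construction, which matters when the accelerator object is created early during process start-up.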

Usage

Use when training on Cambricon MLU accelerators. Requires torch_mlu to be installed. Set DS_ACCELERATOR=mlu to explicitly select this backend.
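
Because the backend depends on torch_mlu, a script may want to confirm that the extension is importable before forcing the selection. The helper below is a hedged sketch: select_mlu_if_available is a hypothetical name, not a DeepSpeed function, and it only sets the environment variable. It should run before DeepSpeed selects an accelerator.

```python
import importlib.util
import os


def select_mlu_if_available(env):
    """Set DS_ACCELERATOR=mlu in `env` when torch_mlu is importable.

    Returns True if the variable was set, False otherwise.
    """
    # find_spec returns None for a missing top-level package instead of raising.
    if importlib.util.find_spec("torch_mlu") is None:
        return False
    env["DS_ACCELERATOR"] = "mlu"
    return True


if select_mlu_if_available(os.environ):
    print("DS_ACCELERATOR set to mlu")
else:
    print("torch_mlu not found; leaving accelerator selection unchanged")
```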

Code Reference

Source Location

Repository: DeepSpeed
File: accelerator/mlu_accelerator.py

Signature

class MLU_Accelerator(DeepSpeedAccelerator):
    def __init__(self):
        self._name = 'mlu'
        self._communication_backend_name = 'cncl'
        self._compile_backend = "inductor"
        self.class_dict = None

    def is_synchronized_device(self):
        return False

    def device_name(self, device_index=None):
        if device_index is None:
            return 'mlu'
        return f'mlu:{device_index}'

    def device(self, device_index=None):
        return torch.mlu.device(device_index)

    def synchronize(self, device_index=None):
        return torch.mlu.synchronize(device_index)

    def is_bf16_supported(self):
        return torch.mlu.is_bf16_supported()

    def is_triton_supported(self):
        return True

    def create_graph(self):
        return torch.mlu.MLUGraph()

    def visible_devices_envs(self):
        return ['MLU_VISIBLE_DEVICES']

    def export_envs(self):
        return ['NEUWARE_HOME', 'CNCL', 'LD_LIBRARY', 'PATH']
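
The two environment-related methods above return plain lists of names, so their effect depends on how a launcher consumes them. The sketch below shows one plausible consumption pattern. It assumes that visible-device variables receive a comma-separated list of local device ids and that export names are matched as prefixes (which would explain why 'LD_LIBRARY' covers LD_LIBRARY_PATH). Both helper functions and the CNCL_DEBUG variable are illustrative and not taken from the source.

```python
def set_visible_devices_envs(current_env, local_accelerator_ids,
                             env_names=("MLU_VISIBLE_DEVICES",)):
    # Write the selected device ids into every visibility variable.
    for name in env_names:
        current_env[name] = ",".join(map(str, local_accelerator_ids))
    return current_env


def select_exported_vars(environ,
                         export_prefixes=("NEUWARE_HOME", "CNCL", "LD_LIBRARY", "PATH")):
    # Assumption: a variable is forwarded when its name starts with any listed prefix.
    return {
        key: value
        for key, value in environ.items()
        if any(key.startswith(prefix) for prefix in export_prefixes)
    }


env = set_visible_devices_envs({}, [0, 2, 3])
print(env)

sample_environ = {
    "LD_LIBRARY_PATH": "/x",
    "CNCL_DEBUG": "1",  # illustrative variable name
    "HOME": "/root",
    "PATH": "/bin",
}
print(select_exported_vars(sample_environ))
```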

Import

from deepspeed.accelerator.mlu_accelerator import MLU_Accelerator

I/O Contract

Inputs

Name	Type	Required	Description
device_index	int	Optional	MLU device index
seed	int	Required	Random seed for MLU RNG
graph	MLUGraph	Required	Graph to capture/replay

Outputs

Name	Type	Description
device	torch.device	MLU device object
device_count	int	Number of MLU devices
memory_bytes	int	MLU memory in bytes
communication_backend	str	Always 'cncl'

Usage Examples

# Set MLU accelerator
import os
os.environ['DS_ACCELERATOR'] = 'mlu'

from deepspeed.accelerator import get_accelerator
accelerator = get_accelerator()

print(f"Device: {accelerator.device_name()}")  # 'mlu'
print(f"Backend: {accelerator.communication_backend_name()}")  # 'cncl'

# Device management
print(f"Device count: {accelerator.device_count()}")
accelerator.set_device(0)
print(f"Current device: {accelerator.current_device_name()}")

# Precision support
print(f"BF16: {accelerator.is_bf16_supported()}")
print(f"FP16: {accelerator.is_fp16_supported()}")
print(f"Triton: {accelerator.is_triton_supported()}")  # True

# Memory operations (all values are byte counts for device 0)
total = accelerator.total_memory(0)
allocated = accelerator.memory_allocated(0)
available = accelerator.available_memory(0)
print(f"Memory: {allocated}/{total} bytes allocated, {available} bytes available")

# MLU graphs (assumes `model` and `input_tensor` already exist on the MLU device)
graph = accelerator.create_graph()
with accelerator.capture_to_graph(graph):
    output = model(input_tensor)
accelerator.replay_graph(graph)

# Profiling (wraps the forward pass in a CNPX range)
accelerator.range_push("forward_pass")
output = model(input_tensor)
accelerator.range_pop()
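
The memory queries in the example return raw byte counts, which are hard to read at a glance. A small formatting helper, sketched here, can turn them into a one-line report. memory_summary and FakeAccelerator are hypothetical names; the fake object only mimics the total_memory and memory_allocated methods used above so that the sketch runs without MLU hardware.

```python
def memory_summary(accelerator, device_index=0):
    """Format allocated vs. total memory for one device in GiB."""
    total = accelerator.total_memory(device_index)
    allocated = accelerator.memory_allocated(device_index)
    gib = 1024 ** 3
    percent = 100.0 * allocated / total if total else 0.0
    return (f"device {device_index}: {allocated / gib:.2f} / {total / gib:.2f} GiB "
            f"allocated ({percent:.1f}%)")


class FakeAccelerator:
    """Stand-in exposing only the two methods the helper needs."""

    def total_memory(self, device_index=None):
        return 8 * 1024 ** 3

    def memory_allocated(self, device_index=None):
        return 2 * 1024 ** 3


print(memory_summary(FakeAccelerator()))
```

With a real backend, the object returned by get_accelerator() would be passed in place of FakeAccelerator().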

Related Pages

Abstract Accelerator - Base interface
Real Accelerator - Accelerator selection
NPU Accelerator - Huawei alternative
