Principle: Alibaba ROLL Hardware Platform Abstraction
| Knowledge Sources | Details |
|---|---|
| Domains | Hardware_Abstraction, Distributed_Computing |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
An abstraction layer that standardizes device metadata and operations across heterogeneous accelerator types, enabling the training system to run on different hardware without code changes.
Description
Modern AI training infrastructure may use different hardware accelerators: NVIDIA GPUs (CUDA), AMD GPUs (ROCm), or Huawei Ascend NPUs (HCCL). Each accelerator has its own PyTorch backend, communication library, device visibility environment variable, and runtime API. Without an abstraction layer, hardware-specific logic would be scattered throughout the codebase in conditional branches.
This principle defines a Platform base class that standardizes:
- Device Metadata: Each platform declares its device name, PyTorch device type (cuda, npu), dispatch key (CUDA, PrivateUse1), and communication backend (nccl, hccl). This metadata is used for initializing distributed process groups, configuring Ray for multi-node scheduling, and setting device visibility.
- Lazy Attribute Delegation: When an attribute is not found on the Platform instance, the __getattr__ fallback looks it up on the corresponding torch.<device_type> module. This means platform-agnostic code can call current_platform.set_device(), current_platform.device_count(), or current_platform.get_rng_state() without knowing the underlying device type.
- Platform-Specific Operations: Each concrete platform implements operations that differ across hardware: clearing cuBLAS workspaces, configuring memory allocators, providing the correct vLLM worker class, and setting runtime environment variables.
- Automatic Detection: At import time, the platform module probes available hardware (torch.cuda.is_available(), torch_npu import) and instantiates the appropriate Platform subclass as a module-level singleton. All downstream code references this singleton.
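The four behaviors above can be sketched as a small base class. This is an illustrative sketch, not the actual ROLL API: the class and registry names are hypothetical, and a stub stands in for the `torch.<device_type>` module so the example runs without PyTorch installed.

```python
import types

class Platform:
    """Illustrative sketch of a platform base class (names are hypothetical)."""
    # Device metadata each concrete platform declares.
    device_name = "cpu"
    device_type = "cpu"             # PyTorch device type, e.g. "cuda", "npu"
    dispatch_key = "CPU"            # e.g. "CUDA", "PrivateUse1"
    communication_backend = "gloo"  # e.g. "nccl", "hccl"

    def _device_module(self):
        # The real system would resolve torch.<device_type> here; a
        # registry of stub modules keeps this sketch self-contained.
        return DEVICE_MODULES[self.device_type]

    def __getattr__(self, key):
        # Lazy delegation: only called for attributes not found on the
        # instance, so declared metadata is never shadowed.
        device_module = self._device_module()
        if hasattr(device_module, key):
            return getattr(device_module, key)
        return None  # the real implementation logs a warning first

# Stand-in for torch.cpu / torch.cuda, for demonstration only.
DEVICE_MODULES = {
    "cpu": types.SimpleNamespace(device_count=lambda: 1),
}

current_platform = Platform()  # module-level singleton
print(current_platform.device_count())  # delegated to the device module
```

Because `__getattr__` is only invoked when normal attribute lookup fails, the declared metadata and any explicitly implemented operations always take precedence over delegation.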
Usage
Use this principle when:
- The training system must run on multiple accelerator types (NVIDIA, AMD, Huawei Ascend) without hardware-specific branching in the training logic.
- You need a single API for device management operations (set device, get RNG state, clear workspaces) that works across all supported platforms.
- The distributed scheduling system (e.g., Ray) requires platform-specific environment variables for device visibility and process placement.
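As a sketch of the third point, a scheduler can write the platform's device-visibility variable before spawning workers; the helper name below is hypothetical, and the variable names match the interface contract table in the next section.

```python
import os

def set_visible_devices(device_control_env_var: str, device_ids: list) -> None:
    """Restrict a worker process to its assigned accelerators by setting
    the platform's visibility variable (hypothetical helper)."""
    # e.g. CUDA_VISIBLE_DEVICES="0,1" on NVIDIA/AMD,
    #      ASCEND_RT_VISIBLE_DEVICES="0,1" on Huawei Ascend.
    os.environ[device_control_env_var] = ",".join(str(d) for d in device_ids)

set_visible_devices("CUDA_VISIBLE_DEVICES", [0, 1])
```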
Theoretical Basis
Platform detection priority:
    import torch

    def detect_platform():
        if torch.cuda.is_available():
            # ROCm builds of PyTorch also report CUDA availability, so the
            # device name distinguishes NVIDIA from AMD.
            device_name = torch.cuda.get_device_name()
            if "NVIDIA" in device_name:
                return CudaPlatform()    # NCCL backend
            elif "AMD" in device_name:
                return RocmPlatform()    # RCCL backend
            return UnknownPlatform()
        try:
            import torch_npu  # noqa: F401 -- Ascend runtime present
            return NpuPlatform()         # HCCL backend
        except ImportError:
            return CpuPlatform()         # Gloo backend
Platform interface contract:
| Property | CUDA | ROCm | Ascend NPU |
|---|---|---|---|
| device_type | cuda | cuda | npu |
| dispatch_key | CUDA | CUDA | PrivateUse1 |
| ray_device_key | GPU | GPU | NPU |
| communication_backend | nccl | nccl | hccl |
| device_control_env_var | CUDA_VISIBLE_DEVICES | CUDA_VISIBLE_DEVICES | ASCEND_RT_VISIBLE_DEVICES |
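The contract table can be encoded as per-platform metadata records; `PlatformSpec` and the constant names below are illustrative, not ROLL's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlatformSpec:
    """One row of the platform interface contract (illustrative)."""
    device_type: str
    dispatch_key: str
    ray_device_key: str
    communication_backend: str
    device_control_env_var: str

# Values transcribed from the contract table above.
CUDA = PlatformSpec("cuda", "CUDA", "GPU", "nccl", "CUDA_VISIBLE_DEVICES")
ROCM = PlatformSpec("cuda", "CUDA", "GPU", "nccl", "CUDA_VISIBLE_DEVICES")
NPU = PlatformSpec("npu", "PrivateUse1", "NPU", "hccl",
                   "ASCEND_RT_VISIBLE_DEVICES")
```

Note that ROCm deliberately mirrors the CUDA row: PyTorch's ROCm build exposes AMD GPUs through the `cuda` device type, so only runtime behavior (e.g. workspace handling) differs between the two.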
Lazy delegation pattern:
    def __getattr__(self, key):
        # Resolve the torch device module (torch.cuda, torch.npu, ...) from
        # the platform's declared device type.
        device_module = getattr(torch, self.device_type)
        if hasattr(device_module, key):
            return getattr(device_module, key)
        logger.warning("%s has no attribute %r", device_module.__name__, key)
        return None
This allows code like current_platform.device_count() to transparently call torch.cuda.device_count() or torch.npu.device_count() depending on the detected platform.