Principle: Alibaba ROLL Hardware Platform Abstraction
| Knowledge Sources | Details |
|---|---|
| Domains | Hardware_Abstraction, Distributed_Computing |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
An abstraction layer that standardizes device metadata and operations across heterogeneous accelerator types, enabling the training system to run on different hardware without code changes.
Description
Modern AI training infrastructure may use different hardware accelerators: NVIDIA GPUs (CUDA), AMD GPUs (ROCm), or Huawei Ascend NPUs (HCCL). Each accelerator has its own PyTorch backend, communication library, device visibility environment variable, and runtime API. Without an abstraction layer, hardware-specific logic would be scattered throughout the codebase in conditional branches.
This principle defines a Platform base class that standardizes:
- Device Metadata: Each platform declares its device name, PyTorch device type (cuda, npu), dispatch key (CUDA, PrivateUse1), and communication backend (nccl, hccl). This metadata is used for initializing distributed process groups, configuring Ray for multi-node scheduling, and setting device visibility.
- Lazy Attribute Delegation: When an attribute is not found on the Platform instance, the __getattr__ fallback looks it up on the corresponding torch.<device_type> module. This means platform-agnostic code can call current_platform.set_device(), current_platform.device_count(), or current_platform.get_rng_state() without knowing the underlying device type.
- Platform-Specific Operations: Each concrete platform implements operations that differ across hardware: clearing cuBLAS workspaces, configuring memory allocators, providing the correct vLLM worker class, and setting runtime environment variables.
- Automatic Detection: At import time, the platform module probes available hardware (torch.cuda.is_available(), torch_npu import) and instantiates the appropriate Platform subclass as a module-level singleton. All downstream code references this singleton.
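The four behaviors above can be sketched as a small base class. This is an illustrative sketch, not the actual ROLL API: the class and registry names are hypothetical, and a stub stands in for the `torch.<device_type>` module so the example runs without PyTorch installed.

```python
import types

class Platform:
    """Illustrative sketch of a platform base class (names are hypothetical)."""
    # Device metadata each concrete platform declares.
    device_name = "cpu"
    device_type = "cpu"             # PyTorch device type, e.g. "cuda", "npu"
    dispatch_key = "CPU"            # e.g. "CUDA", "PrivateUse1"
    communication_backend = "gloo"  # e.g. "nccl", "hccl"

    def _device_module(self):
        # The real system would resolve torch.<device_type> here; a
        # registry of stub modules keeps this sketch self-contained.
        return DEVICE_MODULES[self.device_type]

    def __getattr__(self, key):
        # Lazy delegation: only called for attributes not found on the
        # instance, so declared metadata is never shadowed.
        device_module = self._device_module()
        if hasattr(device_module, key):
            return getattr(device_module, key)
        return None  # the real implementation logs a warning first

# Stand-in for torch.cpu / torch.cuda, for demonstration only.
DEVICE_MODULES = {
    "cpu": types.SimpleNamespace(device_count=lambda: 1),
}

current_platform = Platform()  # module-level singleton
print(current_platform.device_count())  # delegated to the device module
```

Because `__getattr__` is only invoked when normal attribute lookup fails, the declared metadata and any explicitly implemented operations always take precedence over delegation.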
Usage
Use this principle when:
- The training system must run on multiple accelerator types (NVIDIA, AMD, Huawei Ascend) without hardware-specific branching in the training logic.
- You need a single API for device management operations (set device, get RNG state, clear workspaces) that works across all supported platforms.
- The distributed scheduling system (e.g., Ray) requires platform-specific environment variables for device visibility and process placement.
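As a sketch of the third point, a scheduler can write the platform's device-visibility variable before spawning workers; the helper name below is hypothetical, and the variable names match the interface contract table in the next section.

```python
import os

def set_visible_devices(device_control_env_var: str, device_ids: list) -> None:
    """Restrict a worker process to its assigned accelerators by setting
    the platform's visibility variable (hypothetical helper)."""
    # e.g. CUDA_VISIBLE_DEVICES="0,1" on NVIDIA/AMD,
    #      ASCEND_RT_VISIBLE_DEVICES="0,1" on Huawei Ascend.
    os.environ[device_control_env_var] = ",".join(str(d) for d in device_ids)

set_visible_devices("CUDA_VISIBLE_DEVICES", [0, 1])
```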
Theoretical Basis
Platform detection priority:
    import torch

    def detect_platform():
        if torch.cuda.is_available():
            # ROCm builds of PyTorch also report CUDA availability, so the
            # device name distinguishes NVIDIA from AMD.
            device_name = torch.cuda.get_device_name()
            if "NVIDIA" in device_name:
                return CudaPlatform()    # NCCL backend
            elif "AMD" in device_name:
                return RocmPlatform()    # RCCL backend
            return UnknownPlatform()
        try:
            import torch_npu  # noqa: F401 -- Ascend runtime present
            return NpuPlatform()         # HCCL backend
        except ImportError:
            return CpuPlatform()         # Gloo backend
Platform interface contract:
| Property | CUDA | ROCm | Ascend NPU |
|---|---|---|---|
| device_type | cuda | cuda | npu |
| dispatch_key | CUDA | CUDA | PrivateUse1 |
| ray_device_key | GPU | GPU | NPU |
| communication_backend | nccl | nccl | hccl |
| device_control_env_var | CUDA_VISIBLE_DEVICES | CUDA_VISIBLE_DEVICES | ASCEND_RT_VISIBLE_DEVICES |
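The contract table can be encoded as per-platform metadata records; `PlatformSpec` and the constant names below are illustrative, not ROLL's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlatformSpec:
    """One row of the platform interface contract (illustrative)."""
    device_type: str
    dispatch_key: str
    ray_device_key: str
    communication_backend: str
    device_control_env_var: str

# Values transcribed from the contract table above.
CUDA = PlatformSpec("cuda", "CUDA", "GPU", "nccl", "CUDA_VISIBLE_DEVICES")
ROCM = PlatformSpec("cuda", "CUDA", "GPU", "nccl", "CUDA_VISIBLE_DEVICES")
NPU = PlatformSpec("npu", "PrivateUse1", "NPU", "hccl",
                   "ASCEND_RT_VISIBLE_DEVICES")
```

Note that ROCm deliberately mirrors the CUDA row: PyTorch's ROCm build exposes AMD GPUs through the `cuda` device type, so only runtime behavior (e.g. workspace handling) differs between the two.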
Lazy delegation pattern:
    def __getattr__(self, key):
        # Resolve the torch device module (torch.cuda, torch.npu, ...) from
        # the platform's declared device type.
        device_module = getattr(torch, self.device_type)
        if hasattr(device_module, key):
            return getattr(device_module, key)
        logger.warning("%s has no attribute %r", device_module.__name__, key)
        return None
This allows code like current_platform.device_count() to transparently call torch.cuda.device_count() or torch.npu.device_count() depending on the detected platform.