
Principle:Alibaba ROLL Hardware Platform Abstraction

From Leeroopedia


Knowledge Sources
Domains Hardware_Abstraction, Distributed_Computing
Last Updated 2026-02-07 20:00 GMT

Overview

An abstraction layer that standardizes device metadata and operations across heterogeneous accelerator types, enabling the training system to run on different hardware without code changes.

Description

Modern AI training infrastructure may use different hardware accelerators: NVIDIA GPUs (CUDA), AMD GPUs (ROCm), or Huawei Ascend NPUs (HCCL). Each accelerator has its own PyTorch backend, communication library, device visibility environment variable, and runtime API. Without an abstraction layer, hardware-specific logic would be scattered throughout the codebase in conditional branches.

This principle defines a Platform base class that standardizes:

  1. Device Metadata: Each platform declares its device name, PyTorch device type (cuda, npu), dispatch key (CUDA, PrivateUse1), and communication backend (nccl, hccl). This metadata is used for initializing distributed process groups, configuring Ray for multi-node scheduling, and setting device visibility.
  2. Lazy Attribute Delegation: When an attribute is not found on the Platform instance, the __getattr__ fallback looks it up on the corresponding torch.<device_type> module. This means platform-agnostic code can call current_platform.set_device(), current_platform.device_count(), or current_platform.get_rng_state() without knowing the underlying device type.
  3. Platform-Specific Operations: Each concrete platform implements operations that differ across hardware: clearing cuBLAS workspaces, configuring memory allocators, providing the correct vLLM worker class, and setting runtime environment variables.
  4. Automatic Detection: At import time, the platform module probes available hardware (torch.cuda.is_available(), torch_npu import) and instantiates the appropriate Platform subclass as a module-level singleton. All downstream code references this singleton.
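As a rough sketch, the base class and lazy delegation described above might look as follows. This is illustrative rather than ROLL's actual code: the class and attribute names follow the description, and the device module is injected as a stand-in for torch.cuda / torch.npu so the sketch is self-contained.

```python
from types import SimpleNamespace

class Platform:
    """Standardized device metadata plus lazy delegation (illustrative sketch)."""
    device_type = "cpu"              # PyTorch device type, e.g. "cuda" or "npu"
    dispatch_key = "CPU"
    communication_backend = "gloo"
    device_control_env_var = ""

    def __init__(self, device_module):
        # In the real system this would be torch.<device_type>.
        self._device_module = device_module

    def __getattr__(self, key):
        # Lazy delegation: attributes not found on the Platform instance
        # fall through to the underlying device module (set_device,
        # device_count, get_rng_state, ...).
        if hasattr(self._device_module, key):
            return getattr(self._device_module, key)
        return None                  # the real code also logs a warning

class CudaPlatform(Platform):
    device_type = "cuda"
    dispatch_key = "CUDA"
    communication_backend = "nccl"
    device_control_env_var = "CUDA_VISIBLE_DEVICES"

# Usage with a fake device module standing in for torch.cuda:
fake_cuda = SimpleNamespace(device_count=lambda: 8)
current_platform = CudaPlatform(fake_cuda)
current_platform.device_count()      # delegates to the device module
```

Because metadata lives on class attributes while device operations are resolved through __getattr__, platform-agnostic training code only ever touches the current_platform singleton.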

Usage

Use this principle when:

  • The training system must run on multiple accelerator types (NVIDIA, AMD, Huawei Ascend) without hardware-specific branching in the training logic.
  • You need a single API for device management operations (set device, get RNG state, clear workspaces) that works across all supported platforms.
  • The distributed scheduling system (e.g., Ray) requires platform-specific environment variables for device visibility and process placement.

Theoretical Basis

Platform detection priority:

IF torch.cuda.is_available():
    device_name = torch.cuda.get_device_name()
    IF "NVIDIA" in device_name:
        platform = CudaPlatform()    # NCCL backend
    ELIF "AMD" in device_name:
        platform = RocmPlatform()    # RCCL backend
    ELSE:
        platform = UnknownPlatform()
ELIF torch_npu is importable:
    platform = NpuPlatform()         # HCCL backend
ELSE:
    platform = CpuPlatform()         # Gloo backend
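In runnable form, the same priority order might be sketched as below. The probe results are passed in explicitly so the sketch needs neither torch nor torch_npu; the function name is illustrative.

```python
def detect_platform(cuda_available: bool, device_name: str,
                    npu_importable: bool) -> str:
    """Return the platform class name implied by the detection priority."""
    if cuda_available:
        # torch.cuda.is_available() is True on both CUDA and ROCm builds of
        # PyTorch, so the device name is used to tell NVIDIA from AMD.
        if "NVIDIA" in device_name:
            return "CudaPlatform"    # NCCL backend
        if "AMD" in device_name:
            return "RocmPlatform"    # RCCL backend
        return "UnknownPlatform"
    if npu_importable:               # "import torch_npu" succeeded
        return "NpuPlatform"         # HCCL backend
    return "CpuPlatform"             # Gloo backend

detect_platform(True, "NVIDIA A100-SXM4-80GB", False)  # -> "CudaPlatform"
```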

Platform interface contract:

Property                CUDA                  ROCm                  Ascend NPU
device_type             cuda                  cuda                  npu
dispatch_key            CUDA                  CUDA                  PrivateUse1
ray_device_key          GPU                   GPU                   NPU
communication_backend   nccl                  nccl                  hccl
device_control_env_var  CUDA_VISIBLE_DEVICES  CUDA_VISIBLE_DEVICES  ASCEND_RT_VISIBLE_DEVICES
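Encoded as plain data, the contract table might look like this (PlatformSpec and PLATFORM_SPECS are hypothetical names; the values come straight from the table above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlatformSpec:
    device_type: str
    dispatch_key: str
    ray_device_key: str
    communication_backend: str
    device_control_env_var: str

PLATFORM_SPECS = {
    "CUDA": PlatformSpec("cuda", "CUDA", "GPU", "nccl",
                         "CUDA_VISIBLE_DEVICES"),
    "ROCm": PlatformSpec("cuda", "CUDA", "GPU", "nccl",
                         "CUDA_VISIBLE_DEVICES"),
    "Ascend NPU": PlatformSpec("npu", "PrivateUse1", "NPU", "hccl",
                               "ASCEND_RT_VISIBLE_DEVICES"),
}
```

Note that ROCm deliberately reuses the CUDA-facing values: PyTorch's ROCm build exposes AMD GPUs through the torch.cuda namespace and an NCCL-compatible (RCCL) backend.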

Lazy delegation pattern:

def __getattr__(self, key):
    # Resolve torch.<device_type>, e.g. torch.cuda or torch.npu
    device_module = getattr(torch, self.device_type)
    if hasattr(device_module, key):
        return getattr(device_module, key)
    logger.warning("%s has no attribute %r", device_module, key)
    return None

This allows code like current_platform.device_count() to transparently call torch.cuda.device_count() or torch.npu.device_count() depending on the detected platform.
