
Environment:Alibaba ROLL ROCm GPU Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, GPU_Computing
Last Updated: 2026-02-07 19:00 GMT

Overview

AMD ROCm GPU environment with RCCL communication backend and HIP device management for running ROLL on AMD GPUs.

Description

This environment provides GPU-accelerated context for ROLL on AMD hardware using ROCm (Radeon Open Compute). The platform uses `HIP_VISIBLE_DEVICES` for device control and RCCL (ROCm Communication Collectives Library, compatible with NCCL API) for distributed communication. The `RocmPlatform` is auto-detected when `torch.cuda.get_device_name()` contains "AMD". ROCm-specific vLLM optimizations are enabled via `VLLM_ROCM_USE_AITER`, `VLLM_ROCM_USE_AITER_MOE`, and `VLLM_ROCM_USE_AITER_PAGED_ATTN` environment variables. Note that vLLM V1 mode is disabled on ROCm (`VLLM_USE_V1=0`).
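The auto-detection described above keys off the device-name string reported by PyTorch. A minimal sketch of that dispatch logic (the function name `select_platform` and the string labels are illustrative, not ROLL's actual API):

```python
# Sketch of ROLL's platform auto-detection: the platform is chosen by
# substring-matching the name from torch.cuda.get_device_name().
def select_platform(device_name: str) -> str:
    """Return a platform label based on the reported GPU device name."""
    if "AMD" in device_name:
        return "rocm"   # corresponds to RocmPlatform in ROLL
    if "NVIDIA" in device_name:
        return "cuda"
    return "unknown"

print(select_platform("AMD Instinct MI300X"))  # rocm
```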

Usage

Use this environment when running ROLL on AMD GPU hardware (e.g., MI250, MI300). This is a secondary supported platform with limited backend support compared to NVIDIA CUDA. Pre-built Docker images are strongly recommended.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (Ubuntu 22.04+) | ROCm requires specific kernel versions |
| Hardware | AMD GPU with ROCm support | MI250, MI300 series recommended |
| ROCm | >= 6.3.4 | Required for compatibility |
| Disk | 50GB+ SSD | For model checkpoints and datasets |

Dependencies

System Packages

  • `rocm` >= 6.3.4
  • `rccl` (ROCm communication library)
  • `hipblas` (HIP BLAS library)

Python Packages

  • `torch` == 2.8.0 (ROCm build)
  • `vllm` >= 0.8.4 (ROCm build)
  • `deepspeed` == 0.16.4
  • `ray[default,cgraph]` == 2.48.0
  • All common dependencies from `requirements_common.txt`

Environment Variables

  • `HIP_VISIBLE_DEVICES`: AMD GPU device visibility (set internally by ROLL)
  • `RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`: Prevents Ray from overriding device visibility
  • `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES`: Prevents ROCr device override
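Although ROLL sets these internally, they can also be exported explicitly before launching a job so that Ray does not rewrite the device list. A minimal sketch (the specific device IDs are illustrative):

```shell
# Keep Ray from overriding AMD device visibility for its workers.
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1

# Expose four AMD GPUs to this process (device IDs are illustrative).
export HIP_VISIBLE_DEVICES=0,1,2,3
```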

Quick Install

# Using pre-built Docker image (strongly recommended for ROCm)
docker pull rlsys/roll_opensource
docker run -dit --device=/dev/kfd --device=/dev/dri --ipc=host \
  rlsys/roll_opensource
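Once inside the container, it is worth confirming that the installed PyTorch is actually a ROCm build. A hedged check, relying on the fact that `torch.version.hip` is set on ROCm builds and `None` on CUDA/CPU builds (the helper name is illustrative):

```python
def check_rocm_build() -> str:
    """Report whether the installed PyTorch is a ROCm (HIP) build."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    hip = getattr(torch.version, "hip", None)  # None on CUDA/CPU builds
    return f"rocm-{hip}" if hip else "cuda-or-cpu"

print(check_rocm_build())
```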

Code Evidence

Platform detection from `roll/platforms/__init__.py:34-36`:

elif "AMD" in device_name:
    logger.debug("Initializing ROCm platform (AMD).")
    return RocmPlatform()

ROCm-specific environment variables from `roll/platforms/rocm.py:32-51`:

@classmethod
def get_custom_env_vars(cls) -> dict:
    env_vars = {
        "VLLM_ROCM_USE_AITER": "1",
        "VLLM_ROCM_USE_AITER_MOE": "1",
        "VLLM_ROCM_USE_AITER_PAGED_ATTN": "1",
        "VLLM_USE_V1": "0",
        "PYTORCH_HIP_ALLOC_CONF": "expandable_segments:True",
    }
    return env_vars
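As a sketch of how such platform defaults could take effect, the returned mapping can simply be merged into `os.environ` before workers spawn. This mirrors the values quoted above; the merge itself is illustrative, not ROLL's exact call site:

```python
import os

# ROCm-specific vLLM settings, mirroring RocmPlatform.get_custom_env_vars().
ROCM_ENV_VARS = {
    "VLLM_ROCM_USE_AITER": "1",
    "VLLM_ROCM_USE_AITER_MOE": "1",
    "VLLM_ROCM_USE_AITER_PAGED_ATTN": "1",
    "VLLM_USE_V1": "0",                # vLLM V1 engine is disabled on ROCm
    "PYTORCH_HIP_ALLOC_CONF": "expandable_segments:True",
}

# Apply to the current process so child workers inherit them.
os.environ.update(ROCM_ENV_VARS)
```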

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `RuntimeError: vLLM is not installed or not properly configured` | vLLM ROCm build not installed | Use pre-built Docker image with ROCm vLLM |
| RCCL communication failures | RCCL misconfiguration | Enable RCCL debug: `NCCL_DEBUG=INFO NCCL_DEBUG_FILE=rccl.%h.%p.log` |
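Because RCCL is API-compatible with NCCL, it honors NCCL's debug variables. The per-host, per-process log pattern from the table can be set like this before launching the job:

```shell
# RCCL reuses NCCL's debug env vars; %h expands to hostname, %p to pid.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_FILE=rccl.%h.%p.log
```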

Compatibility Notes

  • vLLM V1: Disabled on ROCm (`VLLM_USE_V1=0`). Only V0 engine is used.
  • SGLang: Not tested on ROCm.
  • Megatron: Not tested on ROCm.
  • Flash Attention: ROCm uses AITER kernels instead.
  • Pre-built images: Strongly recommended; building from source is difficult.
