
Environment:Alibaba ROLL ROCm GPU Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, GPU_Computing
Last Updated: 2026-02-07 19:00 GMT

Overview

AMD ROCm GPU environment with RCCL communication backend and HIP device management for running ROLL on AMD GPUs.

Description

This environment provides GPU-accelerated context for ROLL on AMD hardware using ROCm (Radeon Open Compute). The platform uses `HIP_VISIBLE_DEVICES` for device control and RCCL (ROCm Communication Collectives Library, compatible with NCCL API) for distributed communication. The `RocmPlatform` is auto-detected when `torch.cuda.get_device_name()` contains "AMD". ROCm-specific vLLM optimizations are enabled via `VLLM_ROCM_USE_AITER`, `VLLM_ROCM_USE_AITER_MOE`, and `VLLM_ROCM_USE_AITER_PAGED_ATTN` environment variables. Note that vLLM V1 mode is disabled on ROCm (`VLLM_USE_V1=0`).
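The auto-detection described above keys off the device-name string reported by PyTorch. A minimal sketch of that dispatch logic (the function name `select_platform` and the string labels are illustrative, not ROLL's actual API):

```python
# Sketch of ROLL's platform auto-detection: the platform is chosen by
# substring-matching the name from torch.cuda.get_device_name().
def select_platform(device_name: str) -> str:
    """Return a platform label based on the reported GPU device name."""
    if "AMD" in device_name:
        return "rocm"   # corresponds to RocmPlatform in ROLL
    if "NVIDIA" in device_name:
        return "cuda"
    return "unknown"

print(select_platform("AMD Instinct MI300X"))  # rocm
```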

Usage

Use this environment when running ROLL on AMD GPU hardware (e.g., MI250, MI300). This is a secondary supported platform with limited backend support compared to NVIDIA CUDA. Pre-built Docker images are strongly recommended.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (Ubuntu 22.04+) | ROCm requires specific kernel versions |
| Hardware | AMD GPU with ROCm support | MI250, MI300 series recommended |
| ROCm | >= 6.3.4 | Required for compatibility |
| Disk | 50GB+ SSD | For model checkpoints and datasets |

Dependencies

System Packages

  • `rocm` >= 6.3.4
  • `rccl` (ROCm communication library)
  • `hipblas` (HIP BLAS library)

Python Packages

  • `torch` == 2.8.0 (ROCm build)
  • `vllm` >= 0.8.4 (ROCm build)
  • `deepspeed` == 0.16.4
  • `ray[default,cgraph]` == 2.48.0
  • All common dependencies from `requirements_common.txt`

Environment Variables

  • `HIP_VISIBLE_DEVICES`: AMD GPU device visibility (set internally by ROLL)
  • `RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`: Prevents Ray from overriding device visibility
  • `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES`: Prevents ROCr device override
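Although ROLL sets these internally, they can also be exported explicitly before launching a job so that Ray does not rewrite the device list. A minimal sketch (the specific device IDs are illustrative):

```shell
# Keep Ray from overriding AMD device visibility for its workers.
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1

# Expose four AMD GPUs to this process (device IDs are illustrative).
export HIP_VISIBLE_DEVICES=0,1,2,3
```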

Quick Install

# Using pre-built Docker image (strongly recommended for ROCm)
docker pull rlsys/roll_opensource
docker run -dit --device=/dev/kfd --device=/dev/dri --ipc=host \
  rlsys/roll_opensource
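Once inside the container, it is worth confirming that the installed PyTorch is actually a ROCm build. A hedged check, relying on the fact that `torch.version.hip` is set on ROCm builds and `None` on CUDA/CPU builds (the helper name is illustrative):

```python
def check_rocm_build() -> str:
    """Report whether the installed PyTorch is a ROCm (HIP) build."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    hip = getattr(torch.version, "hip", None)  # None on CUDA/CPU builds
    return f"rocm-{hip}" if hip else "cuda-or-cpu"

print(check_rocm_build())
```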

Code Evidence

Platform detection from `roll/platforms/__init__.py:34-36`:

elif "AMD" in device_name:
    logger.debug("Initializing ROCm platform (AMD).")
    return RocmPlatform()

ROCm-specific environment variables from `roll/platforms/rocm.py:32-51`:

@classmethod
def get_custom_env_vars(cls) -> dict:
    env_vars = {
        "VLLM_ROCM_USE_AITER": "1",
        "VLLM_ROCM_USE_AITER_MOE": "1",
        "VLLM_ROCM_USE_AITER_PAGED_ATTN": "1",
        "VLLM_USE_V1": "0",
        "PYTORCH_HIP_ALLOC_CONF": "expandable_segments:True",
    }
    return env_vars
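As a sketch of how such platform defaults could take effect, the returned mapping can simply be merged into `os.environ` before workers spawn. This mirrors the values quoted above; the merge itself is illustrative, not ROLL's exact call site:

```python
import os

# ROCm-specific vLLM settings, mirroring RocmPlatform.get_custom_env_vars().
ROCM_ENV_VARS = {
    "VLLM_ROCM_USE_AITER": "1",
    "VLLM_ROCM_USE_AITER_MOE": "1",
    "VLLM_ROCM_USE_AITER_PAGED_ATTN": "1",
    "VLLM_USE_V1": "0",                # vLLM V1 engine is disabled on ROCm
    "PYTORCH_HIP_ALLOC_CONF": "expandable_segments:True",
}

# Apply to the current process so child workers inherit them.
os.environ.update(ROCM_ENV_VARS)
```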

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `RuntimeError: vLLM is not installed or not properly configured` | vLLM ROCm build not installed | Use pre-built Docker image with ROCm vLLM |
| RCCL communication failures | RCCL misconfiguration | Enable RCCL debug: `NCCL_DEBUG=INFO NCCL_DEBUG_FILE=rccl.%h.%p.log` |
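Because RCCL is API-compatible with NCCL, it honors NCCL's debug variables. The per-host, per-process log pattern from the table can be set like this before launching the job:

```shell
# RCCL reuses NCCL's debug env vars; %h expands to hostname, %p to pid.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_FILE=rccl.%h.%p.log
```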

Compatibility Notes

  • vLLM V1: Disabled on ROCm (`VLLM_USE_V1=0`). Only V0 engine is used.
  • SGLang: Not tested on ROCm.
  • Megatron: Not tested on ROCm.
  • Flash Attention: ROCm uses AITER kernels instead.
  • Pre-built images: Strongly recommended; building from source is difficult.
