Environment: Alibaba ROLL ROCm GPU Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Deep_Learning, GPU_Computing |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
AMD ROCm GPU environment with RCCL communication backend and HIP device management for running ROLL on AMD GPUs.
Description
This environment provides GPU-accelerated context for ROLL on AMD hardware using ROCm (Radeon Open Compute). The platform uses `HIP_VISIBLE_DEVICES` for device control and RCCL (ROCm Communication Collectives Library, compatible with NCCL API) for distributed communication. The `RocmPlatform` is auto-detected when `torch.cuda.get_device_name()` contains "AMD". ROCm-specific vLLM optimizations are enabled via `VLLM_ROCM_USE_AITER`, `VLLM_ROCM_USE_AITER_MOE`, and `VLLM_ROCM_USE_AITER_PAGED_ATTN` environment variables. Note that vLLM V1 mode is disabled on ROCm (`VLLM_USE_V1=0`).
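The auto-detection rule above can be sketched as a small predicate; `detect_platform` and the `"cuda"` fallback are illustrative names for this sketch, not ROLL's actual API:

```python
def detect_platform(device_name: str) -> str:
    """Pick a platform label from the torch device name (illustrative sketch)."""
    if "AMD" in device_name:
        # Mirrors the RocmPlatform auto-detection rule described above.
        return "rocm"
    return "cuda"

print(detect_platform("AMD Instinct MI300X"))  # → rocm
```

In ROLL itself the matching branch constructs and returns a `RocmPlatform` instance, as shown under Code Evidence below.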
Usage
Use this environment when running ROLL on AMD GPU hardware (e.g., MI250, MI300). ROCm is a secondary supported platform with narrower backend coverage than NVIDIA CUDA, so pre-built Docker images are strongly recommended.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 22.04+) | ROCm requires specific kernel versions |
| Hardware | AMD GPU with ROCm support | MI250, MI300 series recommended |
| ROCm | >= 6.3.4 | Required for compatibility |
| Disk | 50GB+ SSD | For model checkpoints and datasets |
Dependencies
System Packages
- `rocm` >= 6.3.4
- `rccl` (ROCm communication library)
- `hipblas` (HIP BLAS library)
Python Packages
- `torch` == 2.8.0 (ROCm build)
- `vllm` >= 0.8.4 (ROCm build)
- `deepspeed` == 0.16.4
- `ray[default,cgraph]` == 2.48.0
- All common dependencies from `requirements_common.txt`
Environment Variables
- `HIP_VISIBLE_DEVICES`: AMD GPU device visibility (set internally by ROLL)
- `RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`: Prevents Ray from overriding device visibility
- `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES`: Prevents ROCr device override
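To keep Ray from rewriting device visibility, these flags must be in the environment before `ray.init()` runs. A minimal sketch (the two-GPU device list is illustrative, not a ROLL default):

```python
import os

# Prevent Ray from overriding AMD device visibility; must be set before ray.init().
os.environ["RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES"] = "1"
os.environ["RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES"] = "1"

# Illustrative: expose two AMD GPUs to this process.
os.environ["HIP_VISIBLE_DEVICES"] = "0,1"
```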
Quick Install
```shell
# Using pre-built Docker image (strongly recommended for ROCm)
docker pull rlsys/roll_opensource
docker run -dit --device=/dev/kfd --device=/dev/dri --ipc=host \
  rlsys/roll_opensource
```
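Inside the container, a quick sanity check that a ROCm build of PyTorch can see a GPU; `has_rocm_torch` is a hypothetical helper for this sketch, and it returns `False` when torch or a visible GPU is absent:

```python
import importlib.util

def has_rocm_torch() -> bool:
    """True only when a ROCm build of torch is installed and sees a GPU."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    # ROCm builds report a HIP version string; CUDA builds leave this as None.
    return torch.version.hip is not None and torch.cuda.is_available()

print(has_rocm_torch())
```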
Code Evidence
Platform detection from `roll/platforms/__init__.py:34-36`:
```python
elif "AMD" in device_name:
    logger.debug("Initializing ROCm platform (AMD).")
    return RocmPlatform()
```
ROCm-specific environment variables from `roll/platforms/rocm.py:32-51`:
```python
@classmethod
def get_custom_env_vars(cls) -> dict:
    env_vars = {
        "VLLM_ROCM_USE_AITER": "1",
        "VLLM_ROCM_USE_AITER_MOE": "1",
        "VLLM_ROCM_USE_AITER_PAGED_ATTN": "1",
        "VLLM_USE_V1": "0",
        "PYTORCH_HIP_ALLOC_CONF": "expandable_segments:True",
    }
    return env_vars
```
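These defaults are meant to land in the worker process environment. A minimal sketch of applying such a dict (the standalone `rocm_env_vars` here just mirrors the snippet above; `setdefault` is one reasonable policy, not necessarily ROLL's):

```python
import os

rocm_env_vars = {
    "VLLM_ROCM_USE_AITER": "1",
    "VLLM_ROCM_USE_AITER_MOE": "1",
    "VLLM_ROCM_USE_AITER_PAGED_ATTN": "1",
    "VLLM_USE_V1": "0",
    "PYTORCH_HIP_ALLOC_CONF": "expandable_segments:True",
}

# setdefault keeps any value the operator has exported explicitly.
for key, value in rocm_env_vars.items():
    os.environ.setdefault(key, value)
```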
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: vLLM is not installed or not properly configured` | vLLM ROCm build not installed | Use pre-built Docker image with ROCm vLLM |
| RCCL communication failures | RCCL misconfiguration | Enable RCCL debug: `NCCL_DEBUG=INFO NCCL_DEBUG_FILE=rccl.%h.%p.log` |
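The RCCL debug flags from the table can also be set in-process before distributed initialization; a sketch (RCCL honors the `NCCL_*` variable names, and `%h`/`%p` in the log filename expand to hostname and pid):

```python
import os

# Enable RCCL debug logging; one log file per host/process.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_FILE"] = "rccl.%h.%p.log"
```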
Compatibility Notes
- vLLM V1: Disabled on ROCm (`VLLM_USE_V1=0`). Only V0 engine is used.
- SGLang: Not tested on ROCm.
- Megatron: Not tested on ROCm.
- Flash Attention: ROCm uses AITER kernels instead.
- Pre-built images: Strongly recommended; building from source is difficult.