Environment:Deepspeedai DeepSpeed CUDA GPU Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, Distributed_Training |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
NVIDIA CUDA GPU environment with NCCL communication backend, required for GPU-accelerated DeepSpeed training and inference operations.
Description
This environment provides the primary GPU-accelerated context for DeepSpeed. It requires an NVIDIA GPU with CUDA support, the NCCL communication backend for distributed operations, and optionally pynvml (nvidia-ml-py) for GPU memory monitoring. The CUDA accelerator is automatically detected when `torch.cuda.device_count() > 0` and `torch.cuda.is_available()` return true, or can be forced via the `DS_ACCELERATOR=cuda` environment variable.
FP16 support requires compute capability >= 7.0 (Volta or newer). Compute capability 6.x (Pascal) FP16 is deprecated and requires setting `DS_ALLOW_DEPRECATED_FP16=1`. BF16 support requires Ampere (compute capability >= 8.0) or newer. Triton kernels require compute capability >= 8.0 (Ampere+).
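These capability rules can be summarized in a few lines of plain Python (an illustrative sketch, not DeepSpeed's actual API; `supported_precisions` is a hypothetical helper):

```python
# Which precisions the compute-capability rules above allow (illustrative).
def supported_precisions(major, minor=0, allow_deprecated_fp16=False):
    precisions = ["fp32"]
    if major >= 7 or (major == 6 and allow_deprecated_fp16):
        precisions.append("fp16")  # Pascal only with DS_ALLOW_DEPRECATED_FP16=1
    if major >= 8:
        precisions.extend(["bf16", "triton"])  # Ampere or newer
    return precisions

print(supported_precisions(7, 0))  # Volta: fp32 + fp16
print(supported_precisions(8, 0))  # Ampere adds bf16 and Triton kernels
```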
Usage
Use this environment for any DeepSpeed training or inference workflow that requires GPU acceleration. This is the mandatory prerequisite for running ZeRO-optimized training, tensor parallelism, pipeline parallelism, hybrid engine RLHF, and inference engine optimization. The CUDA accelerator uses NCCL as its communication backend on Linux and Gloo on Windows.
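A minimal `ds_config.json` for such a workflow might look like the following (a sketch; the batch size and ZeRO stage are placeholders to adapt to your model and GPU memory):

```json
{
  "train_batch_size": 32,
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 2 }
}
```

On pre-Ampere GPUs, replace the `bf16` block with `"fp16": { "enabled": true }`, per the compute capability requirements above.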
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended), Windows (limited) | Windows uses Gloo backend instead of NCCL; ops are pre-compiled on Windows vs JIT on Linux |
| Hardware | NVIDIA GPU with CUDA support | Compute capability >= 7.0 for FP16; >= 8.0 for BF16 and Triton |
| GPU Memory | Varies by workload | Minimum depends on model size, ZeRO stage, and offload configuration |
| Shared Memory | /dev/shm >= 1GB recommended | Required for NCCL; Docker default may be too small (see `--shm-size`) |
| CUDA Toolkit | CUDA 11.x or newer | Must match PyTorch CUDA version; nvcc required for JIT compilation |
| Communication | NCCL (Linux) / Gloo (Windows) | NCCL required for multi-GPU distributed training on Linux |
Dependencies
System Packages
- `cuda-toolkit` (matching PyTorch CUDA version)
- `nvidia-driver` (compatible with CUDA toolkit)
- `ninja` (for JIT compilation of C++/CUDA ops)
- `nccl` (bundled with PyTorch on Linux)
Python Packages
- `torch` (with CUDA support)
- `nvidia-ml-py` (pynvml, auto-installed for NVIDIA GPUs; not installed for ROCm)
Optional Python Packages
- `triton` >= 2.1.0 (for Triton-based kernels; requires compute capability >= 8.0)
- `apex` (for AMP mixed precision; alternative to native PyTorch AMP)
Environment Variables
The following environment variables control CUDA accelerator behavior:
- `DS_ACCELERATOR`: Override accelerator detection. Set to `cuda` to force CUDA backend.
- `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible to the process. pynvml auto-remaps device IDs.
- `TORCH_CUDA_ARCH_LIST`: Set compute capabilities for JIT compilation (auto-configured if not set).
- `DS_BUILD_OPS`: Set to `1` to pre-compile C++/CUDA ops at install time (default: JIT on Linux, pre-compile on Windows).
- `DS_ENABLE_NINJA`: Set to enable ninja build system for pre-installed ops.
- `DS_ALLOW_DEPRECATED_FP16`: Set to `1` to allow FP16 on Pascal GPUs (compute capability 6.x).
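Because accelerator detection runs at import time, these variables must be set before `deepspeed` is imported. A sketch (the specific values are examples, not recommendations):

```python
import os

# Must be set before `import deepspeed`; detection happens at import time.
os.environ["DS_ACCELERATOR"] = "cuda"           # force the CUDA backend
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0;9.0"  # JIT-compile for Ampere and Hopper
os.environ["DS_ALLOW_DEPRECATED_FP16"] = "1"    # only if stuck on Pascal hardware

# import deepspeed  # would now pick up the settings above
```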
Quick Install
```bash
# Install PyTorch with CUDA support (example for CUDA 12.1)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install DeepSpeed (ops are JIT compiled on first use on Linux)
pip install deepspeed

# Verify environment
ds_report
```
Code Evidence
Accelerator auto-detection from `accelerator/real_accelerator.py:191-202`:
```python
if accelerator_name is None:
    try:
        import torch
        if torch.cuda.device_count() > 0 and torch.cuda.is_available():  #ignore-cuda
            accelerator_name = "cuda"
    except (RuntimeError, ImportError) as e:
        pass
```
FP16 compute capability check from `accelerator/cuda_accelerator.py:202-214`:
```python
def is_fp16_supported(self):
    if not torch.cuda.is_available():
        return True
    allow_deprecated_fp16 = os.environ.get('DS_ALLOW_DEPRECATED_FP16', '0') == '1'
    major, _ = torch.cuda.get_device_capability()
    if major >= 7:
        return True
    elif major == 6 and allow_deprecated_fp16:
        return True
    else:
        return False
```
Triton support check from `accelerator/cuda_accelerator.py:247-254`:
```python
def is_triton_supported(self):
    if not self.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        return True
    else:
        return False
```
NCCL communication backend from `accelerator/cuda_accelerator.py:28`:
```python
self._communication_backend_name = 'nccl' if sys.platform != 'win32' else 'gloo'
```
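The same platform switch can be reproduced in a launcher script when initializing `torch.distributed` directly (a sketch; the rendezvous setup via `MASTER_ADDR` etc. is assumed to be handled elsewhere):

```python
import sys

# Mirror DeepSpeed's choice: NCCL everywhere except Windows, which gets Gloo.
backend = "nccl" if sys.platform != "win32" else "gloo"
# torch.distributed.init_process_group(backend=backend)  # assumes env:// rendezvous vars are set
print(backend)
```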
Shared memory check from `deepspeed/env_report.py:103-120`:
```python
def get_shm_size():
    shm_stats = os.statvfs('/dev/shm')
    shm_size = shm_stats.f_frsize * shm_stats.f_blocks
    if shm_size < 512 * 1024**2:
        warn.append(" [WARNING] /dev/shm size might be too small, if running in docker "
                    "increase to at least --shm-size='1gb'")
```
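The same check can be run standalone before launching training. A sketch using only the standard library (`shm_size_bytes` is an illustrative helper, not part of DeepSpeed; assumes a POSIX system with `/dev/shm` mounted):

```python
import os

def shm_size_bytes(path="/dev/shm"):
    # Total size of the filesystem backing `path`, as in the check above.
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks

MIN_SHM = 512 * 1024**2  # threshold below which ds_report warns
size = shm_size_bytes()
if size < MIN_SHM:
    print(f"/dev/shm is {size / 1024**2:.0f} MiB; consider --shm-size='1gb' in Docker")
```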
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `DS_ACCELERATOR must be one of [...]` | Invalid accelerator name in DS_ACCELERATOR env var | Set DS_ACCELERATOR to one of: cuda, cpu, xpu, xpu.external, npu, mps, hpu, mlu, sdaa |
| `Setting accelerator to CPU` (warning) | No GPU detected | Ensure NVIDIA drivers and CUDA toolkit are installed; check `nvidia-smi` |
| `[FAIL] cannot find CUDA_HOME` | CUDA toolkit not found | Install CUDA toolkit or set `CUDA_HOME` environment variable |
| `[FAIL] nvcc missing` | nvcc compiler not in PATH | Install CUDA toolkit; ensure `$CUDA_HOME/bin` is in PATH |
| `/dev/shm size might be too small` | Shared memory < 512MB | In Docker, use `--shm-size='1gb'` or larger |
| `ninja not found` | ninja build tool missing | `pip install ninja` or `apt install ninja-build` |
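The two toolkit-related failures (`CUDA_HOME`, `nvcc`) can be pre-checked with a few lines of standard-library Python (illustrative; this mirrors what `ds_report` looks for but is not DeepSpeed's actual code):

```python
import os
import shutil

# Pre-flight check for the CUDA_HOME / nvcc failures above.
cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
nvcc = shutil.which("nvcc")
if nvcc is None and cuda_home:
    candidate = os.path.join(cuda_home, "bin", "nvcc")
    if os.path.exists(candidate):
        nvcc = candidate

print(f"CUDA_HOME: {cuda_home or 'not set'}")
print(f"nvcc:      {nvcc or 'not found -- JIT compilation of ops will fail'}")
```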
Compatibility Notes
- AMD ROCm (HIP): DeepSpeed supports AMD GPUs through the ROCm/HIP stack. Triton is explicitly skipped on ROCm because `pytorch-triton-rocm` breaks the device API. `DS_ACCELERATOR=cuda` still applies, since ROCm is exposed through PyTorch's CUDA-like HIP interface.
- Windows: Uses Gloo backend instead of NCCL. Ops are pre-compiled at install time rather than JIT compiled. Requires Visual C++ Build Tools.
- Docker: Default `/dev/shm` size (64MB) is insufficient for NCCL. Always set `--shm-size='1gb'` or larger.
- pynvml: Automatically remaps device IDs when `CUDA_VISIBLE_DEVICES` is set, since pynvml ignores this env var.
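The remapping pynvml needs can be sketched in a few lines (`physical_gpu_index` is an illustrative helper, not DeepSpeed's implementation; it assumes numeric device IDs rather than GPU UUIDs in `CUDA_VISIBLE_DEVICES`):

```python
import os

def physical_gpu_index(logical_idx):
    # pynvml ignores CUDA_VISIBLE_DEVICES, so a torch device index must be
    # translated to the physical GPU index before querying NVML.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if not visible:
        return logical_idx
    return [int(i) for i in visible.split(",")][logical_idx]

os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
print(physical_gpu_index(0))  # logical device 0 is physical GPU 2
```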
Related Pages
- Implementation:Deepspeedai_DeepSpeed_Initialize
- Implementation:Deepspeedai_DeepSpeed_Ds_Report_Main
- Implementation:Deepspeedai_DeepSpeed_Init_Inference
- Implementation:Deepspeedai_DeepSpeed_HybridEngine_Init
- Implementation:Deepspeedai_DeepSpeed_AutoTP_Replace
- Implementation:Deepspeedai_DeepSpeed_LinearAllreduce_Forward