Environment:Deepspeedai DeepSpeed CUDA GPU Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, Distributed_Training |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
NVIDIA CUDA GPU environment with NCCL communication backend, required for GPU-accelerated DeepSpeed training and inference operations.
Description
This environment provides the primary GPU-accelerated context for DeepSpeed. It requires an NVIDIA GPU with CUDA support, the NCCL communication backend for distributed operations, and optionally pynvml (nvidia-ml-py) for GPU memory monitoring. The CUDA accelerator is automatically detected when `torch.cuda.device_count() > 0` and `torch.cuda.is_available()` return true, or can be forced via the `DS_ACCELERATOR=cuda` environment variable.
FP16 support requires compute capability >= 7.0 (Volta or newer). Compute capability 6.x (Pascal) FP16 is deprecated and requires setting `DS_ALLOW_DEPRECATED_FP16=1`. BF16 support requires Ampere (compute capability >= 8.0) or newer. Triton kernels require compute capability >= 8.0 (Ampere+).
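These capability rules can be summarized in a few lines of plain Python (an illustrative sketch, not DeepSpeed's actual API; `supported_precisions` is a hypothetical helper):

```python
# Which precisions the compute-capability rules above allow (illustrative).
def supported_precisions(major, minor=0, allow_deprecated_fp16=False):
    precisions = ["fp32"]
    if major >= 7 or (major == 6 and allow_deprecated_fp16):
        precisions.append("fp16")  # Pascal only with DS_ALLOW_DEPRECATED_FP16=1
    if major >= 8:
        precisions.extend(["bf16", "triton"])  # Ampere or newer
    return precisions

print(supported_precisions(7, 0))  # Volta: fp32 + fp16
print(supported_precisions(8, 0))  # Ampere adds bf16 and Triton kernels
```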
Usage
Use this environment for any DeepSpeed training or inference workflow that requires GPU acceleration. This is the mandatory prerequisite for running ZeRO-optimized training, tensor parallelism, pipeline parallelism, hybrid engine RLHF, and inference engine optimization. The CUDA accelerator uses NCCL as its communication backend on Linux and Gloo on Windows.
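A minimal `ds_config.json` for such a workflow might look like the following (a sketch; the batch size and ZeRO stage are placeholders to adapt to your model and GPU memory):

```json
{
  "train_batch_size": 32,
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 2 }
}
```

On pre-Ampere GPUs, replace the `bf16` block with `"fp16": { "enabled": true }`, per the compute capability requirements above.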
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended), Windows (limited) | Windows uses Gloo backend instead of NCCL; ops are pre-compiled on Windows vs JIT on Linux |
| Hardware | NVIDIA GPU with CUDA support | Compute capability >= 7.0 for FP16; >= 8.0 for BF16 and Triton |
| GPU Memory | Varies by workload | Minimum depends on model size, ZeRO stage, and offload configuration |
| Shared Memory | /dev/shm >= 1GB recommended | Required for NCCL; Docker default may be too small (see `--shm-size`) |
| CUDA Toolkit | CUDA 11.x or newer | Must match PyTorch CUDA version; nvcc required for JIT compilation |
| Communication | NCCL (Linux) / Gloo (Windows) | NCCL required for multi-GPU distributed training on Linux |
Dependencies
System Packages
- `cuda-toolkit` (matching PyTorch CUDA version)
- `nvidia-driver` (compatible with CUDA toolkit)
- `ninja` (for JIT compilation of C++/CUDA ops)
- `nccl` (bundled with PyTorch on Linux)
Python Packages
- `torch` (with CUDA support)
- `nvidia-ml-py` (pynvml, auto-installed for NVIDIA GPUs; not installed for ROCm)
Optional Python Packages
- `triton` >= 2.1.0 (for Triton-based kernels; requires compute capability >= 8.0)
- `apex` (for AMP mixed precision; alternative to native PyTorch AMP)
Environment Variables
The following environment variables control CUDA accelerator behavior:
- `DS_ACCELERATOR`: Override accelerator detection. Set to `cuda` to force CUDA backend.
- `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible to the process. pynvml auto-remaps device IDs.
- `TORCH_CUDA_ARCH_LIST`: Set compute capabilities for JIT compilation (auto-configured if not set).
- `DS_BUILD_OPS`: Set to `1` to pre-compile C++/CUDA ops at install time (default: JIT on Linux, pre-compile on Windows).
- `DS_ENABLE_NINJA`: Set to enable ninja build system for pre-installed ops.
- `DS_ALLOW_DEPRECATED_FP16`: Set to `1` to allow FP16 on Pascal GPUs (compute capability 6.x).
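Because accelerator detection runs at import time, these variables must be set before `deepspeed` is imported. A sketch (the specific values are examples, not recommendations):

```python
import os

# Must be set before `import deepspeed`; detection happens at import time.
os.environ["DS_ACCELERATOR"] = "cuda"           # force the CUDA backend
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0;9.0"  # JIT-compile for Ampere and Hopper
os.environ["DS_ALLOW_DEPRECATED_FP16"] = "1"    # only if stuck on Pascal hardware

# import deepspeed  # would now pick up the settings above
```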
Quick Install
```bash
# Install PyTorch with CUDA support (example for CUDA 12.1)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install DeepSpeed (ops are JIT compiled on first use on Linux)
pip install deepspeed

# Verify environment
ds_report
```
Code Evidence
Accelerator auto-detection from `accelerator/real_accelerator.py:191-202`:
```python
if accelerator_name is None:
    try:
        import torch
        if torch.cuda.device_count() > 0 and torch.cuda.is_available():  #ignore-cuda
            accelerator_name = "cuda"
    except (RuntimeError, ImportError) as e:
        pass
```
FP16 compute capability check from `accelerator/cuda_accelerator.py:202-214`:
```python
def is_fp16_supported(self):
    if not torch.cuda.is_available():
        return True
    allow_deprecated_fp16 = os.environ.get('DS_ALLOW_DEPRECATED_FP16', '0') == '1'
    major, _ = torch.cuda.get_device_capability()
    if major >= 7:
        return True
    elif major == 6 and allow_deprecated_fp16:
        return True
    else:
        return False
```
Triton support check from `accelerator/cuda_accelerator.py:247-254`:
```python
def is_triton_supported(self):
    if not self.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        return True
    else:
        return False
```
NCCL communication backend from `accelerator/cuda_accelerator.py:28`:
```python
self._communication_backend_name = 'nccl' if sys.platform != 'win32' else 'gloo'
```
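The same platform switch can be reproduced in a launcher script when initializing `torch.distributed` directly (a sketch; the rendezvous setup via `MASTER_ADDR` etc. is assumed to be handled elsewhere):

```python
import sys

# Mirror DeepSpeed's choice: NCCL everywhere except Windows, which gets Gloo.
backend = "nccl" if sys.platform != "win32" else "gloo"
# torch.distributed.init_process_group(backend=backend)  # assumes env:// rendezvous vars are set
print(backend)
```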
Shared memory check from `deepspeed/env_report.py:103-120`:
```python
def get_shm_size():
    shm_stats = os.statvfs('/dev/shm')
    shm_size = shm_stats.f_frsize * shm_stats.f_blocks
    if shm_size < 512 * 1024**2:
        warn.append(" [WARNING] /dev/shm size might be too small, if running in docker "
                    "increase to at least --shm-size='1gb'")
```
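The same check can be run standalone before launching training. A sketch using only the standard library (`shm_size_bytes` is an illustrative helper, not part of DeepSpeed; assumes a POSIX system with `/dev/shm` mounted):

```python
import os

def shm_size_bytes(path="/dev/shm"):
    # Total size of the filesystem backing `path`, as in the check above.
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks

MIN_SHM = 512 * 1024**2  # threshold below which ds_report warns
size = shm_size_bytes()
if size < MIN_SHM:
    print(f"/dev/shm is {size / 1024**2:.0f} MiB; consider --shm-size='1gb' in Docker")
```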
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `DS_ACCELERATOR must be one of [...]` | Invalid accelerator name in DS_ACCELERATOR env var | Set DS_ACCELERATOR to one of: cuda, cpu, xpu, xpu.external, npu, mps, hpu, mlu, sdaa |
| `Setting accelerator to CPU` (warning) | No GPU detected | Ensure NVIDIA drivers and CUDA toolkit are installed; check `nvidia-smi` |
| `[FAIL] cannot find CUDA_HOME` | CUDA toolkit not found | Install CUDA toolkit or set `CUDA_HOME` environment variable |
| `[FAIL] nvcc missing` | nvcc compiler not in PATH | Install CUDA toolkit; ensure `$CUDA_HOME/bin` is in PATH |
| `/dev/shm size might be too small` | Shared memory < 512MB | In Docker, use `--shm-size='1gb'` or larger |
| `ninja not found` | ninja build tool missing | `pip install ninja` or `apt install ninja-build` |
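The two toolkit-related failures (`CUDA_HOME`, `nvcc`) can be pre-checked with a few lines of standard-library Python (illustrative; this mirrors what `ds_report` looks for but is not DeepSpeed's actual code):

```python
import os
import shutil

# Pre-flight check for the CUDA_HOME / nvcc failures above.
cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
nvcc = shutil.which("nvcc")
if nvcc is None and cuda_home:
    candidate = os.path.join(cuda_home, "bin", "nvcc")
    if os.path.exists(candidate):
        nvcc = candidate

print(f"CUDA_HOME: {cuda_home or 'not set'}")
print(f"nvcc:      {nvcc or 'not found -- JIT compilation of ops will fail'}")
```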
Compatibility Notes
- AMD ROCm (HIP): DeepSpeed supports AMD GPUs through the ROCm/HIP stack. Triton is explicitly skipped on ROCm because `pytorch-triton-rocm` breaks the device API. `DS_ACCELERATOR=cuda` still applies, since ROCm is exposed through PyTorch's CUDA-like HIP interface.
- Windows: Uses Gloo backend instead of NCCL. Ops are pre-compiled at install time rather than JIT compiled. Requires Visual C++ Build Tools.
- Docker: Default `/dev/shm` size (64MB) is insufficient for NCCL. Always set `--shm-size='1gb'` or larger.
- pynvml: Automatically remaps device IDs when `CUDA_VISIBLE_DEVICES` is set, since pynvml ignores this env var.
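The remapping pynvml needs can be sketched in a few lines (`physical_gpu_index` is an illustrative helper, not DeepSpeed's implementation; it assumes numeric device IDs rather than GPU UUIDs in `CUDA_VISIBLE_DEVICES`):

```python
import os

def physical_gpu_index(logical_idx):
    # pynvml ignores CUDA_VISIBLE_DEVICES, so a torch device index must be
    # translated to the physical GPU index before querying NVML.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if not visible:
        return logical_idx
    return [int(i) for i in visible.split(",")][logical_idx]

os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"
print(physical_gpu_index(0))  # logical device 0 is physical GPU 2
```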
Related Pages
- Implementation:Deepspeedai_DeepSpeed_Initialize
- Implementation:Deepspeedai_DeepSpeed_Ds_Report_Main
- Implementation:Deepspeedai_DeepSpeed_Init_Inference
- Implementation:Deepspeedai_DeepSpeed_HybridEngine_Init
- Implementation:Deepspeedai_DeepSpeed_AutoTP_Replace
- Implementation:Deepspeedai_DeepSpeed_LinearAllreduce_Forward