Implementation:Deepspeedai DeepSpeed CUDA Accelerator

Knowledge Sources	DeepSpeed
Domains	Accelerator, GPU Backend
Last Updated	2026-02-09 00:00 GMT

Overview

NVIDIA CUDA GPU backend providing the primary and most fully-featured accelerator implementation for DeepSpeed.

Description

The CUDA_Accelerator class wraps torch.cuda APIs to implement the DeepSpeedAccelerator interface for NVIDIA GPUs. Its communication backend is nccl on every platform except Windows, where it uses gloo. It supports CUDA graphs via torch.cuda.CUDAGraph, NVTX profiling ranges, and torch.cuda.amp for mixed precision. At construction it initializes pynvml so that available-memory queries report the free memory of the physical device, with device indices remapped to respect CUDA_VISIBLE_DEVICES. FP16 support requires compute capability >= 7.0; setting the DS_ALLOW_DEPRECATED_FP16 environment variable also enables it on 6.x devices. Triton JIT compilation requires compute capability >= 8.0. Op builders are discovered lazily by scanning the op_builder module directory for classes whose names end in 'Builder'.
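The capability gates described above can be sketched as plain functions that take the major compute capability as an argument, which makes them easy to try without a GPU. This is an illustration, not DeepSpeed's code: `fp16_supported`, `triton_supported`, `deprecated_fp16_allowed`, and the `env` parameter are hypothetical names, and treating the value "1" as the way to switch the override on is an assumption.

```python
import os


def deprecated_fp16_allowed(env=None):
    """Return True when the DS_ALLOW_DEPRECATED_FP16 override is switched on."""
    env = os.environ if env is None else env
    # Assumption: the override counts as enabled only for the literal value "1".
    return env.get("DS_ALLOW_DEPRECATED_FP16", "0") == "1"


def fp16_supported(major, env=None):
    """FP16 gate: capability >= 7.0, or 6.x when the override is set."""
    return major >= 7 or (major == 6 and deprecated_fp16_allowed(env))


def triton_supported(major):
    """Triton gate: capability >= 8.0."""
    return major >= 8


# Try the gates on three well-known compute capabilities.
for name, major in [("P100 (6.0)", 6), ("V100 (7.0)", 7), ("A100 (8.0)", 8)]:
    print(name, "fp16:", fp16_supported(major, env={}),
          "triton:", triton_supported(major))
```

Passing `env` explicitly keeps the functions free of global state; on a real system the major version would come from `torch.cuda.get_device_capability()`.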

Usage

CUDA_Accelerator is the default accelerator when NVIDIA GPUs are detected. It provides full DeepSpeed functionality, including ZeRO optimization, pipeline parallelism, and custom CUDA kernels.

Code Reference

Source Location

Repository: DeepSpeed
File: accelerator/cuda_accelerator.py

Signature

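import os
import sys

import torch

from deepspeed.accelerator.abstract_accelerator import DeepSpeedAccelerator

# Module-level handle. _init_pynvml() (not shown) replaces it with the pynvml
# module when the NVML bindings import and initialize successfully.
pynvml = None


class CUDA_Accelerator(DeepSpeedAccelerator):
    # Abridged listing: only a representative subset of methods is shown.
    def __init__(self):
        self._name = 'cuda'
        # NCCL is not available on Windows, so gloo is used there.
        self._communication_backend_name = 'nccl' if sys.platform != 'win32' else 'gloo'
        self._compile_backend = "inductor"
        self._init_pynvml()

    def is_synchronized_device(self):
        # CUDA kernels launch asynchronously with respect to the host.
        return False

    def device_name(self, device_index=None):
        if device_index is None:
            return 'cuda'
        return f'cuda:{device_index}'

    def is_fp16_supported(self):
        # FP16 on compute capability 6.x is deprecated; opt in with DS_ALLOW_DEPRECATED_FP16=1.
        allow_deprecated_fp16 = os.environ.get('DS_ALLOW_DEPRECATED_FP16', '0') == '1'
        major, _ = torch.cuda.get_device_capability()
        return major >= 7 or (major == 6 and allow_deprecated_fp16)

    def is_triton_supported(self):
        major, _ = torch.cuda.get_device_capability()
        return major >= 8

    def available_memory(self, device_index=None):
        if pynvml:
            # NVML ignores CUDA_VISIBLE_DEVICES, so remap the index first.
            handle = pynvml.nvmlDeviceGetHandleByIndex(
                self._get_nvml_gpu_id(device_index))
            return pynvml.nvmlDeviceGetMemoryInfo(handle).free
        # Without pynvml, estimate free memory from PyTorch's own accounting.
        return self.total_memory(device_index) - self.memory_allocated(device_index)

    def create_graph(self):
        return torch.cuda.CUDAGraph()

    def _lazy_init_class_dict(self):
        # Lazily discover all *Builder classes in op_builder module
        ...  # body elided in this listing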
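The lazy discovery that `_lazy_init_class_dict` performs can be illustrated with a small stand-alone registry. This is a sketch of the general technique under stated assumptions, not the DeepSpeed implementation: `LazyBuilderRegistry`, `get_builder_class`, `ExampleBuilder`, and the in-memory `demo_ops` module are hypothetical, and the sketch scans module objects passed to it instead of walking a package directory.

```python
import inspect
import types


class LazyBuilderRegistry:
    """Collects classes named '*Builder' from modules on first lookup."""

    def __init__(self, modules):
        self._modules = list(modules)
        self._class_dict = None  # populated lazily, on the first lookup

    def _lazy_init_class_dict(self):
        if self._class_dict is not None:
            return  # already scanned
        self._class_dict = {}
        for module in self._modules:
            for name, obj in vars(module).items():
                # Keep only classes whose name ends in 'Builder'.
                if inspect.isclass(obj) and name.endswith("Builder"):
                    self._class_dict[name] = obj

    def get_builder_class(self, class_name):
        self._lazy_init_class_dict()
        return self._class_dict.get(class_name)


# Hypothetical stand-in for one module inside an op_builder package.
demo_ops = types.ModuleType("demo_ops")


class ExampleBuilder:
    pass


class Helper:  # ignored by the scan: the name does not end in 'Builder'
    pass


demo_ops.ExampleBuilder = ExampleBuilder
demo_ops.Helper = Helper

registry = LazyBuilderRegistry([demo_ops])
print(registry.get_builder_class("ExampleBuilder"))
```

Deferring the scan until the first lookup means that constructing the registry never imports or inspects anything, so the cost is only paid by code that actually requests a builder.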

Import

from deepspeed.accelerator.cuda_accelerator import CUDA_Accelerator

I/O Contract

Inputs

Name	Type	Required	Description
device_index	int	Optional	CUDA device index, from 0 to device_count() - 1
seed	int	Required	Random seed for CUDA RNG
graph	CUDAGraph	Required	Graph to capture/replay
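
NVML enumerates every physical GPU and ignores CUDA_VISIBLE_DEVICES, so a device_index as seen by PyTorch has to be translated before it is passed to pynvml. The helper below is a hypothetical sketch of that translation. `remap_to_nvml_index` and its `env` parameter are names introduced for illustration, and the sketch assumes the variable holds a comma-separated list of integer indices (it can also hold GPU UUIDs, which are not handled here).

```python
import os


def remap_to_nvml_index(torch_index, env=None):
    """Translate a PyTorch-visible device index to a physical NVML index."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if not visible:
        # Unset or empty: this sketch treats the two numberings as identical.
        return torch_index
    # Assumption: entries are integer indices, e.g. "2,3".
    physical_ids = [int(part) for part in visible.split(",")]
    return physical_ids[torch_index]


# With CUDA_VISIBLE_DEVICES=2,3 the process sees two devices, cuda:0 and cuda:1.
example_env = {"CUDA_VISIBLE_DEVICES": "2,3"}
for torch_index in range(2):
    print(torch_index, "->", remap_to_nvml_index(torch_index, example_env))
```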

Outputs

Name	Type	Description
device	torch.device	CUDA device object
device_count	int	Number of CUDA devices
memory_bytes	int	GPU memory in bytes
is_available	bool	Whether CUDA is available
compute_capability	tuple	(major, minor) compute capability

Usage Examples

from deepspeed.accelerator import get_accelerator

# Get CUDA accelerator (auto-detected if NVIDIA GPU present)
accelerator = get_accelerator()

# Device management
print(f"Device count: {accelerator.device_count()}")
accelerator.set_device(0)
print(f"Current device: {accelerator.current_device_name()}")

# Memory queries
total = accelerator.total_memory(0)
available = accelerator.available_memory(0)
allocated = accelerator.memory_allocated(0)
print(f"Memory: {allocated}/{total} bytes allocated, {available} bytes free")

# Check capabilities
print(f"FP16 supported: {accelerator.is_fp16_supported()}")
print(f"BF16 supported: {accelerator.is_bf16_supported()}")
print(f"Triton supported: {accelerator.is_triton_supported()}")

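# CUDA graphs
# (model and input_tensor are placeholders for your own module and a CUDA tensor)
graph = accelerator.create_graph()
with accelerator.capture_to_graph(graph):
    # Operations to capture
    output = model(input_tensor)
# Re-run the captured operations without re-launching them from Python
accelerator.replay_graph(graph)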

# Get op builders
transformer_builder = accelerator.get_op_builder('TransformerBuilder')
sparse_attn_builder = accelerator.get_op_builder('SparseAttnBuilder')

Related Pages

Abstract Accelerator - Base interface
Real Accelerator - Accelerator detection
CPU Accelerator - CPU fallback
