Implementation:Deepspeedai DeepSpeed CUDA Accelerator

From Leeroopedia


Knowledge Sources
Domains Accelerator, GPU Backend
Last Updated 2026-02-09 00:00 GMT

Overview

NVIDIA CUDA GPU backend providing the primary and most fully-featured accelerator implementation for DeepSpeed.

Description

The CUDA_Accelerator class wraps torch.cuda APIs to implement the DeepSpeedAccelerator interface for NVIDIA GPUs. It uses nccl for multi-GPU communication (gloo on Windows) and supports CUDA graphs via torch.cuda.CUDAGraph, NVTX profiling ranges, and torch.cuda.amp for mixed precision. On construction it initializes pynvml so that available-memory queries are accurate and map device indices correctly when CUDA_VISIBLE_DEVICES remaps them. FP16 support requires compute capability >= 7.0; the optional DS_ALLOW_DEPRECATED_FP16 override also enables it on 6.x devices. Triton JIT compilation requires compute capability >= 8.0. Op builders are discovered lazily by scanning the op_builder module directory for classes whose names end in 'Builder'.
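
The capability gates described above can be illustrated without a GPU. The following is a minimal sketch, not the library's code: the function names are hypothetical, and treating the value "1" as the way to enable DS_ALLOW_DEPRECATED_FP16 is an assumption.

```python
def allow_deprecated_fp16_from_env(env):
    # Assumption: the override is considered enabled when the variable is "1".
    return env.get("DS_ALLOW_DEPRECATED_FP16", "0") == "1"


def fp16_supported(major, allow_deprecated_fp16=False):
    # FP16 is accepted on compute capability >= 7.0,
    # or on 6.x when the deprecated-FP16 override is set.
    return major >= 7 or (major == 6 and allow_deprecated_fp16)


def triton_supported(major):
    # Triton JIT compilation is gated on compute capability >= 8.0.
    return major >= 8


if __name__ == "__main__":
    # Hypothetical sample of major compute capability versions.
    for major in (6, 7, 8):
        print(major, fp16_supported(major), triton_supported(major))
```

On a real system the major version would come from torch.cuda.get_device_capability(), as in the signature listing.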

Usage

CUDA_Accelerator is the default accelerator when NVIDIA GPUs are detected. It provides full DeepSpeed functionality, including ZeRO optimization, pipeline parallelism, and custom CUDA kernels.

Code Reference

Source Location

Signature

class CUDA_Accelerator(DeepSpeedAccelerator):
    def __init__(self):
        self._name = 'cuda'
        self._communication_backend_name = 'nccl' if sys.platform != 'win32' else 'gloo'
        self._compile_backend = "inductor"
        self._init_pynvml()

    def is_synchronized_device(self):
        return False

    def device_name(self, device_index=None):
        if device_index is None:
            return 'cuda'
        return f'cuda:{device_index}'

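    def is_fp16_supported(self):
        # FP16 on compute capability 6.x is deprecated; opt in via the environment
        allow_deprecated_fp16 = os.environ.get('DS_ALLOW_DEPRECATED_FP16', '0') == '1'
        major, _ = torch.cuda.get_device_capability()
        return major >= 7 or (major == 6 and allow_deprecated_fp16)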

    def is_triton_supported(self):
        major, _ = torch.cuda.get_device_capability()
        return major >= 8

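    def available_memory(self, device_index=None):
        if pynvml:
            if device_index is None:
                device_index = self.current_device()
            handle = pynvml.nvmlDeviceGetHandleByIndex(
                self._get_nvml_gpu_id(device_index))
            return pynvml.nvmlDeviceGetMemoryInfo(handle).free
        # Without pynvml, fall back to PyTorch's own allocation accounting
        return self.total_memory(device_index) - self.memory_allocated(device_index)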

    def create_graph(self):
        return torch.cuda.CUDAGraph()

    def _lazy_init_class_dict(self):
        # Lazily discover all *Builder classes in op_builder module
        ...
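
The available_memory method above passes the device index through _get_nvml_gpu_id because NVML numbers physical GPUs, while torch.cuda numbers only the devices left visible by CUDA_VISIBLE_DEVICES. A minimal sketch of that remapping follows. The function name nvml_gpu_id is hypothetical, and the sketch assumes the variable holds a comma-separated list of integer indices (UUID-style entries are not handled).

```python
from typing import Optional


def nvml_gpu_id(torch_gpu_id: int, cuda_visible_devices: Optional[str]) -> int:
    """Translate a torch-visible device index to a physical NVML index."""
    if cuda_visible_devices is None:
        # No remapping in effect: logical and physical indices coincide.
        return torch_gpu_id
    # Assumption: entries are integer indices such as "2,3".
    visible = [int(entry) for entry in cuda_visible_devices.split(",")]
    return visible[torch_gpu_id]


if __name__ == "__main__":
    # With CUDA_VISIBLE_DEVICES="2,3", torch device 0 is physical GPU 2.
    print(nvml_gpu_id(0, "2,3"))
```

In practice the second argument would be read with os.environ.get("CUDA_VISIBLE_DEVICES").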

Import

from deepspeed.accelerator.cuda_accelerator import CUDA_Accelerator

I/O Contract

Inputs

Name          Type       Required  Description
device_index  int        Optional  CUDA device index (0 to device_count - 1)
seed          int        Required  Random seed for CUDA RNG
graph         CUDAGraph  Required  Graph to capture/replay

Outputs

Name                Type          Description
device              torch.device  CUDA device object
device_count        int           Number of CUDA devices
memory_bytes        int           GPU memory in bytes
is_available        bool          Whether CUDA is available
compute_capability  tuple         (major, minor) compute capability

Usage Examples

from deepspeed.accelerator import get_accelerator

# Get CUDA accelerator (auto-detected if NVIDIA GPU present)
accelerator = get_accelerator()

# Device management
print(f"Device count: {accelerator.device_count()}")
accelerator.set_device(0)
print(f"Current device: {accelerator.current_device_name()}")

# Memory queries (all values are in bytes)
total = accelerator.total_memory(0)
available = accelerator.available_memory(0)
allocated = accelerator.memory_allocated(0)
print(f"Memory: {allocated}/{total} bytes allocated, {available} bytes available")

# Check capabilities
print(f"FP16 supported: {accelerator.is_fp16_supported()}")
print(f"BF16 supported: {accelerator.is_bf16_supported()}")
print(f"Triton supported: {accelerator.is_triton_supported()}")

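# CUDA graphs
# `model` and `input_tensor` are placeholders: both must already be on the
# CUDA device, and the tensors used during capture are reused on replay.
graph = accelerator.create_graph()
with accelerator.capture_to_graph(graph):
    # Operations to capture
    output = model(input_tensor)
accelerator.replay_graph(graph)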

# Get op builders
transformer_builder = accelerator.get_op_builder('TransformerBuilder')
sparse_attn_builder = accelerator.get_op_builder('SparseAttnBuilder')
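
Looking up a builder by name, as in the last two lines above, relies on a name-to-class dictionary that is filled on first use. The sketch below illustrates that lazy, suffix-based discovery in a simplified form. It is not DeepSpeed's implementation: LazyBuilderRegistry, discover_builders, and the demo module are hypothetical, and a single in-memory module stands in for the op_builder directory scan.

```python
import inspect
import types


def discover_builders(module):
    # Collect every class in the module whose name ends in 'Builder'.
    return {
        name: obj
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if name.endswith("Builder")
    }


class LazyBuilderRegistry:
    def __init__(self, module):
        self._module = module
        self._class_dict = None  # filled on first lookup

    def get(self, class_name):
        if self._class_dict is None:
            self._class_dict = discover_builders(self._module)
        # Unknown names yield None rather than raising.
        return self._class_dict.get(class_name)


# Hypothetical stand-in for an op_builder module.
demo = types.ModuleType("demo_op_builder")


class FusedAdamBuilder:
    pass


class Helper:
    pass


demo.FusedAdamBuilder = FusedAdamBuilder
demo.Helper = Helper

registry = LazyBuilderRegistry(demo)
```

Deferring the scan keeps accelerator construction cheap; the cost of importing builder modules is paid only when a builder is first requested.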
