Implementation:Deepspeedai DeepSpeed CUDA Accelerator
| Knowledge Sources | |
|---|---|
| Domains | Accelerator, GPU Backend |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
The NVIDIA CUDA GPU backend is the primary and most fully featured accelerator implementation for DeepSpeed.
Description
The CUDA_Accelerator class wraps the torch.cuda APIs to implement the DeepSpeedAccelerator interface for NVIDIA GPUs. It uses nccl for multi-GPU communication, falling back to gloo on Windows, and supports CUDA graphs via torch.cuda.CUDAGraph, NVTX profiling ranges, and torch.cuda.amp for mixed precision. On initialization it sets up pynvml so that available-memory queries are accurate and respect CUDA_VISIBLE_DEVICES remapping. FP16 support requires compute capability >= 7.0; setting DS_ALLOW_DEPRECATED_FP16 optionally enables it on 6.x devices. Triton JIT compilation requires compute capability >= 8.0. Op builders are discovered lazily by scanning the op_builder module directory for classes whose names end in 'Builder'.
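The lazy builder discovery described above can be sketched without DeepSpeed installed. This is a minimal illustration under stated assumptions, not the library's code: `collect_builders`, `LazyBuilderRegistry`, and the fake `op_builder_demo` module are hypothetical names, and the sketch inspects a single module object rather than walking a package directory.

```python
import inspect
import types


def collect_builders(module):
    """Return {class_name: class} for classes in `module` whose name ends in 'Builder'."""
    return {
        name: obj
        for name, obj in vars(module).items()
        if inspect.isclass(obj) and name.endswith("Builder")
    }


class LazyBuilderRegistry:
    """Defers the scan until the first lookup, then caches the result."""

    def __init__(self, module):
        self._module = module
        self._class_dict = None  # populated on first use

    def get(self, class_name):
        if self._class_dict is None:
            self._class_dict = collect_builders(self._module)
        return self._class_dict.get(class_name)


# Hypothetical stand-in for an op_builder module (not a real DeepSpeed module).
demo = types.ModuleType("op_builder_demo")
demo.DemoOpBuilder = type("DemoOpBuilder", (), {})
demo.Unrelated = type("Unrelated", (), {})  # ignored: name lacks the 'Builder' suffix
demo.helper = lambda: None                  # ignored: not a class

registry = LazyBuilderRegistry(demo)
print(registry.get("DemoOpBuilder"))
```

Deferring the scan keeps accelerator construction cheap, since the cost of inspecting the builders is only paid when a builder is first requested.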
Usage
CUDA_Accelerator is the default accelerator when NVIDIA GPUs are detected. It provides full DeepSpeed functionality, including ZeRO optimization, pipeline parallelism, and custom CUDA kernels.
Code Reference
Source Location
- Repository: DeepSpeed
- File: accelerator/cuda_accelerator.py
Signature
# Abridged excerpt: assumes os, sys, torch, DeepSpeedAccelerator, and an optional pynvml are available at module level.
class CUDA_Accelerator(DeepSpeedAccelerator):

    def __init__(self):
        self._name = 'cuda'
        self._communication_backend_name = 'nccl' if sys.platform != 'win32' else 'gloo'
        self._compile_backend = "inductor"
        self._init_pynvml()

    def is_synchronized_device(self):
        return False

    def device_name(self, device_index=None):
        if device_index is None:
            return 'cuda'
        return f'cuda:{device_index}'

    def is_fp16_supported(self):
        # Opt-in override for compute capability 6.x (the exact parsing of the variable is assumed here)
        allow_deprecated_fp16 = os.environ.get('DS_ALLOW_DEPRECATED_FP16', '0') == '1'
        major, _ = torch.cuda.get_device_capability()
        return major >= 7 or (major == 6 and allow_deprecated_fp16)

    def is_triton_supported(self):
        major, _ = torch.cuda.get_device_capability()
        return major >= 8

    def available_memory(self, device_index=None):
        if pynvml:
            # NVML ignores CUDA_VISIBLE_DEVICES, so the logical index is remapped first
            handle = pynvml.nvmlDeviceGetHandleByIndex(
                self._get_nvml_gpu_id(device_index))
            return pynvml.nvmlDeviceGetMemoryInfo(handle).free
        # (the path taken when pynvml is unavailable is not shown in this excerpt)

    def create_graph(self):
        return torch.cuda.CUDAGraph()

    def _lazy_init_class_dict(self):
        # Lazily discover all *Builder classes in the op_builder module
        ...
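The call to `_get_nvml_gpu_id` in `available_memory` is needed because NVML enumerates physical GPUs and ignores CUDA_VISIBLE_DEVICES, while torch uses logical indices. The sketch below illustrates one way such a remapping could work. The function name `logical_to_nvml_index` and the `env` parameter are hypothetical, and the sketch assumes the variable holds comma-separated integer ordinals (CUDA also accepts GPU UUIDs, which are not handled here).

```python
import os


def logical_to_nvml_index(device_index, env=None):
    """Map a torch-style logical device index to a physical NVML index.

    If CUDA_VISIBLE_DEVICES is unset or empty, the index is returned unchanged.
    """
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    if not visible.strip():
        return device_index
    # Assumption: entries are integer ordinals such as "2,3", not UUIDs.
    ids = [int(tok) for tok in visible.split(",") if tok.strip()]
    return ids[device_index]


# With CUDA_VISIBLE_DEVICES="2,3", logical device 0 is physical GPU 2.
print(logical_to_nvml_index(0, {"CUDA_VISIBLE_DEVICES": "2,3"}))
```

Passing the environment as a parameter keeps the helper testable without mutating `os.environ`.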
Import
from deepspeed.accelerator.cuda_accelerator import CUDA_Accelerator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| device_index | int | Optional | CUDA device index, from 0 to device_count() - 1 |
| seed | int | Required | Random seed for CUDA RNG |
| graph | torch.cuda.CUDAGraph | Required | Graph passed to capture_to_graph and replay_graph |
Outputs
| Name | Type | Description |
|---|---|---|
| device | torch.device | CUDA device object |
| device_count | int | Number of CUDA devices |
| memory_bytes | int | GPU memory in bytes |
| is_available | bool | Whether CUDA is available |
| compute_capability | tuple | (major, minor) compute capability |
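The compute_capability tuple drives the FP16 and Triton checks. The sketch below restates those thresholds as pure functions so they can be exercised without a GPU. The function names are hypothetical, and treating the value "1" as the only way to enable DS_ALLOW_DEPRECATED_FP16 is an assumption rather than something this page confirms.

```python
import os


def allow_deprecated_fp16(env=None):
    """Read the opt-in override (assumed here to be enabled by the value '1')."""
    env = os.environ if env is None else env
    return env.get("DS_ALLOW_DEPRECATED_FP16", "0") == "1"


def fp16_supported(capability, allow_deprecated=False):
    """FP16 needs compute capability >= 7.0, or 6.x with the override enabled."""
    major, _minor = capability
    return major >= 7 or (major == 6 and allow_deprecated)


def triton_supported(capability):
    """Triton JIT compilation needs compute capability >= 8.0."""
    major, _minor = capability
    return major >= 8


for cap in [(6, 1), (7, 0), (8, 0)]:
    print(cap, fp16_supported(cap), triton_supported(cap))
```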
Usage Examples
from deepspeed.accelerator import get_accelerator

# Get the CUDA accelerator (auto-detected if an NVIDIA GPU is present)
accelerator = get_accelerator()

# Device management
print(f"Device count: {accelerator.device_count()}")
accelerator.set_device(0)
print(f"Current device: {accelerator.current_device_name()}")

# Memory queries (all values in bytes)
total = accelerator.total_memory(0)
available = accelerator.available_memory(0)
allocated = accelerator.memory_allocated(0)
print(f"Memory: {allocated}/{total} bytes allocated, {available} bytes available")

# Check capabilities
print(f"FP16 supported: {accelerator.is_fp16_supported()}")
print(f"BF16 supported: {accelerator.is_bf16_supported()}")
print(f"Triton supported: {accelerator.is_triton_supported()}")

# CUDA graphs (model and input_tensor are placeholders for a module and an input already on the GPU)
graph = accelerator.create_graph()
with accelerator.capture_to_graph(graph):
    # Operations to capture
    output = model(input_tensor)
accelerator.replay_graph(graph)

# Get op builders
transformer_builder = accelerator.get_op_builder('TransformerBuilder')
sparse_attn_builder = accelerator.get_op_builder('SparseAttnBuilder')
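The memory queries above return raw byte counts. As a small convenience, the sketch below converts them into a GiB report. `memory_report` and `FakeAccelerator` are hypothetical names introduced for illustration; the stub only mimics the three query methods used above so the example runs without a GPU, and its sizes are made-up values.

```python
GIB = 1024 ** 3


def memory_report(accelerator, device_index=0):
    """Summarize total, free, and allocated memory for one device in GiB."""
    total = accelerator.total_memory(device_index)
    free = accelerator.available_memory(device_index)
    allocated = accelerator.memory_allocated(device_index)
    return {
        "total_gib": total / GIB,
        "free_gib": free / GIB,
        "allocated_gib": allocated / GIB,
        "free_fraction": free / total if total else 0.0,
    }


class FakeAccelerator:
    """Hypothetical stub exposing the same three query methods."""

    def total_memory(self, device_index=None):
        return 16 * GIB

    def available_memory(self, device_index=None):
        return 12 * GIB

    def memory_allocated(self, device_index=None):
        return 2 * GIB


report = memory_report(FakeAccelerator())
print(report)
```

With a real accelerator, the same function could be called as `memory_report(get_accelerator(), 0)`.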
Related Pages
- Abstract Accelerator - Base interface
- Real Accelerator - Accelerator detection
- CPU Accelerator - CPU fallback