Principle:Deepspeedai DeepSpeed Async IO Operations
| Knowledge Sources | |
|---|---|
| Domains | Storage_IO, Tensor_Offloading, NVMe |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
High-bandwidth asynchronous I/O subsystem for swapping tensors between GPU/CPU memory and NVMe storage devices during ZeRO-Infinity and ZeRO-Offload training.
Description
Async IO Operations enable DeepSpeed to extend the memory hierarchy beyond GPU and CPU DRAM to include NVMe solid-state drives. This is the storage backbone of ZeRO-Infinity, which offloads parameters, gradients, and optimizer states to NVMe when CPU memory is also insufficient for extremely large models.
The subsystem provides several key capabilities:
- Linux AIO integration: Uses the Linux kernel's native asynchronous I/O interface (through the libaio library) to issue non-blocking read and write requests to NVMe devices, allowing computation to overlap with storage transfers
- GPUDirect Storage (GDS): Optionally bypasses CPU memory entirely by using NVIDIA GDS to transfer data directly between GPU memory and NVMe storage via PCIe DMA
- Pinned memory management: Manages page-locked (pinned) CPU memory buffers that enable efficient DMA transfers between devices
- SIMD-accelerated memory copies: Uses vectorized memcpy operations for high-throughput CPU memory transfers during tensor staging
- Handle-based API: A Python-exposed handle abstraction (aio_handle, gds_handle) that manages I/O queues, submission, and completion for both AIO and GDS paths
- Benchmarking utilities: Performance sweep tools for tuning I/O parameters (queue depth, block size, parallelism) to maximize throughput on specific hardware configurations
The subsystem is compiled as a C++ extension via DeepSpeed's OpBuilder system and exposed through pybind11 bindings.
Usage
Enable NVMe offloading by setting offload_param.device or offload_optimizer.device to "nvme" in the zero_optimization section of the DeepSpeed configuration, and set the matching offload_param.nvme_path or offload_optimizer.nvme_path to a directory on a fast NVMe mount. DeepSpeed then creates and manages the async I/O handles for tensor swapping automatically. Use the ds_aio benchmark tool to tune I/O parameters for your specific NVMe hardware.
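A configuration along these lines would enable NVMe offload for both parameters and optimizer state. This is a minimal sketch rather than a complete config: the mount point /local_nvme is a placeholder, the values in the aio section are illustrative rather than tuned, and other required fields (batch size, optimizer, and so on) are omitted. The helper nvme_offload_targets is a hypothetical convenience function, not part of DeepSpeed.

```python
import json

# Hypothetical mount point; replace with a fast local NVMe filesystem.
NVME_PATH = "/local_nvme"

ds_config = {
    "zero_optimization": {
        "stage": 3,  # NVMe offload is used with ZeRO stage 3
        "offload_param": {"device": "nvme", "nvme_path": NVME_PATH, "pin_memory": True},
        "offload_optimizer": {"device": "nvme", "nvme_path": NVME_PATH, "pin_memory": True},
    },
    # Illustrative I/O settings; tune these for the target device.
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}


def nvme_offload_targets(config):
    """Return the names of the offload sections that target NVMe."""
    zero = config.get("zero_optimization", {})
    return sorted(
        name
        for name in ("offload_param", "offload_optimizer")
        if zero.get(name, {}).get("device") == "nvme"
    )


print(json.dumps(ds_config, indent=2))
print(nvme_offload_targets(ds_config))
```

The aio section is where parameters found by benchmarking (queue depth, block size, thread count) would be recorded once they are known for the target hardware.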
Theoretical Basis
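Asynchronous I/O decouples the issuing of storage requests from their completion, allowing the CPU and GPU to continue computation while data transfers proceed in the background. This is critical because NVMe is the slowest tier in the hierarchy: each request takes on the order of tens of microseconds to complete, and sustained bandwidth is roughly two orders of magnitude below GPU memory bandwidth. Without overlap, the GPU would sit idle for the full duration of every transfer.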
Bandwidth hierarchy:
- GPU HBM: 1-3 TB/s
- CPU DDR: 100-400 GB/s
- NVMe SSD: 3-14 GB/s (per device, higher with RAID)
- Network: 25-400 Gbps (roughly 3-50 GB/s)
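To make the gap between tiers concrete, the following sketch computes the ideal transfer time for a fixed payload at each tier. The bandwidth values are taken from the low end of the ranges above, and the 12 GB payload is an arbitrary illustrative size, not a measurement.

```python
# Representative bandwidths in GB/s, taken from the low end of the ranges above.
BANDWIDTH_GB_PER_S = {
    "GPU HBM": 1000.0,
    "CPU DDR": 100.0,
    "NVMe SSD": 3.0,
}


def transfer_seconds(size_gb, tier):
    """Ideal time to move size_gb gigabytes at the given tier's bandwidth."""
    return size_gb / BANDWIDTH_GB_PER_S[tier]


# Hypothetical payload: 12 GB of offloaded state.
for tier in BANDWIDTH_GB_PER_S:
    print(f"{tier}: {transfer_seconds(12.0, tier):.3f} s")
```

At 3 GB/s the NVMe transfer takes about 4 seconds, which is why it has to be hidden behind computation rather than waited on.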
Pipeline overlap model: The key insight is that NVMe reads for the next iteration's parameters can overlap with the current iteration's computation:
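# Abstract async I/O pipeline for ZeRO-Infinity. Helper names such as
# create_aio_context and submit_aio_read are placeholders, not DeepSpeed APIs.
class AsyncIOHandle:
    def __init__(self, block_size, queue_depth, single_submit, overlap_events):
        self.aio_context = create_aio_context(queue_depth)
        # Separate staging buffers so a read never overwrites an in-flight write
        self.read_buffer = allocate_pinned_memory(block_size * queue_depth)
        self.write_buffer = allocate_pinned_memory(block_size * queue_depth)
        self.pending_read = None  # destination tensor of an in-flight read

    def async_read(self, tensor, filename, blocking=False):
        submit_aio_read(self.aio_context, self.read_buffer, filename)
        self.pending_read = tensor
        if blocking:
            self.wait()

    def async_write(self, tensor, filename, blocking=False):
        # Stage the tensor in pinned memory before submitting the write
        copy_from_tensor(tensor, self.write_buffer)
        submit_aio_write(self.aio_context, self.write_buffer, filename)
        if blocking:
            self.wait()

    def wait(self):
        # Block until all submitted requests finish; only then is the read
        # buffer valid and safe to copy into the destination tensor
        wait_for_completion(self.aio_context)
        if self.pending_read is not None:
            copy_to_tensor(self.read_buffer, self.pending_read)
            self.pending_read = None

# Training loop with overlapped NVMe I/O
for step in training_steps:
    # Prefetch next step's parameters from NVMe (async)
    handle.async_read(next_params, nvme_path(step + 1))
    # Compute on current parameters (overlaps with I/O)
    loss = forward(current_params, batch)
    loss.backward()
    # Wait for the prefetch (and the previous step's write) to complete
    handle.wait()
    # Offload current gradients to NVMe (async)
    handle.async_write(grads, nvme_grad_path(step))
    # The prefetched parameters become the current ones for the next step
    current_params, next_params = next_params, current_params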
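The same prefetch-then-compute pattern can be tried without libaio or a GPU. The sketch below swaps in a background thread from the Python standard library for kernel AIO, and uses small temporary files in place of NVMe-resident tensors. All function names here are hypothetical and exist only for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Blocking read; runs on a worker thread so the caller can keep computing."""
    with open(path, "rb") as f:
        return f.read()


def compute(data):
    """Stand-in for forward/backward: sum the bytes of the current chunk."""
    return sum(data)


def run_pipeline(paths):
    """Process files in order while prefetching the next one in the background."""
    if not paths:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_file, paths[0])
        for i in range(len(paths)):
            current = pending.result()  # wait for the prefetch to finish
            if i + 1 < len(paths):
                # Start reading the next chunk before computing on this one
                pending = pool.submit(read_file, paths[i + 1])
            results.append(compute(current))  # overlaps with the read above
    return results


def demo():
    """Write three small chunk files, then run the overlapped pipeline on them."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = os.path.join(tmp, f"chunk_{i}.bin")
            with open(path, "wb") as f:
                f.write(bytes([i + 1]) * 4)
            paths.append(path)
        return run_pipeline(paths)


print(demo())
```

The ordering matters: the next read is submitted before the current chunk is processed, so the I/O and the computation run concurrently, and the only blocking point is the result() call at the top of each iteration.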
Queue depth tuning: Higher queue depths allow the NVMe controller to reorder and parallelize requests internally, improving throughput. However, excessive queue depth increases memory consumption for pinned buffers. The optimal queue depth depends on the specific NVMe device and workload pattern.
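The trade-off can be approximated with Little's law: sustained throughput is the amount of data in flight divided by per-request latency, capped at the device's peak bandwidth. The sketch below is a deliberately simplified model, assuming a constant per-request latency and a pinned buffer of block_size * queue_depth bytes as in the abstract handle above. The device figures are hypothetical, and real tuning should rely on measured sweeps.

```python
def estimate(queue_depth, block_size_bytes, latency_s, device_peak_bytes_per_s):
    """Return (throughput in bytes/s, pinned buffer bytes) for one configuration."""
    in_flight = queue_depth * block_size_bytes
    # Little's law: sustained throughput = data in flight / per-request latency,
    # capped by what the device can deliver.
    throughput = min(in_flight / latency_s, device_peak_bytes_per_s)
    return throughput, in_flight


# Hypothetical device: 100 microsecond request latency, 6 GB/s peak bandwidth.
LATENCY_S = 100e-6
PEAK = 6e9
BLOCK = 128 * 1024

for qd in (1, 2, 4, 8, 16, 32):
    bw, pinned = estimate(qd, BLOCK, LATENCY_S, PEAK)
    print(f"queue_depth={qd:2d}  ~{bw / 1e9:.2f} GB/s  pinned={pinned / 2**20:.2f} MiB")
```

In this model throughput grows linearly with queue depth until the device saturates, after which additional depth only increases pinned memory use.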
Related Pages
Implemented By
- Implementation:Deepspeedai_DeepSpeed_AIO_Common — Shared AIO constants, structures, and utility functions
- Implementation:Deepspeedai_DeepSpeed_IO_Handle — Core AIO handle managing read/write queues and completion
- Implementation:Deepspeedai_DeepSpeed_AIO_Bench_Perf_Sweep — NVMe performance benchmarking and parameter sweep tool
- Implementation:Deepspeedai_DeepSpeed_AIO_Utils — AIO helper utilities for buffer management
- Implementation:Deepspeedai_DeepSpeed_CPU_IO_Op — CPU-side I/O operation for synchronous fallback
- Implementation:Deepspeedai_DeepSpeed_Pin_Tensor — Pinned memory tensor allocation and management
- Implementation:Deepspeedai_DeepSpeed_Py_AIO — Python bindings for async I/O read/write operations
- Implementation:Deepspeedai_DeepSpeed_Py_Copy — Python bindings for SIMD-accelerated memory copy
- Implementation:Deepspeedai_DeepSpeed_IO_Handle_Interface — Abstract handle interface for AIO and GDS backends
- Implementation:Deepspeedai_DeepSpeed_Py_DS_AIO_Module — Top-level pybind11 module exposing the AIO subsystem
- Implementation:Deepspeedai_DeepSpeed_GDS_Op — NVIDIA GPUDirect Storage operation for direct GPU-NVMe transfers