Principle:Deepspeedai DeepSpeed Async IO Operations

Knowledge Sources	DeepSpeed ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
Domains	Storage_IO, Tensor_Offloading, NVMe
Last Updated	2026-02-09 00:00 GMT

Overview

High-bandwidth asynchronous I/O subsystem for swapping tensors between GPU/CPU memory and NVMe storage devices during ZeRO-Infinity and ZeRO-Offload training.

Description

Async IO Operations enable DeepSpeed to extend the memory hierarchy beyond GPU and CPU DRAM to include NVMe solid-state drives. This is the storage backbone of ZeRO-Infinity, which offloads parameters, gradients, and optimizer states to NVMe when CPU memory is also insufficient for extremely large models.

The subsystem provides several key capabilities:

Linux AIO integration: Uses the Linux kernel's native asynchronous I/O interface, accessed through the libaio library, to issue non-blocking read and write requests to NVMe devices, allowing computation to overlap with storage transfers
GPUDirect Storage (GDS): Optionally bypasses CPU memory entirely by using NVIDIA GDS to transfer data directly between GPU memory and NVMe storage via PCIe DMA
Pinned memory management: Manages page-locked (pinned) CPU memory buffers that enable efficient DMA transfers between devices
SIMD-accelerated memory copies: Uses vectorized memcpy operations for high-throughput CPU memory transfers during tensor staging
Handle-based API: A Python-exposed handle abstraction (aio_handle, gds_handle) that manages I/O queues, submission, and completion for both AIO and GDS paths
Benchmarking utilities: Performance sweep tools for tuning I/O parameters (queue depth, block size, parallelism) to maximize throughput on specific hardware configurations

The subsystem is compiled as a C++ extension via DeepSpeed's OpBuilder system and exposed through pybind11 bindings.
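Because the extension is compiled, it can help to check whether it can be built on a given machine before enabling NVMe offload. The sketch below assumes the installed DeepSpeed version exposes AsyncIOBuilder in deepspeed.ops.op_builder with an is_compatible() method. The helper name aio_available is hypothetical, and it returns False when DeepSpeed is not installed.

```python
def aio_available():
    """Return True if the DeepSpeed async I/O op reports itself as buildable.

    Assumption: deepspeed.ops.op_builder.AsyncIOBuilder and its
    is_compatible() method exist in the installed DeepSpeed version.
    """
    try:
        from deepspeed.ops.op_builder import AsyncIOBuilder
    except ImportError:
        # DeepSpeed (or one of its dependencies) is not installed.
        return False
    return bool(AsyncIOBuilder().is_compatible())


if __name__ == "__main__":
    print("async I/O op available:", aio_available())
```

A False result typically points to a missing libaio development package or a missing DeepSpeed installation, so it is worth checking before debugging the training configuration.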

Usage

Enable NVMe offloading by setting offload_param.device or offload_optimizer.device to "nvme" in the DeepSpeed configuration and providing an offload_param.nvme_path or offload_optimizer.nvme_path pointing to a fast NVMe mount. DeepSpeed automatically manages the async I/O handles for tensor swapping. Use the ds_aio benchmark tool to tune I/O parameters for your specific NVMe hardware.
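As an illustration, the settings described above might be assembled as follows. This is a minimal sketch: the mount point /local_nvme is hypothetical, ZeRO stage 3 is assumed, and the values in the "aio" section are starting points to tune by benchmarking, not recommendations.

```python
import json

# Hypothetical mount point; replace with a fast local NVMe path.
NVME_PATH = "/local_nvme"

ds_config = {
    "zero_optimization": {
        "stage": 3,  # assumed: parameter offload is a stage 3 feature
        "offload_param": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
    },
    # I/O tuning knobs; illustrative values only.
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}

# Serialize to the JSON form usually passed to DeepSpeed as a config file.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

The parameter and optimizer sections are independent, so either one can stay on CPU while the other targets NVMe.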

Theoretical Basis

Asynchronous I/O decouples the issuing of storage requests from their completion, allowing the CPU and GPU to continue computation while data transfers proceed in the background. This is critical for hiding the latency gap between NVMe storage, where a single request takes microseconds to complete, and GPU computation, where individual operations take nanoseconds.
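The separation of submission from completion can be sketched in plain Python. This example swaps in a worker thread for libaio, so it illustrates only the control flow, not DeepSpeed's implementation; read_file, compute, and run_pipeline are hypothetical names.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Blocking read; runs on a worker thread so the caller is not stalled."""
    with open(path, "rb") as f:
        return f.read()


def compute(data):
    """Stand-in for computation on data that is already in memory."""
    return sum(data) % 256


def run_pipeline(paths):
    """Process files in order while prefetching the next one in the background."""
    results = []
    if not paths:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_file, paths[0])  # issue the first request
        for i in range(len(paths)):
            current = pending.result()  # completion: the only blocking point
            if i + 1 < len(paths):
                # Issue the next request before computing on the current data.
                pending = pool.submit(read_file, paths[i + 1])
            results.append(compute(current))  # overlaps with the pending read
    return results


# Small demonstration on temporary files.
tmpdir = tempfile.mkdtemp()
demo_paths = []
for i in range(3):
    path = os.path.join(tmpdir, f"shard_{i}.bin")
    with open(path, "wb") as f:
        f.write(bytes([i + 1]) * 4)
    demo_paths.append(path)

demo_results = run_pipeline(demo_paths)
print(demo_results)
```

The caller blocks only at pending.result(), which is the point where the data is actually needed. Everything between submission and that call is time in which the transfer is hidden.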

Bandwidth hierarchy:

GPU HBM: 1-3 TB/s
CPU DDR: 100-400 GB/s
NVMe SSD: 3-14 GB/s (per device, higher with RAID)
Network: 25-400 Gbps (roughly 3-50 GB/s)
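To make the gaps between tiers concrete, the ranges above can be turned into transfer times for a fixed payload. This is a back-of-the-envelope sketch: the payload of 10 billion fp16 parameters is a hypothetical example, and sustained bandwidth in practice is usually below the peak figures.

```python
# Bandwidth ranges from the list above, in GB/s (network converted from Gbps).
TIERS_GB_PER_S = {
    "GPU HBM": (1000.0, 3000.0),
    "CPU DDR": (100.0, 400.0),
    "NVMe SSD": (3.0, 14.0),
    "Network": (25 / 8, 400 / 8),
}


def transfer_seconds(num_bytes, gb_per_s):
    """Time to move num_bytes at a sustained rate of gb_per_s (1 GB = 1e9 bytes)."""
    return num_bytes / (gb_per_s * 1e9)


# Hypothetical payload: 10 billion parameters stored in 2 bytes each (fp16).
payload_bytes = 10_000_000_000 * 2

for tier, (low, high) in TIERS_GB_PER_S.items():
    slowest = transfer_seconds(payload_bytes, low)
    fastest = transfer_seconds(payload_bytes, high)
    print(f"{tier:9s} {fastest:9.3f} s to {slowest:9.3f} s")
```

A single NVMe device is roughly two orders of magnitude slower than HBM, which is why transfers at this tier have to be overlapped with computation rather than waited on.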

Pipeline overlap model: The key insight is that NVMe reads for the next iteration's parameters can overlap with the current iteration's computation:

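# Abstract async I/O pipeline for ZeRO-Infinity.
# Pseudocode: the lowercase helper functions are placeholders, not DeepSpeed APIs.
class AsyncIOHandle:
    def __init__(self, block_size, queue_depth, single_submit, overlap_events):
        self.block_size = block_size
        self.single_submit = single_submit    # submit requests individually instead of as one batch
        self.overlap_events = overlap_events  # submit new requests without waiting for earlier ones
        self.aio_context = create_aio_context(queue_depth)
        # Separate staging buffers so an in-flight write cannot clobber a read
        self.read_buffer = allocate_pinned_memory(block_size * queue_depth)
        self.write_buffer = allocate_pinned_memory(block_size * queue_depth)
        self.pending_read = None              # tensor waiting for staged read data

    def async_read(self, tensor, filename, blocking=False):
        submit_aio_read(self.aio_context, self.read_buffer, filename)
        self.pending_read = tensor
        if blocking:
            self.wait()

    def async_write(self, tensor, filename, blocking=False):
        copy_from_tensor(tensor, self.write_buffer)
        submit_aio_write(self.aio_context, self.write_buffer, filename)
        if blocking:
            self.wait()

    def wait(self):
        # Block until every in-flight request completes, then unstage any read
        wait_for_completion(self.aio_context)
        if self.pending_read is not None:
            copy_to_tensor(self.read_buffer, self.pending_read)
            self.pending_read = None

# Training loop with overlapped NVMe I/O
for step in training_steps:
    # Prefetch next step's parameters from NVMe (async)
    handle.async_read(next_params, nvme_path(step + 1))

    # Compute on current parameters (overlaps with I/O)
    loss = forward(current_params, batch)
    loss.backward()

    # Wait for the prefetch (and the previous step's gradient write) to complete
    handle.wait()

    # Offload current gradients to NVMe (async, overlaps with the next step)
    handle.async_write(grads, nvme_grad_path(step))

    # The prefetched parameters become the current ones for the next step
    current_params, next_params = next_params, current_params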

Queue depth tuning: Higher queue depths allow the NVMe controller to reorder and parallelize requests internally, improving throughput. However, excessive queue depth increases memory consumption for pinned buffers. The optimal queue depth depends on the specific NVMe device and workload pattern.
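The memory side of this trade-off can be estimated with simple arithmetic. The sketch below assumes, as in the abstract handle above, that each in-flight request needs one block-sized pinned buffer, so the staging footprint is block_size * queue_depth per handle. The function name pinned_bytes and the num_handles parameter are hypothetical.

```python
MIB = 1024 * 1024


def pinned_bytes(block_size, queue_depth, num_handles=1):
    """Pinned staging memory, assuming one block-sized buffer per in-flight request."""
    return block_size * queue_depth * num_handles


# Footprint for a 1 MiB block size across several queue depths.
for depth in (4, 8, 32, 128):
    total = pinned_bytes(block_size=MIB, queue_depth=depth)
    print(f"queue_depth={depth:4d} -> {total // MIB} MiB pinned")
```

Since pinned memory cannot be paged out, this footprint is taken directly from the CPU memory available for offloaded tensors, which is one reason to stop raising the queue depth once measured throughput flattens.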

Related Pages

Implemented By

Implementation:Deepspeedai_DeepSpeed_AIO_Common — Shared AIO constants, structures, and utility functions
Implementation:Deepspeedai_DeepSpeed_IO_Handle — Core AIO handle managing read/write queues and completion
Implementation:Deepspeedai_DeepSpeed_AIO_Bench_Perf_Sweep — NVMe performance benchmarking and parameter sweep tool
Implementation:Deepspeedai_DeepSpeed_AIO_Utils — AIO helper utilities for buffer management
Implementation:Deepspeedai_DeepSpeed_CPU_IO_Op — CPU-side I/O operation for synchronous fallback
Implementation:Deepspeedai_DeepSpeed_Pin_Tensor — Pinned memory tensor allocation and management
Implementation:Deepspeedai_DeepSpeed_Py_AIO — Python bindings for async I/O read/write operations
Implementation:Deepspeedai_DeepSpeed_Py_Copy — Python bindings for SIMD-accelerated memory copy
Implementation:Deepspeedai_DeepSpeed_IO_Handle_Interface — Abstract handle interface for AIO and GDS backends
Implementation:Deepspeedai_DeepSpeed_Py_DS_AIO_Module — Top-level pybind11 module exposing the AIO subsystem
Implementation:Deepspeedai_DeepSpeed_GDS_Op — NVIDIA GPUDirect Storage operation for direct GPU-NVMe transfers
