Principle:Deepspeedai DeepSpeed Async IO Operations

From Leeroopedia


Knowledge Sources
Domains: Storage_IO, Tensor_Offloading, NVMe
Last Updated: 2026-02-09 00:00 GMT

Overview

High-bandwidth asynchronous I/O subsystem for swapping tensors between GPU/CPU memory and NVMe storage devices during ZeRO-Infinity and ZeRO-Offload training.

Description

Async IO Operations enable DeepSpeed to extend the memory hierarchy beyond GPU and CPU DRAM to include NVMe solid-state drives. This is the storage backbone of ZeRO-Infinity, which offloads parameters, gradients, and optimizer states to NVMe when CPU memory is also insufficient for extremely large models.

The subsystem provides several key capabilities:

  • Linux AIO integration: Uses the Linux kernel's asynchronous I/O interface (through the libaio library) to issue non-blocking read and write requests to NVMe devices, allowing computation to overlap with storage transfers
  • GPUDirect Storage (GDS): Optionally bypasses CPU memory entirely by using NVIDIA GDS to transfer data directly between GPU memory and NVMe storage via PCIe DMA
  • Pinned memory management: Manages page-locked (pinned) CPU memory buffers that enable efficient DMA transfers between devices
  • SIMD-accelerated memory copies: Uses vectorized memcpy operations for high-throughput CPU memory transfers during tensor staging
  • Handle-based API: A Python-exposed handle abstraction (aio_handle, gds_handle) that manages I/O queues, submission, and completion for both AIO and GDS paths
  • Benchmarking utilities: Performance sweep tools for tuning I/O parameters (queue depth, block size, parallelism) to maximize throughput on specific hardware configurations

The subsystem is compiled as a C++ extension via DeepSpeed's OpBuilder system and exposed through pybind11 bindings.

Usage

Enable NVMe offloading by setting offload_param.device or offload_optimizer.device to "nvme" in the DeepSpeed configuration and providing an offload_param.nvme_path or offload_optimizer.nvme_path pointing to a fast NVMe mount. DeepSpeed automatically manages the async I/O handles for tensor swapping. Use the ds_aio benchmark tool to tune I/O parameters for your specific NVMe hardware.
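As an illustration of the settings described above, the following is a minimal sketch of a configuration dictionary that offloads both parameters and optimizer state to NVMe. The mount point /local_nvme and the helper nvme_offload_targets are hypothetical, and the values in the aio section are only starting points to be tuned per device, not recommendations. ZeRO stage 3 is used here because parameter offloading requires it.

```python
import json

# Hypothetical mount point; replace with a directory on your own NVMe device.
NVME_PATH = "/local_nvme"

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
    },
    # Async I/O tuning knobs; these values are assumptions to be tuned.
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}


def nvme_offload_targets(config):
    """Return the offload sections of a config dict that target NVMe (hypothetical helper)."""
    zero = config.get("zero_optimization", {})
    return sorted(
        name
        for name in ("offload_param", "offload_optimizer")
        if zero.get(name, {}).get("device") == "nvme"
    )


if __name__ == "__main__":
    print(nvme_offload_targets(ds_config))
    print(json.dumps(ds_config["aio"], indent=2))
```

The dictionary can be written to a JSON file or passed directly as the configuration when initializing DeepSpeed.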

Theoretical Basis

Asynchronous I/O decouples the issuing of storage requests from their completion, allowing the CPU and GPU to continue computation while data transfers proceed in the background. This is critical for hiding the latency gap between NVMe storage, where a single request takes tens of microseconds or more, and GPU computation, where individual operations complete in nanoseconds.

Bandwidth hierarchy:

  • GPU HBM: 1-3 TB/s
  • CPU DDR: 100-400 GB/s
  • NVMe SSD: 3-14 GB/s (per device, higher with RAID)
  • Network: 25-400 Gbps (roughly 3-50 GB/s)
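To make the gaps in this hierarchy concrete, the sketch below estimates how long it would take to move a fixed amount of data at each tier, using only the bandwidth ranges listed above. The 20 GB payload is an assumed example (about 10 billion parameters at 2 bytes each), and the estimate ignores latency, protocol overhead, and contention.

```python
# Bandwidth ranges from the list above, in GB/s as (low, high).
# Network figures are converted from Gbps by dividing by 8.
TIERS_GB_PER_S = {
    "GPU HBM": (1000.0, 3000.0),
    "CPU DDR": (100.0, 400.0),
    "NVMe SSD": (3.0, 14.0),
    "Network": (25.0 / 8.0, 400.0 / 8.0),
}


def transfer_times(size_gb):
    """Map each tier to (best_case_seconds, worst_case_seconds) for size_gb of data."""
    times = {}
    for tier, (low_bw, high_bw) in TIERS_GB_PER_S.items():
        times[tier] = (size_gb / high_bw, size_gb / low_bw)
    return times


if __name__ == "__main__":
    # Assumed payload: 10e9 parameters * 2 bytes = 20 GB.
    for tier, (best, worst) in transfer_times(20.0).items():
        print(f"{tier:>8}: {best:8.3f} s to {worst:8.3f} s")
```

Under these assumptions a single NVMe device needs on the order of seconds for a transfer that HBM completes in milliseconds, which is why the transfer has to be hidden behind computation rather than waited on.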

Pipeline overlap model: The key insight is that NVMe reads for the next iteration's parameters can overlap with the current iteration's computation:

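# Abstract async I/O pipeline for ZeRO-Infinity
# (pseudocode: the helper functions are placeholders, not DeepSpeed APIs)
class AsyncIOHandle:
    def __init__(self, block_size, queue_depth, single_submit, overlap_events):
        self.aio_context = create_aio_context(queue_depth)
        # Separate staging buffers so an in-flight write is never
        # overwritten by the next prefetch
        self.read_buffer = allocate_pinned_memory(block_size * queue_depth)
        self.write_buffer = allocate_pinned_memory(block_size * queue_depth)
        self.pending_read = None  # tensor still waiting for its data

    def async_read(self, tensor, filename, blocking=False):
        submit_aio_read(self.aio_context, self.read_buffer, filename)
        self.pending_read = tensor
        if blocking:
            self.wait()

    def async_write(self, tensor, filename, blocking=False):
        copy_from_tensor(tensor, self.write_buffer)
        submit_aio_write(self.aio_context, self.write_buffer, filename)
        if blocking:
            self.wait()

    def wait(self):
        # Block until all submitted requests finish, then deliver any
        # completed read from the staging buffer into its tensor
        wait_for_completion(self.aio_context)
        if self.pending_read is not None:
            copy_to_tensor(self.read_buffer, self.pending_read)
            self.pending_read = None

# Training loop with overlapped NVMe I/O
handle = AsyncIOHandle(block_size, queue_depth, single_submit, overlap_events)
for step in training_steps:
    # Prefetch next step's parameters from NVMe (async)
    handle.async_read(next_params, nvme_path(step + 1))

    # Compute on current parameters (overlaps with I/O)
    loss = forward(current_params, batch)
    loss.backward()

    # Wait for the prefetch (and the previous step's gradient write)
    handle.wait()

    # Offload current gradients to NVMe (async)
    handle.async_write(grads, nvme_grad_path(step))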
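
The overlap pattern in the pseudocode above can be tried out with nothing but the Python standard library. The sketch below is a stand-in and not how DeepSpeed implements it: a single background thread plays the role of the asynchronous I/O engine, ordinary temporary files play the role of NVMe-resident parameters, and run_pipeline, read_file, and demo are hypothetical names chosen for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Blocking read; runs on the background I/O thread."""
    with open(path, "rb") as f:
        return f.read()


def run_pipeline(paths, compute):
    """Process files in order, prefetching file i+1 while computing on file i."""
    if not paths:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(read_file, paths[0])
        for i in range(len(paths)):
            current = pending.result()  # wait for the prefetch to complete
            if i + 1 < len(paths):
                # Issue the next read before computing so the two overlap.
                pending = io_pool.submit(read_file, paths[i + 1])
            results.append(compute(current))
    return results


def demo():
    """Write four small files and sum the bytes of each through the pipeline."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for step in range(4):
            path = os.path.join(tmp, f"params_{step}.bin")
            with open(path, "wb") as f:
                f.write(bytes([step]) * 1024)
            paths.append(path)
        return run_pipeline(paths, compute=sum)


if __name__ == "__main__":
    print(demo())
```

The ordering is the essential part: the next read is submitted before the current computation starts, and the only blocking point is the result() call at the top of the following iteration.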

Queue depth tuning: Higher queue depths allow the NVMe controller to reorder and parallelize requests internally, improving throughput. However, excessive queue depth increases memory consumption for pinned buffers. The optimal queue depth depends on the specific NVMe device and workload pattern.
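
The trade-off described above can be approximated with a simple model. The sketch below assumes that the staging buffer grows as block_size times queue_depth, that achievable throughput follows Little's law (bytes in flight divided by per-request latency), and that throughput is capped by the device's peak bandwidth. The latency and bandwidth figures are assumed values for illustration, not measurements, and real devices show rising latency as queue depth grows, so actual tuning should rely on benchmarking.

```python
def estimate_io(queue_depth, block_size_bytes, latency_s, device_bw_bytes_per_s):
    """Return (bytes_in_flight, estimated_throughput_bytes_per_s).

    bytes_in_flight doubles as the pinned staging memory needed when one
    block-sized buffer is reserved per queue slot.
    """
    in_flight = queue_depth * block_size_bytes
    throughput = min(device_bw_bytes_per_s, in_flight / latency_s)
    return in_flight, throughput


def sweep(queue_depths, block_size_bytes, latency_s, device_bw_bytes_per_s):
    """Evaluate the model across several queue depths."""
    return {
        qd: estimate_io(qd, block_size_bytes, latency_s, device_bw_bytes_per_s)
        for qd in queue_depths
    }


if __name__ == "__main__":
    # Assumed figures: 1 MiB blocks, 1 ms per request, 7 GB/s device peak.
    results = sweep([1, 2, 4, 8, 16, 32], 1048576, 1e-3, 7e9)
    for qd, (mem, bw) in results.items():
        print(f"qd={qd:>2}  pinned={mem / 2**20:5.0f} MiB  est={bw / 1e9:5.2f} GB/s")
```

In this model the estimated throughput stops improving once the device's peak bandwidth is reached, while pinned memory keeps growing linearly with queue depth, which matches the diminishing returns described above.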
