Environment:Deepspeedai DeepSpeed NVMe Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Storage, NVMe_Offload
Last Updated: 2026-02-09 00:00 GMT

Overview

NVMe storage environment for DeepSpeed's asynchronous I/O subsystem, enabling optimizer state and parameter offloading to fast NVMe SSDs.

Description

This environment provides the NVMe storage backend required by DeepSpeed's async I/O (AIO) subsystem. The AIO system uses Linux's native asynchronous I/O interface (libaio) to perform high-throughput tensor reads and writes to NVMe devices. This is essential for ZeRO-Infinity and ZeRO-Offload scenarios where optimizer states, gradients, or parameters are swapped between GPU/CPU memory and NVMe storage.

The environment requires Linux with the `libaio` library installed, and an NVMe SSD mounted with appropriate permissions. The AIO subsystem validates the storage device's capabilities at initialization time, including checking for direct I/O support and measuring achievable bandwidth.

Usage

Use this environment when training models with ZeRO-Infinity (ZeRO Stage 3 with NVMe offload) or when using DeepSpeed's parameter or optimizer state swapping to NVMe. It is required for any workflow that sets `offload_optimizer.device: "nvme"` or `offload_param.device: "nvme"` in the `zero_optimization` section of the DeepSpeed configuration.
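A minimal sketch of a configuration that triggers this requirement, written as a Python dict that can be dumped to a JSON config file. The mount point `/local_nvme`, the batch size, and the helper `uses_nvme_offload` are illustrative assumptions rather than values from this page; `nvme_path` and `pin_memory` are offload options in the DeepSpeed configuration schema.

```python
import json

NVME_PATH = "/local_nvme"  # hypothetical mount point; replace with your own

ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # illustrative value
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
    },
}


def uses_nvme_offload(config):
    """Return True if either offload target in the config is set to NVMe."""
    zero = config.get("zero_optimization", {})
    return any(
        zero.get(key, {}).get("device") == "nvme"
        for key in ("offload_optimizer", "offload_param")
    )


if __name__ == "__main__":
    print(json.dumps(ds_config, indent=2))
    print("needs NVMe environment:", uses_nvme_offload(ds_config))
```

A check like `uses_nvme_offload` can be run before launching a job to decide whether the NVMe prerequisites on this page need to be verified at all.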

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Windows not supported for AIO |
| Storage | NVMe SSD | Mounted with read/write permissions; direct I/O support recommended |
| Library | `libaio` (libaio-dev) | Linux asynchronous I/O library; required for kernel AIO operations |
| Filesystem | ext4 or xfs recommended | Must support O_DIRECT for optimal performance |
| Permissions | Read/write access to NVMe mount point | User must have permissions on the offload directory |
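The permission and direct I/O requirements above can be probed before training starts. The following is a rough preflight sketch, not DeepSpeed's own validation logic: the function name `check_offload_dir` and the 4096-byte probe size are assumptions, and the check only attempts one aligned `O_DIRECT` write to a temporary file in the target directory.

```python
import mmap
import os
import tempfile


def check_offload_dir(path, block_size=4096):
    """Return (writable, direct_io) for a candidate offload directory.

    Hypothetical helper: writes one aligned block with O_DIRECT to a
    temporary file and reports whether the filesystem accepted it.
    """
    if not (os.path.isdir(path) and os.access(path, os.R_OK | os.W_OK)):
        return False, False
    if not hasattr(os, "O_DIRECT"):  # flag is unavailable on non-Linux platforms
        return True, False
    try:
        fd, probe = tempfile.mkstemp(dir=path)
    except OSError:
        return False, False
    os.close(fd)
    direct_io = False
    try:
        fd = os.open(probe, os.O_WRONLY | os.O_DIRECT)
        try:
            # Anonymous mmap memory is page-aligned, as O_DIRECT requires.
            with mmap.mmap(-1, block_size) as buf:
                os.write(fd, buf)
            direct_io = True
        finally:
            os.close(fd)
    except OSError:
        direct_io = False  # open or write rejected: no usable direct I/O here
    finally:
        os.unlink(probe)
    return True, direct_io


if __name__ == "__main__":
    print(check_offload_dir(tempfile.gettempdir()))
```

A result of `(True, False)` means the directory is usable but direct I/O was rejected, which is typical for tmpfs or some network filesystems.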

Dependencies

System Packages

  • `libaio-dev` (Debian/Ubuntu) or `libaio-devel` (RHEL/CentOS) - Linux AIO library headers
  • `libaio1` (runtime library)

Python Packages

  • `torch` (with CPU or CUDA support)
  • `deepspeed` (with AIO op builder compiled)

Credentials

The following environment variable and configuration settings affect NVMe I/O behavior:

  • `DLTS_HOSTFILE`: Used in distributed settings to coordinate NVMe paths across nodes
  • The `aio` section of the DeepSpeed config controls `block_size`, `queue_depth`, `thread_count`, `single_submit`, and `overlap_events`
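As an illustration of the `aio` section, the sketch below builds one from a set of baseline values and rejects unknown keys. The numbers are believed to match the defaults in the DeepSpeed configuration documentation, but treat them as assumptions and confirm them against the installed version; the helper `make_aio_section` is hypothetical.

```python
# Assumed baseline values; verify against the DeepSpeed version in use.
AIO_DEFAULTS = {
    "block_size": 1048576,   # bytes per I/O request
    "queue_depth": 8,        # requests kept in flight per thread
    "thread_count": 1,       # I/O worker threads
    "single_submit": False,  # submit requests one at a time instead of in batches
    "overlap_events": True,  # overlap submission with completion handling
}


def make_aio_section(**overrides):
    """Return an {"aio": {...}} config fragment with the given overrides."""
    unknown = set(overrides) - set(AIO_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown aio keys: {sorted(unknown)}")
    return {"aio": {**AIO_DEFAULTS, **overrides}}


if __name__ == "__main__":
    print(make_aio_section(queue_depth=32, thread_count=4))
```

The returned fragment can be merged into the top level of a DeepSpeed config dict alongside `zero_optimization`.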

Quick Install

# Install libaio development headers
sudo apt-get install libaio-dev  # Debian/Ubuntu
# sudo yum install libaio-devel  # RHEL/CentOS

# Install DeepSpeed (AIO ops are JIT compiled on first use)
pip install deepspeed

# Verify AIO support (ds_report lists the op as async_io)
ds_report | grep async_io
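The same verification can be done from Python. This sketch assumes the `AsyncIOBuilder` class exported by `deepspeed.ops.op_builder` and its `is_compatible()` method; the wrapper `aio_status` is a hypothetical helper, and it reports `None` when DeepSpeed itself cannot be imported.

```python
import ctypes.util


def aio_status():
    """Report whether libaio is discoverable and whether the AIO op can build."""
    status = {"libaio": ctypes.util.find_library("aio") is not None}
    try:
        from deepspeed.ops.op_builder import AsyncIOBuilder
    except ImportError:
        status["async_io_compatible"] = None  # DeepSpeed is not installed
    else:
        status["async_io_compatible"] = bool(AsyncIOBuilder().is_compatible())
    return status


if __name__ == "__main__":
    print(aio_status())
```

If `libaio` is `False`, install the system package first; if `async_io_compatible` is `False` while `libaio` is `True`, the compiler toolchain is the more likely culprit.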

Code Evidence

AIO operation modes, abridged from `csrc/aio/common/deepspeed_aio_common.cpp` (function bodies elided):

// Sequential I/O: submit batch, wait for completion, repeat
static int _do_io_sequential(const long long int n_iocbs, struct iocb** iocbs,
                              io_context_t aio_ctxt, int n_completions) {
    // Submit all iocbs then wait for all completions
}

// Overlap I/O: maintain full queue depth by overlapping submit and complete
static int _do_io_overlap(const long long int n_iocbs, struct iocb** iocbs,
                           io_context_t aio_ctxt, int n_completions) {
    // Submit initial batch, then overlap completion tracking with new submissions
}
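To see why the overlap mode can help, the toy timing model below compares the two strategies. It is a simulation under stated assumptions, not DeepSpeed code: each request has a fixed duration, the sequential mode waits for a whole batch of `queue_depth` requests before submitting the next, and the overlap mode submits a new request as soon as any slot frees up. The function names are hypothetical.

```python
import heapq


def sequential_time(durations, queue_depth):
    """Total time when each batch must fully complete before the next is submitted."""
    return sum(
        max(durations[i:i + queue_depth])
        for i in range(0, len(durations), queue_depth)
    )


def overlap_time(durations, queue_depth):
    """Total time when a new request is submitted as soon as a slot frees up."""
    in_flight = []  # min-heap of completion times
    now = 0
    for d in durations:
        if len(in_flight) == queue_depth:
            now = heapq.heappop(in_flight)  # wait for the earliest completion
        heapq.heappush(in_flight, now + d)
    return max(in_flight, default=0)


if __name__ == "__main__":
    durations = [4, 1, 1, 1, 4, 1, 1, 1]  # arbitrary time units
    print("sequential:", sequential_time(durations, queue_depth=2))  # 10
    print("overlap:   ", overlap_time(durations, queue_depth=2))     # 7
```

In this model the overlap strategy never takes longer than the sequential one for the same request order, because a slow request only blocks its own slot rather than the whole batch.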

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `libaio not found` | libaio-dev not installed | `sudo apt-get install libaio-dev` |
| `AIO op builder failed` | Missing libaio headers or incompatible compiler | Install libaio-dev and ensure gcc/g++ is available |
| `Permission denied on NVMe path` | Insufficient permissions on offload directory | Check mount permissions and user access |
| `O_DIRECT not supported` | Filesystem does not support direct I/O | Use ext4 or xfs filesystem; check mount options |
