Principle:FMInference FlexLLMGen Async IO NVMe Operations

Knowledge Sources	FMInference_FlexLLMGen
Domains	Async IO, NVMe Storage, Systems Programming
Last Updated	2026-02-09 12:00 GMT

Overview

Asynchronous, direct I/O techniques for transferring large tensors between host memory and NVMe storage devices with minimal kernel overhead and predictable throughput.

Description

When model parameters or optimizer states must be swapped to NVMe storage, the I/O layer must achieve near-hardware-peak bandwidth while avoiding interference with ongoing GPU computation. This requires three key techniques working together: direct I/O (bypassing the OS page cache), asynchronous submission (decoupling I/O requests from their completion), and operation pipelining (overlapping new submissions with in-flight completions).

Direct I/O via the O_DIRECT flag ensures that data transfers go directly between user-space buffers and the storage device, avoiding double-copying through the kernel page cache. This is essential for large tensor transfers where page cache pollution would degrade overall system performance and make throughput unpredictable.

The asynchronous I/O model uses a submit-and-reap pattern: I/O control blocks (iocbs) are handed to the kernel with io_submit, which queues them without blocking the calling thread on the transfers themselves. The application later reaps completed events with io_getevents. Two scheduling strategies exist: sequential (submit a batch, wait for the full batch, repeat) and overlapped (continuously submit new requests as earlier ones complete, maintaining a full pipeline).
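The overlapped variant of the submit-and-reap pattern can be sketched in a few lines. This is an illustration only: it uses a thread pool and os.pread as a stand-in for kernel AIO (it does not call io_submit or io_getevents), and the function name overlapped_read and its parameters are hypothetical, not part of FlexLLMGen or DeepSpeed. It also assumes a regular file, where os.pread returns a full chunk except at end of file.

```python
import os
import tempfile
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def overlapped_read(path, chunk_size, queue_depth):
    """Read a file in chunks, keeping up to queue_depth reads in flight.

    Stand-in for kernel AIO: pool.submit plays the role of "submit",
    wait(..., FIRST_COMPLETED) plays the role of "reap".
    """
    total = os.path.getsize(path)
    offsets = iter(range(0, total, chunk_size))
    result = bytearray(total)
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=queue_depth) as pool:
            in_flight = {}  # future -> file offset of that request

            def submit_next():
                off = next(offsets, None)
                if off is None:
                    return False  # nothing left to submit
                in_flight[pool.submit(os.pread, fd, chunk_size, off)] = off
                return True

            # Fill the window once, then refill one slot per completion.
            for _ in range(queue_depth):
                if not submit_next():
                    break
            while in_flight:
                done, _pending = wait(in_flight, return_when=FIRST_COMPLETED)
                for fut in done:
                    off = in_flight.pop(fut)
                    data = fut.result()
                    result[off:off + len(data)] = data
                    submit_next()
    finally:
        os.close(fd)
    return bytes(result)


if __name__ == "__main__":
    payload = os.urandom(1_000_000)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
    try:
        assert overlapped_read(f.name, 64 * 1024, 4) == payload
    finally:
        os.unlink(f.name)
```

The in_flight dictionary is the bookkeeping that overlapped mode requires: each completion frees one slot, which is refilled immediately so the window stays full until the offsets run out.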

Usage

Apply this principle when designing storage-tier offloading for ML training or inference systems. The choice between sequential and overlapped I/O affects throughput: overlapped mode achieves higher sustained bandwidth on high-queue-depth NVMe devices, while sequential mode is simpler and sufficient for devices with low internal parallelism.

Theoretical Basis

Direct I/O and Page Cache Bypass

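Standard file I/O passes through the kernel's page cache, which adds an intermediate copy (user buffer to page cache, then page cache to device) and causes cache pollution when transferring large contiguous regions. Direct I/O (O_DIRECT) eliminates the intermediate copy. In exchange, the buffer address, the file offset, and the transfer length generally must all be aligned to the logical block size of the underlying storage (commonly 512 or 4096 bytes); unaligned requests typically fail with EINVAL. For large sequential transfers, the throughput improvement over buffered I/O is typically on the order of 10-30%, though the figure varies with the device, filesystem, and transfer size, and latency is significantly more predictable.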
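The alignment bookkeeping that direct I/O imposes can be sketched as follows. The helper names are hypothetical, and the 4096-byte block size is an assumption for illustration; the real logical block size should be queried from the device. The sketch relies on an anonymous mmap returning page-aligned memory.

```python
import ctypes
import mmap


def align_up(n, block):
    """Round n up to the next multiple of block."""
    return (n + block - 1) // block * block


def align_down(n, block):
    """Round n down to the previous multiple of block."""
    return n // block * block


def aligned_buffer(nbytes, block=4096):
    """Allocate a buffer suitable for direct I/O.

    An anonymous mmap starts on a page boundary; the length is rounded
    up so that whole blocks can be transferred. block=4096 is assumed.
    """
    return mmap.mmap(-1, align_up(nbytes, block))


def buffer_address(buf):
    """Return the starting memory address of a writable buffer."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))


def is_direct_io_ok(buf, offset, length, block=4096):
    """Check the three alignment conditions: address, offset, length."""
    return (
        buffer_address(buf) % block == 0
        and offset % block == 0
        and length % block == 0
    )
```

A tensor of 10,000 bytes would therefore occupy a 12,288-byte buffer under this assumed block size. On Linux, such a buffer could then be passed to os.preadv or os.pwritev on a descriptor opened with os.O_DIRECT, provided the filesystem supports it.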

Queue Depth and NVMe Parallelism

NVMe devices achieve peak throughput only when multiple I/O requests are in flight simultaneously, because the device has internal parallelism across multiple flash channels. The queue depth parameter controls how many requests the application keeps in flight. Optimal queue depth depends on the device and on the request size, and is best confirmed by measurement: as a rough guide, consumer NVMe SSDs typically saturate at queue depth 4-8, while enterprise devices may benefit from queue depth 32 or higher.
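A first estimate of the queue depth needed for a target bandwidth follows from Little's law: the average number of requests in flight equals the request completion rate multiplied by the per-request latency. The sketch below applies that relation under the simplifying assumption of a constant per-request latency; the function names and the example figures are hypothetical, not measurements of any device.

```python
import math


def achievable_bandwidth(queue_depth, request_bytes, latency_s):
    """Bytes per second sustained with queue_depth requests in flight.

    Assumes every request takes latency_s seconds regardless of load,
    which holds only until the device saturates.
    """
    return queue_depth * request_bytes / latency_s


def required_queue_depth(target_bytes_per_s, request_bytes, latency_s):
    """Smallest whole queue depth that reaches the target bandwidth."""
    return math.ceil(target_bytes_per_s * latency_s / request_bytes)


if __name__ == "__main__":
    # Hypothetical figures: 3 GB/s target with two request sizes.
    print(required_queue_depth(3e9, 1 << 20, 1e-3))  # 1 MiB requests, 1 ms each
    print(required_queue_depth(3e9, 4096, 1e-4))     # 4 KiB requests, 100 us each
```

The estimate makes the role of request size visible: large requests need only a few outstanding operations, while small requests need a much deeper queue to reach the same bandwidth.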

Sequential vs. Overlapped Submission

In sequential mode, each batch of iocbs is submitted and fully reaped before the next batch is submitted. This creates idle periods between batches. In overlapped mode, the application maintains a sliding window of in-flight requests, submitting new ones as completions arrive. This keeps the device queue full and eliminates inter-batch gaps, achieving higher sustained throughput at the cost of more complex bookkeeping.
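The inter-batch gaps can be made concrete with a small scheduling model. This is a toy simulation, not a measurement: it assumes the device services up to queue_depth requests fully in parallel, that submission costs nothing, and that per-request service times are known in advance. The function names are hypothetical.

```python
import heapq


def sequential_makespan(service_times, queue_depth):
    """Total time when each batch is fully reaped before the next starts.

    Every batch lasts as long as its slowest request, so faster slots
    sit idle until the batch ends.
    """
    total = 0.0
    for i in range(0, len(service_times), queue_depth):
        total += max(service_times[i:i + queue_depth])
    return total


def overlapped_makespan(service_times, queue_depth):
    """Total time with a sliding window of queue_depth in-flight requests.

    Each new request starts as soon as any slot becomes free.
    """
    slots = [0.0] * queue_depth  # time at which each slot becomes free
    heapq.heapify(slots)
    finish = 0.0
    for t in service_times:
        start = heapq.heappop(slots)  # earliest free slot
        end = start + t
        heapq.heappush(slots, end)
        finish = max(finish, end)
    return finish


if __name__ == "__main__":
    trace = [3, 1, 1, 3, 1, 1]  # hypothetical service times, arbitrary units
    print(sequential_makespan(trace, 2))
    print(overlapped_makespan(trace, 2))
```

For this trace with queue depth 2, the sequential schedule takes 7 time units and the overlapped schedule takes 5. The difference is exactly the time that slots spend idle waiting for the slowest request in each batch. When all requests take the same time, the two schedules coincide in this model.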

Performance Metrics

Key metrics for evaluating async I/O performance include:

Submit latency: Time to hand off iocbs to the kernel via io_submit.
Completion latency: Time for io_getevents to return completed events.
End-to-end rate: Total bytes transferred divided by wall-clock time, measured in GB/s.
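One way to collect these three metrics around a batch is sketched below. The IOStats class and the submit and reap callables are hypothetical placeholders for whatever wraps io_submit and io_getevents in a real system; GB/s is taken here as 10^9 bytes per second.

```python
import time
from dataclasses import dataclass


@dataclass
class IOStats:
    submit_s: float = 0.0    # cumulative time spent submitting
    complete_s: float = 0.0  # cumulative time spent waiting for completions
    bytes_done: int = 0      # total bytes transferred
    wall_s: float = 0.0      # cumulative wall-clock time

    def gb_per_s(self):
        """End-to-end rate in GB/s (1 GB = 1e9 bytes)."""
        if self.wall_s <= 0:
            return 0.0
        return self.bytes_done / self.wall_s / 1e9


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0


def run_batch(submit, reap, requests, stats):
    """Submit one batch, reap it, and accumulate timings into stats.

    submit(requests) hands the batch off; reap(n) blocks until n
    completions arrive and returns the number of bytes transferred.
    Both are caller-supplied, hypothetical callables.
    """
    start = time.perf_counter()
    _, dt = timed(submit, requests)
    stats.submit_s += dt
    nbytes, dt = timed(reap, len(requests))
    stats.complete_s += dt
    stats.bytes_done += nbytes
    stats.wall_s += time.perf_counter() - start
```

Keeping submit and completion time separate shows where a slow transfer loses time: a large submit share points at per-request overhead, while a large completion share points at the device or at too shallow a queue.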

Related Pages

Implementation:FMInference_FlexLLMGen_DeepSpeed_AIO_Common
