Principle:FMInference FlexLLMGen Async IO NVMe Operations
| Knowledge Sources | |
|---|---|
| Domains | Async IO, NVMe Storage, Systems Programming |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Asynchronous, direct I/O techniques for transferring large tensors between host memory and NVMe storage devices with minimal kernel overhead and predictable throughput.
Description
When model parameters or optimizer states must be swapped to NVMe storage, the I/O layer must achieve near-hardware-peak bandwidth while avoiding interference with ongoing GPU computation. This requires three key techniques working together: direct I/O (bypassing the OS page cache), asynchronous submission (decoupling I/O requests from their completion), and operation pipelining (overlapping new submissions with in-flight completions).
Direct I/O via the O_DIRECT flag ensures that data transfers go directly between user-space buffers and the storage device, avoiding double-copying through the kernel page cache. This is essential for large tensor transfers where page cache pollution would degrade overall system performance and make throughput unpredictable.
The asynchronous I/O model uses a submit-and-reap pattern: I/O control blocks (iocbs) are submitted to the kernel, which processes them without blocking the calling thread. The application later reaps completed events. Two scheduling strategies exist: sequential (submit a batch, wait for the full batch, repeat) and overlapped (continuously submit new requests as earlier ones complete, maintaining a full pipeline).
Usage
Apply this principle when designing storage-tier offloading for ML training or inference systems. The choice between sequential and overlapped I/O affects throughput: overlapped mode achieves higher sustained bandwidth on high-queue-depth NVMe devices, while sequential mode is simpler and sufficient for devices with low internal parallelism.
Theoretical Basis
Direct I/O and Page Cache Bypass
Standard file I/O passes through the kernel's page cache, which introduces two copies (user buffer to page cache, page cache to device) and causes cache pollution when transferring large contiguous regions. Direct I/O (O_DIRECT) eliminates the intermediate copy, but in exchange the buffer address, file offset, and transfer length must all be aligned, typically to the logical block size of the underlying device (commonly 512 bytes or 4 KiB). The throughput improvement for large sequential transfers is typically 10-30% over buffered I/O, with significantly more predictable latency.
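The alignment requirement is easiest to see in code. The following is a minimal Python sketch, not the FlexLLMGen implementation: it assumes a 4096-byte alignment (the `ALIGN` constant is an assumption, not a value from the source), uses an anonymous `mmap` to obtain page-aligned memory, and falls back to buffered I/O when the filesystem rejects O_DIRECT. The helper names `round_up` and `direct_roundtrip` are hypothetical.

```python
import mmap
import os

ALIGN = 4096  # assumed alignment; the real requirement depends on device and filesystem


def round_up(n: int, align: int = ALIGN) -> int:
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def _roundtrip(path: str, buf: mmap.mmap, flags: int) -> bytes:
    """Write buf at offset 0, read the same number of bytes back, return them."""
    fd = os.open(path, flags, 0o600)
    try:
        if os.pwrite(fd, buf, 0) != len(buf):
            raise OSError("short write")
        out = mmap.mmap(-1, len(buf))  # page-aligned destination buffer
        try:
            os.lseek(fd, 0, os.SEEK_SET)
            n = os.readv(fd, [out])  # reads directly into the aligned buffer
            return out[:n]
        finally:
            out.close()
    finally:
        os.close(fd)


def direct_roundtrip(path: str, payload: bytes) -> bytes:
    """Write payload (zero-padded to ALIGN) with O_DIRECT if possible and read it back."""
    buf = mmap.mmap(-1, round_up(max(len(payload), 1)))  # anonymous, page-aligned
    try:
        buf[:len(payload)] = payload  # the tail stays zero and acts as padding
        base = os.O_RDWR | os.O_CREAT
        direct = getattr(os, "O_DIRECT", 0)  # 0 on platforms without O_DIRECT
        try:
            return _roundtrip(path, buf, base | direct)
        except OSError:
            # Some filesystems reject O_DIRECT; retry with buffered I/O.
            return _roundtrip(path, buf, base)
    finally:
        buf.close()
```

Because the transfer length must also be aligned, the returned data is padded to a multiple of `ALIGN`; a caller would need to track the true payload length separately.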
Queue Depth and NVMe Parallelism
NVMe devices achieve peak throughput only when multiple I/O requests are in flight simultaneously, because the device has internal parallelism across multiple flash channels. The queue depth parameter controls how many requests the application keeps in flight. Optimal queue depth depends on the device: consumer NVMe SSDs typically saturate at queue depth 4-8, while enterprise devices may benefit from queue depth 32 or higher.
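A rough way to reason about the queue depth needed for a target bandwidth is Little's law: the average number of requests in flight equals the request rate multiplied by the per-request latency. The Python sketch below applies it; the function name and the example numbers are illustrative assumptions, not measurements from any device. Because per-request latency usually grows as the queue deepens, the result is better read as a lower bound than as a tuning recommendation.

```python
import math


def required_queue_depth(target_bytes_per_s: float,
                         request_bytes: int,
                         latency_s: float) -> int:
    """Estimate in-flight requests needed to sustain a target bandwidth.

    Little's law: in_flight = request_rate * latency, where
    request_rate = target_bytes_per_s / request_bytes.
    """
    request_rate = target_bytes_per_s / request_bytes  # requests per second
    return max(1, math.ceil(request_rate * latency_s))


# Hypothetical example: 3 GB/s target, 1 MiB requests, 1 ms per-request latency.
depth = required_queue_depth(3e9, 1 << 20, 1e-3)
```

Smaller requests or higher latency both push the required depth up, which is one reason block size and queue depth are usually tuned together.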
Sequential vs. Overlapped Submission
In sequential mode, each batch of iocbs is submitted and fully reaped before the next batch is submitted. This creates idle periods between batches. In overlapped mode, the application maintains a sliding window of in-flight requests, submitting new ones as completions arrive. This keeps the device queue full and eliminates inter-batch gaps, achieving higher sustained throughput at the cost of more complex bookkeeping.
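The difference between the two schedules can be sketched in Python. This example does not use Linux native AIO: a `ThreadPoolExecutor` running `os.pread` stands in for io_submit, and `concurrent.futures.wait` stands in for io_getevents. The function names and the `depth` parameter are hypothetical, and the pool needs at least `depth` workers for that many reads to actually be in flight.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def read_sequential(fd, offsets, size, depth, pool):
    """Submit up to `depth` reads, wait for the whole batch, then repeat."""
    results = {}
    for i in range(0, len(offsets), depth):
        batch = {pool.submit(os.pread, fd, size, off): off
                 for off in offsets[i:i + depth]}
        wait(batch)  # barrier: nothing new is submitted until the batch is done
        for fut, off in batch.items():
            results[off] = fut.result()
    return results


def read_overlapped(fd, offsets, size, depth, pool):
    """Keep up to `depth` reads in flight, refilling the window on each completion."""
    results, pending = {}, {}
    todo = iter(offsets)
    while True:
        # Top up the sliding window until it is full or no work remains.
        while len(pending) < depth:
            off = next(todo, None)
            if off is None:
                break
            pending[pool.submit(os.pread, fd, size, off)] = off
        if not pending:
            return results
        # Reap whatever has finished, then loop back to submit replacements.
        done, _ = wait(pending, return_when=FIRST_COMPLETED)
        for fut in done:
            results[pending.pop(fut)] = fut.result()
```

Both functions return the same mapping from offset to bytes; they differ only in when new requests are issued. The extra `pending` dictionary in the overlapped version is the "more complex bookkeeping" mentioned above.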
Performance Metrics
Key metrics for evaluating async I/O performance include:
- Submit latency: Time to hand off iocbs to the kernel via io_submit.
- Completion latency: Time for io_getevents to return completed events.
- End-to-end rate: Total bytes transferred divided by wall-clock time, measured in GB/s.
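One way to collect these three metrics is to time the submit and reap phases separately and derive the rate from the total wall-clock time. The Python sketch below assumes the caller supplies two callables, `submit` and `reap`, as hypothetical stand-ins for wrappers around io_submit and io_getevents; it also assumes 1 GB = 10^9 bytes.

```python
import time
from dataclasses import dataclass


def gb_per_s(total_bytes: int, wall_s: float) -> float:
    """End-to-end rate in GB/s (1 GB = 1e9 bytes)."""
    if wall_s <= 0:
        return float("inf")  # guard against a zero-length measurement
    return total_bytes / wall_s / 1e9


@dataclass
class IoTimings:
    submit_s: float      # time spent handing requests to the kernel
    completion_s: float  # time spent waiting for completions
    wall_s: float        # total elapsed time for the transfer
    total_bytes: int

    @property
    def rate_gb_per_s(self) -> float:
        return gb_per_s(self.total_bytes, self.wall_s)


def measure(submit, reap, total_bytes: int) -> IoTimings:
    """Time one submit phase followed by one reap phase."""
    t0 = time.perf_counter()
    submit()  # hypothetical wrapper around io_submit
    t1 = time.perf_counter()
    reap()    # hypothetical wrapper around io_getevents
    t2 = time.perf_counter()
    return IoTimings(t1 - t0, t2 - t1, t2 - t0, total_bytes)
```

In overlapped mode the submit and reap phases interleave, so per-phase timings would have to be accumulated across iterations, while the wall-clock time is still taken once around the whole transfer.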