Principle:FMInference FlexLLMGen AIO Performance Benchmarking
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, NVMe Storage, Performance Tuning |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Systematic exploration of asynchronous I/O configuration parameters to identify optimal throughput settings for a specific NVMe storage device and system configuration.
Description
Achieving peak NVMe I/O throughput for tensor swapping requires tuning multiple interacting parameters. Because the optimal configuration depends on hardware-specific factors (NVMe controller design, PCIe topology, CPU architecture, kernel version), the most reliable approach is empirical combinatorial search: measuring actual throughput across all combinations of relevant parameters and selecting the configuration that maximizes sustained bandwidth.
The key insight is that no single parameter dominates performance in isolation. Block size affects the granularity of I/O requests and interacts with the NVMe controller's internal alignment. Queue depth determines how many requests are in flight simultaneously and must match the device's internal parallelism. Overlap mode versus sequential mode determines whether the software pipeline can keep the device queue saturated. I/O parallelism (number of threads) determines how many independent I/O streams are active. These parameters interact non-linearly, making exhaustive measurement the most practical optimization strategy.
Usage
Apply this principle when deploying NVMe-based tensor offloading on new hardware. Run the performance sweep once per unique hardware configuration (NVMe device + host system) and use the discovered optimal parameters for all subsequent inference or training runs.
Theoretical Basis
Combinatorial Parameter Search
Given n parameters, where parameter i has k_i possible values, the search space is the Cartesian product of the parameter value sets and contains k_1 × k_2 × ... × k_n configurations. For typical AIO configurations:
- Block size: 2-4 values (e.g., 128K, 256K, 512K, 1M)
- Queue depth: 3-5 values (e.g., 4, 8, 16, 32, 64)
- Overlap mode: 2 values (true/false)
- I/O parallelism: 3-4 values (e.g., 1, 2, 4, 8)
- Single submit: 2 values (true/false)
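This yields a manageable search space of roughly 70 to 320 configurations (2 × 3 × 2 × 3 × 2 = 72 at the low end, 4 × 5 × 2 × 4 × 2 = 320 at the high end). Each configuration takes seconds to evaluate, so exhaustive search is practical.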
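The enumeration described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual sweep script: the dictionary keys (`block_size`, `queue_depth`, `overlap_events`, `io_parallel`, `single_submit`) are hypothetical names chosen for this example, the value sets are the largest example sets listed above, and `measure` is a placeholder for whatever benchmark routine is run on the target system.

```python
from itertools import product

# Illustrative value sets taken from the examples above; adjust per device.
# The key names are assumptions made for this sketch.
SEARCH_SPACE = {
    "block_size": ["128K", "256K", "512K", "1M"],
    "queue_depth": [4, 8, 16, 32, 64],
    "overlap_events": [True, False],
    "io_parallel": [1, 2, 4, 8],
    "single_submit": [True, False],
}


def enumerate_configs(space):
    """Yield one dict per point in the Cartesian product of the value sets."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def sweep(space, measure):
    """Run `measure(config)` on every configuration and return the best one.

    `measure` is a caller-supplied (hypothetical) callable that runs a single
    benchmark and returns the sustained throughput for that configuration.
    """
    results = [(measure(cfg), cfg) for cfg in enumerate_configs(space)]
    best_throughput, best_cfg = max(results, key=lambda r: r[0])
    return best_cfg, best_throughput, results


configs = list(enumerate_configs(SEARCH_SPACE))
print(len(configs))  # 4 * 5 * 2 * 4 * 2 = 320
```

Keeping the search space in a single dictionary makes it easy to shrink or extend a value set without touching the enumeration logic.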
Page Cache Interference
When benchmarking storage I/O, the operating system's page cache can mask true device performance by serving subsequent reads from DRAM. To obtain accurate measurements:
- Call `sync` to flush pending writes to the device.
- Drop the page cache (`echo 1 > /proc/sys/vm/drop_caches`) to force reads to originate from the device.
Without cache flushing, reported read throughput may be 10-100x higher than actual device throughput, depending on how much test data fits in available DRAM.
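The two cache-flushing steps can also be issued from inside a benchmark driver before each read measurement. The helper below is a sketch under stated assumptions: the function name `flush_page_cache` is hypothetical, it only works on Linux, and writing to `/proc/sys/vm/drop_caches` requires root privileges. The path is exposed as a parameter solely so the function can be tried against a scratch file.

```python
import os

DROP_CACHES_PATH = "/proc/sys/vm/drop_caches"


def flush_page_cache(drop_caches_path=DROP_CACHES_PATH):
    """Flush dirty pages, then ask the kernel to drop the clean page cache.

    Hypothetical helper for illustration. Writing to the real
    /proc/sys/vm/drop_caches file requires root on Linux.
    """
    os.sync()  # same effect as running the `sync` command
    with open(drop_caches_path, "w") as f:
        f.write("1\n")  # 1 = drop the page cache only
```

Calling `flush_page_cache()` immediately before each timed read keeps earlier runs from warming the cache for later ones.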
Read vs. Write Asymmetry
NVMe devices typically exhibit different performance characteristics for reads versus writes due to internal write amplification, garbage collection, and wear-leveling operations. Sweeping both read and write independently is necessary because optimal parameters may differ between the two operations.
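Because the best read and write settings may differ, the sweep results are best reduced per operation rather than globally. The following sketch assumes each measurement is recorded as an `(op, config, throughput)` tuple; that layout and the function name `best_config_per_op` are illustrative choices, not part of any existing tool.

```python
def best_config_per_op(results):
    """Pick the highest-throughput configuration separately for each operation.

    `results` is an iterable of (op, config, throughput) tuples, where `op`
    is a label such as "read" or "write". The tuple layout is an assumption
    made for this sketch.
    """
    best = {}
    for op, config, throughput in results:
        # Keep the first configuration seen for an op, then replace it
        # only when a strictly higher throughput is measured.
        if op not in best or throughput > best[op][1]:
            best[op] = (config, throughput)
    return best
```

The returned dictionary maps each operation to its own `(config, throughput)` pair, so read and write paths can be configured independently.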