Principle:FMInference FlexLLMGen AIO Performance Benchmarking

Knowledge Sources	FMInference_FlexLLMGen
Domains	Benchmarking, NVMe Storage, Performance Tuning
Last Updated	2026-02-09 12:00 GMT

Overview

Systematic exploration of asynchronous I/O configuration parameters to identify optimal throughput settings for a specific NVMe storage device and system configuration.

Description

Achieving peak NVMe I/O throughput for tensor swapping requires tuning multiple interacting parameters. Because the optimal configuration depends on hardware-specific factors (NVMe controller design, PCIe topology, CPU architecture, kernel version), the most reliable approach is empirical combinatorial search: measuring actual throughput across all combinations of relevant parameters and selecting the configuration that maximizes sustained bandwidth.

The key insight is that no single parameter dominates performance in isolation. Block size affects the granularity of I/O requests and interacts with the NVMe controller's internal alignment. Queue depth determines how many requests are in flight simultaneously and must match the device's internal parallelism. The choice between overlap mode and sequential mode determines whether the software pipeline can keep the device queue saturated. I/O parallelism (number of threads) determines how many independent I/O streams are active. These parameters interact non-linearly, making exhaustive measurement the most practical optimization strategy.

Usage

Apply this principle when deploying NVMe-based tensor offloading on new hardware. Run the performance sweep once per unique hardware configuration (NVMe device + host system) and use the discovered optimal parameters for all subsequent inference or training runs.

Theoretical Basis

Combinatorial Parameter Search

Given n parameters, each with k_i possible values, the search space is the Cartesian product of all parameter value sets, so its size is k_1 × k_2 × ... × k_n. For typical AIO configurations:

Block size: 2-4 values (e.g., 128K, 256K, 512K, 1M)
Queue depth: 3-5 values (e.g., 4, 8, 16, 32, 64)
Overlap mode: 2 values (true/false)
I/O parallelism: 3-4 values (e.g., 1, 2, 4, 8)
Single submit: 2 values (true/false)

With the value counts above, this yields between 72 (2 × 3 × 2 × 3 × 2) and 320 (4 × 5 × 2 × 4 × 2) configurations. Each takes only seconds to evaluate, so exhaustive search is practical.
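The enumeration can be sketched in Python as follows. This is a minimal illustration, not the sweep tool's actual code: the dictionary keys are hypothetical labels, and the value lists are simply the examples given above. Using every example value listed gives 4 × 5 × 2 × 4 × 2 = 320 combinations.

```python
from itertools import product

# Hypothetical labels and example values; adjust both for the device under test.
SEARCH_SPACE = {
    "block_size": ["128K", "256K", "512K", "1M"],
    "queue_depth": [4, 8, 16, 32, 64],
    "overlap_mode": [True, False],
    "io_parallelism": [1, 2, 4, 8],
    "single_submit": [True, False],
}


def enumerate_configs(space):
    """Return one dict per point in the Cartesian product of all value lists."""
    keys = list(space)
    value_lists = [space[key] for key in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


configs = enumerate_configs(SEARCH_SPACE)
print(len(configs))  # 320
```

Each resulting dict describes one configuration to benchmark; the sweep then measures throughput for every entry in the list.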

Page Cache Interference

When benchmarking storage I/O, the operating system's page cache can mask true device performance by serving subsequent reads from DRAM. To obtain accurate measurements:

Run sync to flush pending writes to the device.
Drop the page cache (echo 1 > /proc/sys/vm/drop_caches, which requires root privileges) so that subsequent reads originate from the device.

Without cache flushing, reported read throughput may be 10-100x higher than actual device throughput, depending on how much test data fits in available DRAM.
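The two steps above can be sketched as a small Python helper to call before each read measurement. This is an assumption-laden sketch for Linux only: the function name is hypothetical, the process must run as root to write to /proc/sys/vm/drop_caches, and the path is a parameter only so the helper can be exercised against an ordinary file.

```python
import os


def drop_page_cache(drop_caches_path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the kernel to drop the clean page cache.

    Hypothetical helper: Linux only, and writing to the default path
    requires root privileges.
    """
    # Step 1: equivalent of the `sync` command; writes pending data to disk.
    os.sync()
    # Step 2: equivalent of `echo 1 > /proc/sys/vm/drop_caches`.
    with open(drop_caches_path, "w") as f:
        f.write("1\n")
```

Calling drop_page_cache() between the write phase and the read phase of a benchmark run keeps the read numbers from reflecting DRAM rather than the device.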

Read vs. Write Asymmetry

NVMe devices typically exhibit different performance characteristics for reads versus writes due to internal write amplification, garbage collection, and wear-leveling operations. Sweeping both read and write independently is necessary because optimal parameters may differ between the two operations.
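Selecting the best configuration separately for each operation can be sketched as below. The record layout (keys "op", "gb_per_sec", "config") is an assumption made for illustration, and the sample numbers are placeholders chosen only to exercise the function, not measurements.

```python
def best_per_operation(results):
    """Return the highest-throughput record for each operation type."""
    best = {}
    for record in results:
        op = record["op"]
        if op not in best or record["gb_per_sec"] > best[op]["gb_per_sec"]:
            best[op] = record
    return best


# Placeholder records (hypothetical layout and values, not real results).
sample_results = [
    {"op": "read", "gb_per_sec": 2.0, "config": {"queue_depth": 8}},
    {"op": "read", "gb_per_sec": 3.0, "config": {"queue_depth": 32}},
    {"op": "write", "gb_per_sec": 1.5, "config": {"queue_depth": 8}},
    {"op": "write", "gb_per_sec": 1.0, "config": {"queue_depth": 32}},
]

best = best_per_operation(sample_results)
print(best["read"]["config"], best["write"]["config"])
```

In this placeholder data the read and write winners use different queue depths, which is the situation that makes two independent sweeps worthwhile.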

Related Pages

Implementation:FMInference_FlexLLMGen_DeepSpeed_AIO_Perf_Sweep
