Principle:FMInference FlexLLMGen NVMe Disk Setup
| Field | Value |
|---|---|
| Sources | FlexGen, FlexLLMGen |
| Domains | Infrastructure, Storage |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An infrastructure preparation step that mounts NVMe SSDs as the disk tier in the three-tier memory hierarchy, providing high-bandwidth storage for tensor offloading.
Description
FlexLLMGen's three-tier offloading (GPU -> CPU -> Disk) requires a fast disk backend for the third tier. NVMe SSDs provide the necessary I/O bandwidth (typically 2-7 GB/s) compared to regular SATA SSDs (~500 MB/s). The setup involves formatting and mounting NVMe devices on cloud instances (AWS or GCP) to a dedicated offload directory. On AWS, a single NVMe device is formatted with ext4 and mounted. On GCP, multiple NVMe devices are combined into a RAID-0 array for maximum bandwidth.
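The AWS and GCP steps above can be sketched as shell commands. Device names (`/dev/nvme1n1`, `/dev/md0`, the four GCP local-SSD devices) and the mount point `/mnt/flexllmgen_offload` are assumptions that vary by instance type; check `lsblk` on your instance before running anything.

```shell
# AWS: format a single NVMe instance-store device with ext4 and mount it.
# (device name is an assumption; verify with `lsblk`)
sudo mkfs.ext4 /dev/nvme1n1
sudo mkdir -p /mnt/flexllmgen_offload
sudo mount /dev/nvme1n1 /mnt/flexllmgen_offload
sudo chown -R "$USER" /mnt/flexllmgen_offload

# GCP: combine several local NVMe SSDs into a RAID-0 array for
# aggregate bandwidth, then format and mount the array.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4
sudo mkfs.ext4 -F /dev/md0
sudo mkdir -p /mnt/flexllmgen_offload
sudo mount /dev/md0 /mnt/flexllmgen_offload
sudo chown -R "$USER" /mnt/flexllmgen_offload
```

Note that RAID-0 trades redundancy for bandwidth: losing any one device loses the array, which is acceptable here because the offload directory holds only transient tensor data.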
Usage
Required before any inference workflow that offloads tensors to disk, i.e., when the weight placement percentages satisfy w_gpu_percent + w_cpu_percent < 100 (or analogously for the KV cache and activations), so that the remainder spills to the disk tier. Not needed for all-GPU or GPU+CPU-only configurations.
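As a sketch of when the mount is exercised, the invocation below points FlexLLMGen's offload directory at the NVMe mount; the exact percentage values and the mount path are illustrative assumptions, and the six `--percent` values are the GPU/CPU splits for weights, KV cache, and activations in that order, with any remainder going to disk.

```shell
# 20% of weights on GPU, 30% on CPU -> remaining 50% offloads to the
# NVMe-backed directory below (path is an assumed mount point).
python3 -m flexgen.flex_opt \
    --model facebook/opt-175b \
    --percent 20 30 0 100 0 100 \
    --offload-dir /mnt/flexllmgen_offload
```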
Theoretical Basis
The throughput of the disk tier directly bounds inference latency when offloading is active. A single NVMe device over PCIe 3.0 x4 delivers roughly 3.5 GB/s of sequential bandwidth (PCIe 4.0 drives reach ~7 GB/s, matching the 2-7 GB/s range above). RAID-0 across multiple NVMe devices scales sequential bandwidth roughly linearly with device count, which matters for 175B-parameter models, where reading weights from disk is the dominant bottleneck.
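The bandwidth argument can be made concrete with back-of-envelope arithmetic; the figures below (FP16 weights, 3.5 GB/s per NVMe device, linear RAID-0 scaling) are assumptions taken from the text, not measurements.

```python
# Time to stream OPT-175B FP16 weights (~350 GB) from the disk tier
# at the bandwidths discussed above.
weights_gb = 175e9 * 2 / 1e9  # 175B params x 2 bytes (FP16) = 350 GB

def load_seconds(bandwidth_gb_s: float) -> float:
    """Seconds to read the full weight set at a given bandwidth."""
    return weights_gb / bandwidth_gb_s

sata = load_seconds(0.5)       # single SATA SSD (~500 MB/s)
nvme = load_seconds(3.5)       # single NVMe, PCIe 3.0 x4
raid0 = load_seconds(4 * 3.5)  # 4-way NVMe RAID-0, assumed linear scaling
print(f"SATA: {sata:.0f}s, NVMe: {nvme:.0f}s, RAID-0 x4: {raid0:.0f}s")
# -> SATA: 700s, NVMe: 100s, RAID-0 x4: 25s
```

The 28x gap between a SATA SSD and a 4-way NVMe RAID-0 array is why the setup insists on NVMe for the third tier.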