Principle:FMInference FlexLLMGen NVMe Disk Setup
| Field | Value |
|---|---|
| Sources | FlexGen, FlexLLMGen |
| Domains | Infrastructure, Storage |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An infrastructure preparation step that mounts NVMe SSDs as the disk tier in the three-tier memory hierarchy, providing high-bandwidth storage for tensor offloading.
Description
FlexLLMGen's three-tier offloading (GPU -> CPU -> Disk) requires a fast disk backend for the third tier. NVMe SSDs provide the necessary I/O bandwidth (typically 2-7 GB/s) compared to regular SATA SSDs (~500 MB/s). The setup involves formatting and mounting NVMe devices on cloud instances (AWS or GCP) to a dedicated offload directory. On AWS, a single NVMe device is formatted with ext4 and mounted. On GCP, multiple NVMe devices are combined into a RAID-0 array for maximum bandwidth.
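The AWS and GCP steps above can be sketched as shell commands. Device names (`/dev/nvme1n1`, `/dev/md0`, the four GCP local-SSD devices) and the mount point `/mnt/flexllmgen_offload` are assumptions that vary by instance type; check `lsblk` on your instance before running anything.

```shell
# AWS: format a single NVMe instance-store device with ext4 and mount it.
# (device name is an assumption; verify with `lsblk`)
sudo mkfs.ext4 /dev/nvme1n1
sudo mkdir -p /mnt/flexllmgen_offload
sudo mount /dev/nvme1n1 /mnt/flexllmgen_offload
sudo chown -R "$USER" /mnt/flexllmgen_offload

# GCP: combine several local NVMe SSDs into a RAID-0 array for
# aggregate bandwidth, then format and mount the array.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4
sudo mkfs.ext4 -F /dev/md0
sudo mkdir -p /mnt/flexllmgen_offload
sudo mount /dev/md0 /mnt/flexllmgen_offload
sudo chown -R "$USER" /mnt/flexllmgen_offload
```

Note that RAID-0 trades redundancy for bandwidth: losing any one device loses the array, which is acceptable here because the offload directory holds only transient tensor data.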
Usage
Required before any inference workflow that offloads tensors to disk, i.e., when the weight placement percentages satisfy w_gpu_percent + w_cpu_percent < 100 (or analogously for the KV cache and activations), so that the remainder spills to the disk tier. Not needed for all-GPU or GPU+CPU-only configurations.
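As a sketch of when the mount is exercised, the invocation below points FlexLLMGen's offload directory at the NVMe mount; the exact percentage values and the mount path are illustrative assumptions, and the six `--percent` values are the GPU/CPU splits for weights, KV cache, and activations in that order, with any remainder going to disk.

```shell
# 20% of weights on GPU, 30% on CPU -> remaining 50% offloads to the
# NVMe-backed directory below (path is an assumed mount point).
python3 -m flexgen.flex_opt \
    --model facebook/opt-175b \
    --percent 20 30 0 100 0 100 \
    --offload-dir /mnt/flexllmgen_offload
```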
Theoretical Basis
The throughput of the disk tier directly bounds inference latency when offloading is active. A single NVMe device over PCIe 3.0 x4 delivers roughly 3.5 GB/s of sequential bandwidth (PCIe 4.0 drives reach ~7 GB/s, matching the 2-7 GB/s range above). RAID-0 across multiple NVMe devices scales sequential bandwidth roughly linearly with device count, which matters for 175B-parameter models, where reading weights from disk is the dominant bottleneck.
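The bandwidth argument can be made concrete with back-of-envelope arithmetic; the figures below (FP16 weights, 3.5 GB/s per NVMe device, linear RAID-0 scaling) are assumptions taken from the text, not measurements.

```python
# Time to stream OPT-175B FP16 weights (~350 GB) from the disk tier
# at the bandwidths discussed above.
weights_gb = 175e9 * 2 / 1e9  # 175B params x 2 bytes (FP16) = 350 GB

def load_seconds(bandwidth_gb_s: float) -> float:
    """Seconds to read the full weight set at a given bandwidth."""
    return weights_gb / bandwidth_gb_s

sata = load_seconds(0.5)       # single SATA SSD (~500 MB/s)
nvme = load_seconds(3.5)       # single NVMe, PCIe 3.0 x4
raid0 = load_seconds(4 * 3.5)  # 4-way NVMe RAID-0, assumed linear scaling
print(f"SATA: {sata:.0f}s, NVMe: {nvme:.0f}s, RAID-0 x4: {raid0:.0f}s")
# -> SATA: 700s, NVMe: 100s, RAID-0 x4: 25s
```

The 28x gap between a SATA SSD and a 4-way NVMe RAID-0 array is why the setup insists on NVMe for the third tier.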