
Principle:FMInference FlexLLMGen NVMe Disk Setup

From Leeroopedia


Sources: FlexGen, FlexLLMGen
Domains: Infrastructure, Storage
Last Updated: 2026-02-09 00:00 GMT

Overview

An infrastructure preparation step that mounts NVMe SSDs as the disk tier in the three-tier memory hierarchy, providing high-bandwidth storage for tensor offloading.

Description

FlexLLMGen's three-tier offloading hierarchy (GPU -> CPU -> disk) requires a fast backend for the third tier. NVMe SSDs supply the necessary sequential I/O bandwidth (typically 2-7 GB/s, versus roughly 500 MB/s for SATA SSDs). Setup consists of formatting and mounting the NVMe devices of a cloud instance (AWS or GCP) at a dedicated offload directory. On AWS, a single NVMe device is formatted with ext4 and mounted; on GCP, multiple NVMe devices are combined into a RAID-0 array to maximize bandwidth.
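
The steps above can be sketched as shell commands. This is a minimal sketch, not the project's official setup script; the device names (`/dev/nvme1n1` on AWS, `/dev/nvme0n1`-`/dev/nvme0n4` on GCP) and the mount point `/mnt/offload` are assumptions that must be checked against `lsblk` on the actual instance. The commands are destructive (they wipe the target devices) and require root.

```shell
# --- AWS: single instance-store NVMe device, ext4, mounted directly ---
# Device name is an assumption; verify with `lsblk` before formatting.
sudo mkfs.ext4 /dev/nvme1n1
sudo mkdir -p /mnt/offload
sudo mount /dev/nvme1n1 /mnt/offload
sudo chown "$USER" /mnt/offload

# --- GCP: stripe several local NVMe devices into a RAID-0 array ---
# Four devices assumed; adjust --raid-devices and the device list to match.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4
sudo mkfs.ext4 -F /dev/md0
sudo mkdir -p /mnt/offload
sudo mount /dev/md0 /mnt/offload
sudo chown "$USER" /mnt/offload
```

The offload directory (`/mnt/offload` here) is then passed to the inference process as its disk-tier path.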

Usage

Required before any inference workflow that offloads tensors to disk, i.e., when the weight placement percentages satisfy w_gpu_percent + w_cpu_percent < 100 (or the analogous condition for the KV cache or activations). Not needed for all-GPU or GPU+CPU-only configurations.
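
The offload condition can be checked with plain shell arithmetic. The variable names and the example split (20% of weights on GPU, 40% on CPU, so 40% spills to disk) are illustrative, not taken from the FlexLLMGen CLI:

```shell
# Example weight placement: 20% on GPU, 40% on CPU -> 40% must go to disk.
w_gpu_percent=20
w_cpu_percent=40

if [ $((w_gpu_percent + w_cpu_percent)) -lt 100 ]; then
  echo "disk tier required: prepare the NVMe offload directory"
else
  echo "no disk tier needed"
fi
```

With these values the sum is 60 < 100, so the NVMe setup above is a prerequisite.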

Theoretical Basis

The throughput of the disk tier directly impacts inference latency when offloading is active. NVMe over PCIe provides roughly 3.5 GB/s of sequential bandwidth per device, and RAID-0 striping scales aggregate bandwidth roughly linearly with the number of devices. This matters most for 175B-parameter models, where disk I/O is the bottleneck.
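
As a back-of-envelope sketch of the linear scaling claim (device count and per-device bandwidth are assumed figures, not measurements):

```shell
# Aggregate RAID-0 bandwidth = number of devices * per-device bandwidth.
# 4 devices at ~3.5 GB/s each (assumed values) -> ~14 GB/s aggregate.
aggregate=$(awk -v n=4 -v bw=3.5 'BEGIN { printf "%.1f", n * bw }')
echo "$aggregate GB/s aggregate"
```

At ~14 GB/s, streaming the ~350 GB of fp16 weights for a 175B-parameter model from disk takes on the order of 25 seconds per full pass, versus over a minute and a half on a single device.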
