Workflow: FMInference/FlexLLMGen Single-GPU Offloaded Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Offloading, Throughput_Optimization |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end process for running high-throughput generative inference of large OPT language models (up to 175B parameters) on a single GPU by offloading weights, KV cache, and activations across GPU, CPU, and disk.
Description
This workflow covers the primary use case of FlexLLMGen: running inference on OPT models that are too large to fit entirely in GPU memory. The system aggregates memory from three tiers (GPU, CPU DRAM, and NVMe/SSD disk) and uses a policy-based offloading strategy to distribute model weights, attention KV cache, and activations across these tiers. A block schedule overlaps I/O transfers with GPU computation to maximize throughput. Optional 4-bit group-wise quantization further reduces memory requirements, enabling even larger effective batch sizes and higher throughput on consumer hardware (e.g., a single NVIDIA T4 with 16GB VRAM).
Usage
Execute this workflow when you need to run batch inference with an OPT model (from OPT-1.3B to OPT-175B) on a machine with limited GPU memory but sufficient CPU DRAM and optional NVMe storage. This is designed for throughput-oriented workloads where latency is less critical, such as processing large document corpora, benchmarking, or batch classification tasks.
Execution Steps
Step 1: Environment Setup
Prepare the hardware and software environment. Install FlexLLMGen (via pip or from source), ensure PyTorch >= 1.12 is available, and if using disk offloading, mount a fast NVMe/SSD drive as the offload directory. On cloud instances (AWS or GCP), use the provided mount scripts to configure local NVMe drives.
Key considerations:
- NVMe SSD is recommended for disk offloading to achieve best I/O bandwidth
- The offload directory defaults to ~/flexllmgen_offload_dir
- On AWS, use scripts/mount_nvme_aws.sh; on GCP, use scripts/mount_nvme_gcp.sh (RAID-0 across 4 drives)
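A minimal pre-flight check can be sketched in Python (the directory name follows the default noted above; the free-space threshold and the helper function itself are illustrative assumptions, not part of FlexLLMGen):

```python
import os
import shutil

def prepare_offload_dir(path: str = "~/flexllmgen_offload_dir",
                        min_free_gb: float = 100.0) -> str:
    """Create the disk-offload directory and warn if free space looks low.

    Illustrative sketch only: the threshold is arbitrary; size the drive
    for the model you plan to offload (OPT-175B weights alone are ~350 GB
    in fp16).
    """
    path = os.path.expanduser(path)
    os.makedirs(path, exist_ok=True)
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        print(f"warning: only {free_gb:.0f} GB free at {path}")
    return path
```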
Step 2: Configure Offloading Policy
Define the offloading policy that controls how model weights, attention KV cache, and activations are distributed across GPU, CPU, and disk. The policy is specified via six percentage values controlling the GPU and CPU allocation for each tensor type, with the remainder going to disk. Also configure batch size, number of GPU batches, and optional compression settings.
Key considerations:
- The --percent flag takes six integers: weight_gpu%, weight_cpu%, cache_gpu%, cache_cpu%, activation_gpu%, activation_cpu%
- The remainder for each pair (100 - gpu% - cpu%) is allocated to disk
- Larger batch sizes increase throughput but require more memory for KV cache
- Enable --compress-weight and --compress-cache for 4-bit quantization to reduce memory by approximately 4x
- Use --pin-weight 0 to reduce CPU memory usage at the cost of slower transfers
- The experimental cost model (experimental/cost_model.py) can help find optimal policies via linear programming
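The disk remainder implied by the six --percent values can be computed and validated with a small sketch (the `PercentPolicy` class here is illustrative and is not FlexLLMGen's own `Policy` dataclass):

```python
from dataclasses import dataclass

@dataclass
class PercentPolicy:
    """The six --percent integers, in flag order; the disk share for each
    tensor type is the implicit remainder 100 - gpu% - cpu%."""
    w_gpu: int
    w_cpu: int
    c_gpu: int
    c_cpu: int
    a_gpu: int
    a_cpu: int

    def disk_shares(self) -> dict:
        shares = {}
        pairs = {
            "weight": (self.w_gpu, self.w_cpu),
            "cache": (self.c_gpu, self.c_cpu),
            "activation": (self.a_gpu, self.a_cpu),
        }
        for name, (gpu, cpu) in pairs.items():
            if gpu + cpu > 100:
                raise ValueError(f"{name}: gpu% + cpu% exceeds 100")
            shares[name] = 100 - gpu - cpu
        return shares
```

For example, `--percent 20 80 0 50 100 0` keeps all weights off disk, puts half the KV cache on disk, and keeps activations fully on GPU.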
Step 3: Download and Prepare Model Weights
Obtain the OPT model weights. For models up to OPT-66B, FlexLLMGen automatically downloads weights from HuggingFace and converts them to NumPy format. For OPT-175B, weights must be manually downloaded from Meta's metaseq repository, consolidated from 992 FSDP shards into a single checkpoint, and then converted to individual NumPy files.
Key considerations:
- Weights are cached in ~/opt_weights by default (configurable via --path)
- The opt_config module handles automatic download and conversion for standard model sizes
- For OPT-175B: use scripts/step_2_consolidate_992_shards_to_singleton.py then scripts/step_3_convert_to_numpy_weights.py
- Weight format is one .npy file per tensor
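The per-tensor layout can be sketched as follows (assuming parameter names map directly to file names, which is an illustration of the format described above; the actual converter scripts are authoritative):

```python
import os
import numpy as np

def dump_numpy_weights(state_dict: dict, out_dir: str) -> None:
    """Write one .npy file per tensor, named after the parameter.

    Sketch of the one-file-per-tensor format; np.save appends the
    .npy extension automatically.
    """
    os.makedirs(out_dir, exist_ok=True)
    for name, tensor in state_dict.items():
        np.save(os.path.join(out_dir, name), np.asarray(tensor))
```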
Step 4: Initialize Execution Environment
Create the three-tier execution environment consisting of a TorchDevice (GPU), TorchDevice (CPU with pinned memory), and TorchDisk (for SSD offloading). This sets up CUDA streams for overlapped I/O and compute, copy threads for asynchronous disk transfers, and memory pools for each device tier.
Key considerations:
- ExecutionEnv.create() initializes GPU, CPU, and Disk devices with copy thread pools
- CUDA streams are used for concurrent weight loading, cache reading/writing, and computation
- The disk device uses background threads for asynchronous file I/O
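The copy-thread idea can be illustrated with a toy background writer built on a queue (this stands in for, and is far simpler than, the actual TorchDisk I/O machinery):

```python
import queue
import threading

class AsyncDiskWriter:
    """Toy stand-in for a disk tier's copy thread: writes are queued and
    performed off the critical path, so 'compute' can continue."""

    def __init__(self):
        self.q = queue.Queue()
        self.t = threading.Thread(target=self._loop, daemon=True)
        self.t.start()

    def _loop(self):
        while True:
            item = self.q.get()
            if item is None:  # shutdown sentinel
                break
            path, data = item
            with open(path, "wb") as f:
                f.write(data)

    def submit(self, path: str, data: bytes) -> None:
        self.q.put((path, data))

    def close(self) -> None:
        """Drain pending writes, then stop the thread (mirrors the need to
        shut the environment down cleanly, as in Step 7)."""
        self.q.put(None)
        self.t.join()
```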
Step 5: Load Model with Offloading
Instantiate the OptLM model, which creates layer objects (InputEmbed, SelfAttention, MLP, OutputEmbed) and distributes their weights across GPU, CPU, and disk according to the offloading policy. Each layer's weights are placed on the appropriate device tier based on the cumulative percentage thresholds.
Key considerations:
- The Policy dataclass determines weight placement per layer
- Within each layer, individual weight tensors are assigned to tiers by cumulative percentage thresholds, so a single layer's weights can span GPU, CPU, and disk
- When sep_layer is enabled (default), attention and MLP are treated as separate layers for finer-grained scheduling
- The model supports OPT architectures from 125M to 175B parameters
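The cumulative-threshold placement can be sketched with an illustrative function (not FlexLLMGen's actual placement code): the first gpu% of a layer's tensors land on GPU, the next cpu% on CPU, and the remainder on disk.

```python
def choose_device(index: int, total: int, gpu_pct: int, cpu_pct: int) -> str:
    """Pick a tier for tensor `index` of `total` by cumulative percentage.

    Using the tensor's midpoint position avoids ambiguity at exact
    threshold boundaries.
    """
    frac = (index + 0.5) / total * 100
    if frac <= gpu_pct:
        return "gpu"
    if frac <= gpu_pct + cpu_pct:
        return "cpu"
    return "disk"

# 10 tensors with a 20/50 weight policy -> 2 on GPU, 5 on CPU, 3 on disk
placement = [choose_device(i, 10, 20, 50) for i in range(10)]
```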
Step 6: Run Generation with Block Schedule
Execute the generation loop using the block schedule that iterates over (generation_step, layer, gpu_batch) dimensions. Three execution strategies are available depending on the overlap setting and number of GPU batches: normal (no overlap), overlap_single_batch, and overlap_multi_batch. The overlapped strategies pipeline weight loading, cache I/O, and computation across CUDA streams.
Key considerations:
- The overlap_multi_batch strategy provides the best throughput by pipelining across multiple micro-batches
- Each generation step processes all layers sequentially, loading weights and cache on-the-fly
- Prefill (processing the full prompt) runs first, followed by autoregressive token generation
- The generation API supports sampling parameters (temperature, do_sample) and stop tokens
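The three loop dimensions of the block schedule can be sketched as a generator (illustrative only; the real overlapped runners additionally interleave prefetch of the next cell's weights and cache with computation of the current one across CUDA streams):

```python
def block_schedule(gen_len: int, num_layers: int, num_gpu_batches: int):
    """Yield (token_step, layer, gpu_batch) in visiting order: for each
    generated token, every layer runs over every GPU micro-batch before
    the schedule advances to the next layer."""
    for step in range(gen_len):
        for layer in range(num_layers):
            for batch in range(num_gpu_batches):
                yield step, layer, batch
```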
Step 7: Collect Results and Shutdown
Decode the generated output token IDs back to text, collect performance metrics (prefill latency, decode throughput, GPU peak memory), and close the execution environment's copy threads to release resources.
Key considerations:
- Throughput is measured as generated tokens per second over the total runtime
- Performance logs are written for benchmarking purposes
- The environment must be properly shut down to release copy threads and CUDA resources
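The total-runtime throughput metric described above reduces to a one-line sketch (effective batch size here means gpu_batch_size times num_gpu_batches):

```python
def generation_throughput(batch_size: int, gen_len: int, total_s: float) -> float:
    """Generated tokens per second over the whole run (prefill + decode)."""
    return batch_size * gen_len / total_s
```

For example, generating 32 tokens for each of 64 sequences in 8 seconds yields 256 tokens/s.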