Workflow: FMInference/FlexLLMGen Single-GPU Offloaded Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Offloading, Throughput_Optimization |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
End-to-end process for running high-throughput generative inference of large OPT language models (up to 175B parameters) on a single GPU by offloading weights, KV cache, and activations across GPU, CPU, and disk.
Description
This workflow covers the primary use case of FlexLLMGen: running inference on OPT models that are too large to fit entirely in GPU memory. The system aggregates memory from three tiers (GPU, CPU DRAM, and NVMe/SSD disk) and uses a policy-based offloading strategy to distribute model weights, attention KV cache, and activations across these tiers. A block schedule overlaps I/O transfers with GPU computation to maximize throughput. Optional 4-bit group-wise quantization further reduces memory requirements, enabling even larger effective batch sizes and higher throughput on consumer hardware (e.g., a single NVIDIA T4 with 16GB VRAM).
Usage
Execute this workflow when you need to run batch inference with an OPT model (from OPT-1.3B to OPT-175B) on a machine with limited GPU memory but sufficient CPU DRAM and optional NVMe storage. This is designed for throughput-oriented workloads where latency is less critical, such as processing large document corpora, benchmarking, or batch classification tasks.
Execution Steps
Step 1: Environment Setup
Prepare the hardware and software environment. Install FlexLLMGen (via pip or from source), ensure PyTorch >= 1.12 is available, and if using disk offloading, mount a fast NVMe/SSD drive as the offload directory. On cloud instances (AWS or GCP), use the provided mount scripts to configure local NVMe drives.
Key considerations:
- NVMe SSD is recommended for disk offloading to achieve best I/O bandwidth
- The offload directory defaults to ~/flexllmgen_offload_dir
- On AWS, use scripts/mount_nvme_aws.sh; on GCP, use scripts/mount_nvme_gcp.sh (RAID-0 across 4 drives)
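A minimal pre-flight check can be sketched in Python (the directory name follows the default noted above; the free-space threshold and the helper function itself are illustrative assumptions, not part of FlexLLMGen):

```python
import os
import shutil

def prepare_offload_dir(path: str = "~/flexllmgen_offload_dir",
                        min_free_gb: float = 100.0) -> str:
    """Create the disk-offload directory and warn if free space looks low.

    Illustrative sketch only: the threshold is arbitrary; size the drive
    for the model you plan to offload (OPT-175B weights alone are ~350 GB
    in fp16).
    """
    path = os.path.expanduser(path)
    os.makedirs(path, exist_ok=True)
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < min_free_gb:
        print(f"warning: only {free_gb:.0f} GB free at {path}")
    return path
```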
Step 2: Configure Offloading Policy
Define the offloading policy that controls how model weights, attention KV cache, and activations are distributed across GPU, CPU, and disk. The policy is specified via six percentage values controlling the GPU and CPU allocation for each tensor type, with the remainder going to disk. Also configure batch size, number of GPU batches, and optional compression settings.
Key considerations:
- The --percent flag takes six integers: weight_gpu%, weight_cpu%, cache_gpu%, cache_cpu%, activation_gpu%, activation_cpu%
- The remainder for each pair (100 - gpu% - cpu%) is allocated to disk
- Larger batch sizes increase throughput but require more memory for KV cache
- Enable --compress-weight and --compress-cache for 4-bit quantization to reduce memory by approximately 4x
- Use --pin-weight 0 to reduce CPU memory usage at the cost of slower transfers
- The experimental cost model (experimental/cost_model.py) can help find optimal policies via linear programming
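The disk remainder implied by the six --percent values can be computed and validated with a small sketch (the `PercentPolicy` class here is illustrative and is not FlexLLMGen's own `Policy` dataclass):

```python
from dataclasses import dataclass

@dataclass
class PercentPolicy:
    """The six --percent integers, in flag order; the disk share for each
    tensor type is the implicit remainder 100 - gpu% - cpu%."""
    w_gpu: int
    w_cpu: int
    c_gpu: int
    c_cpu: int
    a_gpu: int
    a_cpu: int

    def disk_shares(self) -> dict:
        shares = {}
        pairs = {
            "weight": (self.w_gpu, self.w_cpu),
            "cache": (self.c_gpu, self.c_cpu),
            "activation": (self.a_gpu, self.a_cpu),
        }
        for name, (gpu, cpu) in pairs.items():
            if gpu + cpu > 100:
                raise ValueError(f"{name}: gpu% + cpu% exceeds 100")
            shares[name] = 100 - gpu - cpu
        return shares
```

For example, `--percent 20 80 0 50 100 0` keeps all weights off disk, puts half the KV cache on disk, and keeps activations fully on GPU.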
Step 3: Download and Prepare Model Weights
Obtain the OPT model weights. For models up to OPT-66B, FlexLLMGen automatically downloads weights from HuggingFace and converts them to NumPy format. For OPT-175B, weights must be manually downloaded from Meta's metaseq repository, consolidated from 992 FSDP shards into a single checkpoint, and then converted to individual NumPy files.
Key considerations:
- Weights are cached in ~/opt_weights by default (configurable via --path)
- The opt_config module handles automatic download and conversion for standard model sizes
- For OPT-175B: use scripts/step_2_consolidate_992_shards_to_singleton.py then scripts/step_3_convert_to_numpy_weights.py
- Weight format is one .npy file per tensor
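The per-tensor layout can be sketched as follows (assuming parameter names map directly to file names, which is an illustration of the format described above; the actual converter scripts are authoritative):

```python
import os
import numpy as np

def dump_numpy_weights(state_dict: dict, out_dir: str) -> None:
    """Write one .npy file per tensor, named after the parameter.

    Sketch of the one-file-per-tensor format; np.save appends the
    .npy extension automatically.
    """
    os.makedirs(out_dir, exist_ok=True)
    for name, tensor in state_dict.items():
        np.save(os.path.join(out_dir, name), np.asarray(tensor))
```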
Step 4: Initialize Execution Environment
Create the three-tier execution environment consisting of a TorchDevice (GPU), TorchDevice (CPU with pinned memory), and TorchDisk (for SSD offloading). This sets up CUDA streams for overlapped I/O and compute, copy threads for asynchronous disk transfers, and memory pools for each device tier.
Key considerations:
- ExecutionEnv.create() initializes GPU, CPU, and Disk devices with copy thread pools
- CUDA streams are used for concurrent weight loading, cache reading/writing, and computation
- The disk device uses background threads for asynchronous file I/O
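The copy-thread idea can be illustrated with a toy background writer built on a queue (this stands in for, and is far simpler than, the actual TorchDisk I/O machinery):

```python
import queue
import threading

class AsyncDiskWriter:
    """Toy stand-in for a disk tier's copy thread: writes are queued and
    performed off the critical path, so 'compute' can continue."""

    def __init__(self):
        self.q = queue.Queue()
        self.t = threading.Thread(target=self._loop, daemon=True)
        self.t.start()

    def _loop(self):
        while True:
            item = self.q.get()
            if item is None:  # shutdown sentinel
                break
            path, data = item
            with open(path, "wb") as f:
                f.write(data)

    def submit(self, path: str, data: bytes) -> None:
        self.q.put((path, data))

    def close(self) -> None:
        """Drain pending writes, then stop the thread (mirrors the need to
        shut the environment down cleanly, as in Step 7)."""
        self.q.put(None)
        self.t.join()
```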
Step 5: Load Model with Offloading
Instantiate the OptLM model, which creates layer objects (InputEmbed, SelfAttention, MLP, OutputEmbed) and distributes their weights across GPU, CPU, and disk according to the offloading policy. Each layer's weights are placed on the appropriate device tier based on the cumulative percentage thresholds.
Key considerations:
- The Policy dataclass determines weight placement per layer
- Within each layer, individual weight tensors are assigned to tiers by cumulative percentage thresholds, so a single layer's weights can span GPU, CPU, and disk
- When sep_layer is enabled (default), attention and MLP are treated as separate layers for finer-grained scheduling
- The model supports OPT architectures from 125M to 175B parameters
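The cumulative-threshold placement can be sketched with an illustrative function (not FlexLLMGen's actual placement code): the first gpu% of a layer's tensors land on GPU, the next cpu% on CPU, and the remainder on disk.

```python
def choose_device(index: int, total: int, gpu_pct: int, cpu_pct: int) -> str:
    """Pick a tier for tensor `index` of `total` by cumulative percentage.

    Using the tensor's midpoint position avoids ambiguity at exact
    threshold boundaries.
    """
    frac = (index + 0.5) / total * 100
    if frac <= gpu_pct:
        return "gpu"
    if frac <= gpu_pct + cpu_pct:
        return "cpu"
    return "disk"

# 10 tensors with a 20/50 weight policy -> 2 on GPU, 5 on CPU, 3 on disk
placement = [choose_device(i, 10, 20, 50) for i in range(10)]
```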
Step 6: Run Generation with Block Schedule
Execute the generation loop using the block schedule that iterates over (generation_step, layer, gpu_batch) dimensions. Three execution strategies are available depending on the overlap setting and number of GPU batches: normal (no overlap), overlap_single_batch, and overlap_multi_batch. The overlapped strategies pipeline weight loading, cache I/O, and computation across CUDA streams.
Key considerations:
- The overlap_multi_batch strategy provides the best throughput by pipelining across multiple micro-batches
- Each generation step processes all layers sequentially, loading weights and cache on-the-fly
- Prefill (processing the full prompt) runs first, followed by autoregressive token generation
- The generation API supports sampling parameters (temperature, do_sample) and stop tokens
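The three loop dimensions of the block schedule can be sketched as a generator (illustrative only; the real overlapped runners additionally interleave prefetch of the next cell's weights and cache with computation of the current one across CUDA streams):

```python
def block_schedule(gen_len: int, num_layers: int, num_gpu_batches: int):
    """Yield (token_step, layer, gpu_batch) in visiting order: for each
    generated token, every layer runs over every GPU micro-batch before
    the schedule advances to the next layer."""
    for step in range(gen_len):
        for layer in range(num_layers):
            for batch in range(num_gpu_batches):
                yield step, layer, batch
```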
Step 7: Collect Results and Shutdown
Decode the generated output token IDs back to text, collect performance metrics (prefill latency, decode throughput, GPU peak memory), and close the execution environment's copy threads to release resources.
Key considerations:
- Throughput is measured as generated tokens per second over the total runtime
- Performance logs are written for benchmarking purposes
- The environment must be properly shut down to release copy threads and CUDA resources
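The total-runtime throughput metric described above reduces to a one-line sketch (effective batch size here means gpu_batch_size times num_gpu_batches):

```python
def generation_throughput(batch_size: int, gen_len: int, total_s: float) -> float:
    """Generated tokens per second over the whole run (prefill + decode)."""
    return batch_size * gen_len / total_s
```

For example, generating 32 tokens for each of 64 sequences in 8 seconds yields 256 tokens/s.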