# Heuristic:FMInference FlexLLMGen OOM Memory Management
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Inference |
| Last Updated | 2026-02-09 12:00 GMT |
## Overview

Three-step strategy for handling GPU/CPU out-of-memory errors during FlexLLMGen inference: unpin weights, enable compression, and offload everything to disk.
## Description

FlexLLMGen's offloading engine distributes weights, KV cache, and activations across GPU, CPU, and disk. When available memory is insufficient, users encounter OOM errors. The README documents three progressive memory-saving strategies, each trading throughput for reduced memory usage. These strategies can be combined and should be applied in order from least to most aggressive.
## Usage

Use this heuristic when encountering CUDA out-of-memory or CPU memory exhaustion during FlexLLMGen inference. Apply the strategies progressively: start with unpinning weights, then add compression, and finally offload everything to disk. Each step saves more memory but reduces throughput.
## The Insight (Rule of Thumb)

- Strategy 1 - Unpin weights: Add `--pin-weight 0`. This reduces CPU weight memory usage by around 20% or more. Trade-off: slightly slower CPU-to-GPU transfers, because non-pinned memory requires an extra copy through a pinned relay buffer.
- Strategy 2 - Enable weight compression: Add `--compress-weight`. This reduces weight memory usage by around 70% via 4-bit group-wise quantization. Trade-off: decompression overhead during inference, with negligible accuracy loss.
- Strategy 3 - Full disk offload: Use `--percent 0 0 100 0 100 0`. This offloads all weights to disk, requiring very little CPU and GPU memory. Trade-off: significantly slower, since disk I/O becomes the bottleneck.
- Combine strategies: All three can be used together for maximum memory savings on the most constrained hardware.
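The progressive escalation above can be sketched as a simple fallback loop. This is an illustrative sketch, not FlexLLMGen API: `launch` is a hypothetical stand-in for invoking FlexLLMGen with extra command-line flags, and `MemoryError` stands in for a CUDA/CPU OOM failure. Only the flags themselves come from the README.

```python
# Illustrative sketch: apply the README's three memory-saving strategies
# in order, escalating only when the previous attempt runs out of memory.
# `launch` is a hypothetical callable that runs FlexLLMGen with the given
# command; MemoryError stands in for a CUDA/CPU OOM.

BASE_CMD = ["python3", "-m", "flexllmgen.flex_opt", "--model", "facebook/opt-30b"]

# Least to most aggressive, as recommended by the README FAQ.
STRATEGIES = [
    [],                                          # 0: no memory-saving flags
    ["--pin-weight", "0"],                       # 1: unpin weights (~20% CPU savings)
    ["--pin-weight", "0", "--compress-weight"],  # 2: + 4-bit compression (~70%)
    ["--pin-weight", "0", "--compress-weight",   # 3: + offload all weights to disk
     "--percent", "0", "0", "100", "0", "100", "0"],
]

def run_with_fallback(launch):
    """Try each flag set in order; return the first one that succeeds."""
    for extra_flags in STRATEGIES:
        try:
            launch(BASE_CMD + extra_flags)
            return extra_flags
        except MemoryError:
            continue  # escalate to the next, more aggressive strategy
    raise RuntimeError("Still OOM even with full disk offload")
```

In practice users would rerun the command by hand with the next flag set rather than loop programmatically; the sketch just makes the "least to most aggressive" ordering explicit.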
## Reasoning

FlexLLMGen's memory hierarchy has three tiers with different capacity/bandwidth trade-offs:

- GPU VRAM: fastest but smallest (typically 16-80 GB).
- CPU DRAM: medium speed, larger capacity (typically 64-256 GB). Pinned (page-locked) buffers cannot be swapped out, so pinning weights increases the resident CPU memory footprint.
- NVMe disk: slowest but virtually unlimited capacity (typically 1-4 TB SSD).

The README FAQ section explicitly documents these strategies based on the developers' empirical experience. Strategy 1 (unpin) removes the resident-memory overhead of pinned allocations. Strategy 2 (compression) leverages the 4-bit group-wise quantization engine built into FlexLLMGen. Strategy 3 (full disk offload) is the last resort, trading all bandwidth for capacity.
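The ~70% figure for weight compression is consistent with back-of-the-envelope arithmetic for 4-bit group-wise quantization. The group size (64) and fp16 scale/zero-point per group below are illustrative assumptions, not FlexLLMGen's exact storage layout:

```python
# Back-of-the-envelope check of the ~70% weight-memory reduction claimed
# for --compress-weight. Group size and per-group metadata format are
# illustrative assumptions, not FlexLLMGen's exact layout.

FP16_BITS = 16          # uncompressed weights are stored in fp16
QUANT_BITS = 4          # 4-bit group-wise quantization
GROUP_SIZE = 64         # assumed number of weights sharing one scale/zero-point
OVERHEAD_BITS = 2 * 16  # one fp16 scale + one fp16 zero-point per group

bits_per_weight = QUANT_BITS + OVERHEAD_BITS / GROUP_SIZE  # 4.5 bits/weight
reduction = 1 - bits_per_weight / FP16_BITS                # ~0.72

print(f"{bits_per_weight} bits/weight, {reduction:.0%} reduction")  # 4.5 bits/weight, 72% reduction
```

Under these assumptions the per-group metadata adds only 0.5 bits per weight, so the effective size drops from 16 to 4.5 bits per weight, matching the "around 70%" claim in the README.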
## Code Evidence

README FAQ section at README.md:163-169:

```markdown
#### How to handle out-of-memory?
If you do not have enough GPU/CPU memory, here are a few things you can try.
They save more memory but run slower.

- Do not pin weights by adding `--pin-weight 0`. This can reduce the weight memory
  usage on CPU by around 20% or more.
- Enable weight compression by adding `--compress-weight`. This can reduce the
  weight memory usage by around 70%.
- Offload all weights to disk by using `--percent 0 0 100 0 100 0`. This requires
  very little CPU and GPU memory.
```
Pin-weight default (`True`) in flexllmgen/flex_opt.py:1302-1303:

```python
parser.add_argument("--pin-weight", type=str2bool, nargs="?",
    const=True, default=True)
```
Pinned-memory relay for non-pinned CPU tensors in flexllmgen/pytorch_backend.py:845-849:

```python
elif (src.device.device_type == DeviceType.CPU and
      dst.device.device_type == DeviceType.CUDA and
      not src.data.is_pinned()):
    # The cpu tensor is not pinned, use pin_memory as a relay
    src = src.pin_memory()
```
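The relay branch above explains the Strategy 1 trade-off: with `--pin-weight 0`, every CPU-to-GPU weight transfer first stages the data through a pinned buffer. A torch-free sketch of that decision (device names and the `pinned` flag are illustrative stand-ins, not FlexLLMGen's types):

```python
# Torch-free sketch of the copy-path decision shown above: a CPU->GPU
# transfer from non-pinned memory is staged through a pinned relay buffer
# (the extra copy that makes --pin-weight 0 slightly slower), while pinned
# sources and all other paths copy directly. Names are illustrative.

def copy_path(src_device: str, dst_device: str, src_pinned: bool) -> str:
    """Return which transfer path a src->dst copy would take."""
    if src_device == "cpu" and dst_device == "cuda" and not src_pinned:
        # Mirrors `src = src.pin_memory()` in pytorch_backend.py: stage
        # the data in page-locked memory, then DMA it to the GPU.
        return "relay-through-pinned-buffer"
    return "direct"
```

This is why unpinning saves resident CPU memory at the cost of one extra host-side copy per transfer, rather than breaking transfers entirely.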