
Heuristic:FMInference FlexLLMGen OOM Memory Management

From Leeroopedia



Knowledge Sources

  • Domains: Optimization, LLM_Inference
  • Last Updated: 2026-02-09 12:00 GMT

Overview

A three-step strategy for handling GPU/CPU out-of-memory (OOM) errors during FlexLLMGen inference: unpin weights, enable weight compression, and offload all weights to disk.

Description

FlexLLMGen's offloading engine distributes weights, KV cache, and activations across GPU, CPU, and disk. When available memory is insufficient, users encounter OOM errors. The README documents three progressive memory-saving strategies, each trading throughput for reduced memory usage. These strategies can be combined and should be applied in order from least to most aggressive.

Usage

Use this heuristic when encountering CUDA out of memory or CPU memory exhaustion during FlexLLMGen inference. Apply the strategies progressively: start with unpinning weights, then add compression, and finally offload everything to disk. Each step saves more memory but reduces throughput.

The Insight (Rule of Thumb)

  • Strategy 1 - Unpin weights: Add --pin-weight 0. This reduces CPU weight memory usage by around 20% or more. Trade-off: slightly slower CPU-to-GPU transfers because non-pinned memory requires an extra copy through a pinned relay buffer.
  • Strategy 2 - Enable weight compression: Add --compress-weight. This reduces weight memory usage by around 70% via 4-bit group-wise quantization. Trade-off: decompression overhead during inference, negligible accuracy loss.
  • Strategy 3 - Full disk offload: Use --percent 0 0 100 0 100 0. The six numbers are the GPU and CPU percentages for weights, KV cache, and activations, respectively; any remainder goes to disk, so 0 0 for weights places them entirely on disk. This requires very little CPU and GPU memory. Trade-off: significantly slower due to the disk I/O bottleneck.
  • Combine strategies: All three can be used together for maximum memory savings on the most constrained hardware.
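
The progression above can be sketched as a sequence of invocations. This is an illustrative sketch: the model choice (facebook/opt-30b) is an assumption for demonstration, not from the source; only the three flags themselves come from the README.

```shell
# Strategy 1: unpin weights (~20%+ less CPU weight memory, slower CPU->GPU copies)
python3 -m flexllmgen.flex_opt --model facebook/opt-30b --pin-weight 0

# Strategies 1+2: add 4-bit weight compression (~70% less weight memory)
python3 -m flexllmgen.flex_opt --model facebook/opt-30b --pin-weight 0 --compress-weight

# Strategies 1+2+3: also offload all weights to disk (minimal CPU/GPU memory, disk-I/O bound)
python3 -m flexllmgen.flex_opt --model facebook/opt-30b --pin-weight 0 --compress-weight \
    --percent 0 0 100 0 100 0
```

Each command keeps the flags of the previous one, matching the "least to most aggressive" ordering.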

Reasoning

FlexLLMGen's memory hierarchy has three tiers with different capacity/bandwidth trade-offs:

  • GPU VRAM: Fastest but smallest (typically 16-80GB).
  • CPU DRAM: Medium speed, larger capacity (typically 64-256GB). Pinned (page-locked) memory increases the CPU footprint because page-locked buffers cannot be swapped out, and pinning a tensor allocates an additional page-locked copy.
  • NVMe Disk: Slowest tier but largest capacity (typically 1-4TB SSD).
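
A back-of-envelope estimate shows why Strategy 2 matters at this scale. The parameter count and fp16 byte width are illustrative assumptions; the ~70% reduction figure is the README's claim.

```python
# Rough weight-memory estimate for a 30B-parameter model (sizes are assumptions)
PARAMS = 30e9        # approximate parameter count of OPT-30B
BYTES_FP16 = 2       # fp16 weights use 2 bytes per parameter

fp16_gb = PARAMS * BYTES_FP16 / 1e9        # raw fp16 weights: ~60 GB
compressed_gb = fp16_gb * (1 - 0.70)       # README's ~70% reduction: ~18 GB

print(f"fp16 weights:     {fp16_gb:.0f} GB")
print(f"4-bit compressed: {compressed_gb:.0f} GB")
```

Under these assumptions, compression alone moves the weights from well beyond a single consumer GPU's VRAM to a size that fits in modest CPU DRAM.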

The README FAQ section explicitly documents these strategies based on the developers' empirical experience. Strategy 1 (unpin) addresses the specific overhead of pinned memory allocation. Strategy 2 (compression) leverages the 4-bit quantization engine built into FlexLLMGen. Strategy 3 (full disk offload) is the last resort that trades all bandwidth for capacity.

Code Evidence

README FAQ section at README.md:163-169:

#### How to handle out-of-memory?
If you do not have enough GPU/CPU memory, here are a few things you can try.
They save more memory but run slower.

- Do not pin weights by adding `--pin-weight 0`. This can reduce the weight memory
  usage on CPU by around 20% or more.
- Enable weight compression by adding `--compress-weight`. This can reduce the
  weight memory usage by around 70%.
- Offload all weights to disk by using `--percent 0 0 100 0 100 0`. This requires
  very little CPU and GPU memory.

Pin-weight default (True) in flexllmgen/flex_opt.py:1302-1303:

parser.add_argument("--pin-weight", type=str2bool, nargs="?",
    const=True, default=True)
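
The str2bool helper is not shown in the excerpt; a typical implementation of this argparse pattern (a hypothetical sketch, not FlexLLMGen's verbatim code) looks like:

```python
import argparse

def str2bool(v):
    """Parse common truthy/falsy strings for argparse (hypothetical sketch)."""
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if v.lower() in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected.")

parser = argparse.ArgumentParser()
parser.add_argument("--pin-weight", type=str2bool, nargs="?",
                    const=True, default=True)

# Strategy 1 corresponds to passing "--pin-weight 0"
args = parser.parse_args(["--pin-weight", "0"])
```

Because default=True, weights are pinned unless the user opts out, which is why `--pin-weight 0` must be passed explicitly.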

Pinned memory relay for non-pinned CPU tensors in flexllmgen/pytorch_backend.py:845-849:

elif (src.device.device_type == DeviceType.CPU and
      dst.device.device_type == DeviceType.CUDA and
      not src.data.is_pinned()):
    # The cpu tensor is not pinned, use pin_memory as a relay
    src = src.pin_memory()
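
The relay condition above can be isolated as a small predicate. This is a simplified, GPU-free sketch: the DeviceType enum here is a stand-in mirroring the names in the excerpt, not FlexLLMGen's actual class.

```python
from enum import Enum, auto

class DeviceType(Enum):   # simplified stand-in for FlexLLMGen's DeviceType
    CPU = auto()
    CUDA = auto()
    DISK = auto()

def needs_pin_relay(src_type, dst_type, src_is_pinned):
    """True when a CPU->CUDA copy must first stage through a pinned buffer."""
    return (src_type == DeviceType.CPU and
            dst_type == DeviceType.CUDA and
            not src_is_pinned)
```

This extra staging copy is the transfer slowdown that Strategy 1 accepts in exchange for the ~20% CPU memory savings.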
