Principle:FMInference FlexLLMGen Offloading Policy Configuration
| Field | Value |
|---|---|
| Sources | Paper: FlexGen, Repo: FlexLLMGen |
| Domains | Inference_Optimization, Memory_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A memory management strategy that distributes model tensors (weights, KV cache, activations) across a three-tier hierarchy of GPU, CPU, and disk to enable inference of models larger than GPU memory.
Description
Policy-based offloading is the core idea behind FlexGen's ability to run extremely large language models on limited hardware. Rather than requiring the entire model to fit in GPU memory, the system allows users to control what percentage of weights, KV cache, and activations reside on each tier of the memory hierarchy: GPU, CPU (DRAM), and disk (NVMe SSD).
This approach enables running 175B-parameter models on a single GPU by treating the three storage tiers as a unified memory pool. The offloading policy is expressed as a frozen dataclass that controls all placement decisions for every tensor type during inference.
Key aspects of the policy-based offloading approach:
- Per-tensor-type control -- Weights, KV cache, and activations each have independent GPU/CPU/disk allocation percentages.
- Complementary percentages -- For each tensor type, the disk percentage is computed as the remainder: disk% = 100 - gpu% - cpu%.
- Frozen configuration -- Once created, the policy is immutable, ensuring consistent behavior throughout the inference run.
- Holistic optimization -- The policy also controls I/O-compute overlap, attention sparsity, and optional 4-bit quantization for weights and cache.
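The policy shape described by the list above can be sketched as a frozen dataclass. This is a simplified illustration, not the repo's exact `Policy` class; the field names and defaults are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OffloadingPolicy:
    """Per-tensor-type placement percentages (illustrative sketch)."""
    w_gpu_percent: float      # weights on GPU
    w_cpu_percent: float      # weights on CPU; remainder goes to disk
    cache_gpu_percent: float  # KV cache on GPU
    cache_cpu_percent: float  # KV cache on CPU
    act_gpu_percent: float    # activations on GPU
    act_cpu_percent: float    # activations on CPU
    overlap: bool = True           # overlap I/O with compute
    attn_sparsity: float = 1.0     # 1.0 = dense attention
    compress_weight: bool = False  # optional 4-bit weight quantization
    compress_cache: bool = False   # optional 4-bit KV-cache quantization

    def __post_init__(self):
        # Complementary-percentage invariant: each GPU+CPU pair <= 100,
        # with the remainder implicitly assigned to disk.
        for g, c in [(self.w_gpu_percent, self.w_cpu_percent),
                     (self.cache_gpu_percent, self.cache_cpu_percent),
                     (self.act_gpu_percent, self.act_cpu_percent)]:
            assert 0 <= g and 0 <= c and g + c <= 100

    @property
    def w_disk_percent(self) -> float:
        return 100 - self.w_gpu_percent - self.w_cpu_percent
```

Because the dataclass is frozen, any attempt to mutate a field after construction raises `dataclasses.FrozenInstanceError`, which enforces the immutability property noted above.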
Usage
Use policy-based offloading when running inference on models that exceed available GPU memory. The policy specifies the percentage of each tensor type stored on each tier of the memory hierarchy. It is combined with I/O-compute overlap, which hides data-transfer latency, and with quantization, which further reduces the memory footprint and bandwidth requirements.
Typical scenarios include:
- Running OPT-175B on a single 16 GB GPU with 200 GB CPU DRAM and NVMe SSD.
- Trading throughput for accessibility by offloading most tensors to CPU or disk.
- Fine-tuning the GPU/CPU/disk balance to maximize throughput for a given hardware setup.
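As a back-of-envelope check for the first scenario above, the per-tier weight footprint follows directly from the percentages. The helper below is a sketch; the 0%/50% split is illustrative, not a recommended policy:

```python
def tier_bytes(total_bytes, gpu_pct, cpu_pct):
    """Split a tensor group's total size across GPU / CPU / disk by percentage."""
    gpu = total_bytes * gpu_pct / 100
    cpu = total_bytes * cpu_pct / 100
    disk = total_bytes - gpu - cpu  # remainder lands on disk
    return gpu, cpu, disk

# OPT-175B weights in fp16: ~175e9 params * 2 bytes = 350 GB total.
W = 175e9 * 2
gpu, cpu, disk = tier_bytes(W, gpu_pct=0, cpu_pct=50)
print(f"GPU {gpu/1e9:.0f} GB, CPU {cpu/1e9:.0f} GB, disk {disk/1e9:.0f} GB")
```

With this split, none of the 350 GB of weights touches the 16 GB GPU, 175 GB sits in the 200 GB of CPU DRAM, and the remaining 175 GB streams from the NVMe SSD.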
Theoretical Basis
The three-tier offloading model treats GPU, CPU DRAM, and NVMe SSD as a unified memory pool. Given a fixed GPU memory budget M_gpu, CPU budget M_cpu, and disk budget M_disk, the policy assigns fractions for each tensor type:
- Weights: w_gpu%, w_cpu% (remainder to disk)
- KV cache: cache_gpu%, cache_cpu% (remainder to disk)
- Activations: act_gpu%, act_cpu% (remainder to disk)
The total memory consumption on each tier must satisfy:
w_gpu% * W + cache_gpu% * C + act_gpu% * A <= M_gpu
w_cpu% * W + cache_cpu% * C + act_cpu% * A <= M_cpu
(100% - w_gpu% - w_cpu%) * W + (100% - cache_gpu% - cache_cpu%) * C + (100% - act_gpu% - act_cpu%) * A <= M_disk
where W is total weight size, C is total KV cache size, and A is total activation size.
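The capacity constraints translate directly into a feasibility check. The function below is a sketch; the dictionary keys, sizes, and budgets are illustrative:

```python
def feasible(policy, W, C, A, M_gpu, M_cpu, M_disk):
    """Check the per-tier capacity constraints for a candidate policy.

    `policy` maps e.g. "w_gpu" -> percentage of weights on GPU; W, C, A
    and the M_* budgets are in bytes. Each remainder lands on disk.
    """
    def tier(suffix):  # bytes placed on one tier across all tensor types
        return (policy[f"w_{suffix}"] * W +
                policy[f"cache_{suffix}"] * C +
                policy[f"act_{suffix}"] * A) / 100

    gpu = tier("gpu")
    cpu = tier("cpu")
    disk = (W + C + A) - gpu - cpu  # complementary remainder
    return gpu <= M_gpu and cpu <= M_cpu and disk <= M_disk
```

For example, with W = 350 GB, C = 40 GB, A = 2 GB and budgets of 16 GB GPU / 200 GB CPU / 1 TB disk, placing half the weights and half the cache on CPU and all activations on GPU passes the check, while pushing the full cache into CPU DRAM (215 GB total) does not.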
The optimal policy can be found via linear programming using a cost model that accounts for compute time, memory read/write latencies for each tier, and overlap opportunities. This cost model minimizes end-to-end latency subject to the memory capacity constraints.
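The paper solves this search with a linear program over a calibrated cost model; a coarse stand-in is a grid search over the percentage space with a toy I/O-latency model. The bandwidth numbers and the cost function below are placeholder assumptions, not the paper's model:

```python
from itertools import product

def search_policy(W, C, A, M_gpu, M_cpu, bw_cpu=20e9, bw_disk=2e9, step=25):
    """Brute-force the per-tensor-type splits minimizing a toy I/O cost.

    Cost = bytes resident on each slow tier / that tier's bandwidth,
    treating GPU-resident data as free and ignoring overlap. Percentages
    are searched on a coarse grid with `step`-point spacing.
    """
    sizes = [W, C, A]  # weights, KV cache, activations
    grid = range(0, 101, step)
    pairs = [(g, c) for g in grid for c in grid if g + c <= 100]
    best, best_cost = None, float("inf")
    for combo in product(pairs, repeat=3):
        gpu_b = sum(g / 100 * s for (g, _), s in zip(combo, sizes))
        cpu_b = sum(c / 100 * s for (_, c), s in zip(combo, sizes))
        disk_b = sum(sizes) - gpu_b - cpu_b  # complementary remainder
        if gpu_b > M_gpu or cpu_b > M_cpu:
            continue  # violates the capacity constraints
        cost = cpu_b / bw_cpu + disk_b / bw_disk
        if cost < best_cost:
            best, best_cost = combo, cost
    return best, best_cost
```

When everything fits on the GPU, the search correctly returns 100% GPU placement for all three tensor types at zero I/O cost; as budgets shrink, it spills first to CPU (faster) and then to disk (slower), mirroring the LP's preference ordering.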