Implementation:FMInference_FlexLLMGen_Policy
| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen, Paper: FlexGen |
| Domains | Inference_Optimization, Memory_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
The Policy dataclass in the FlexLLMGen library is the concrete configuration point for its three-tier (GPU/CPU/disk) memory-offloading strategy.
Description
The Policy class is a frozen dataclass that controls how model weights, KV cache, and activations are distributed across GPU, CPU, and disk. It also controls I/O-compute overlap, attention sparsity, and optional 4-bit quantization for weights and cache.
The six percentage parameters (w_gpu_percent, w_cpu_percent, cache_gpu_percent, cache_cpu_percent, act_gpu_percent, act_cpu_percent) control GPU and CPU allocation for each tensor type; the remainder automatically goes to disk. For example, w_gpu_percent=20 and w_cpu_percent=80 sum to 100, so no weights reside on disk, whereas w_gpu_percent=20 and w_cpu_percent=50 would leave the remaining 30% of weights on disk.
Additional controls include:
- overlap -- Enables I/O-compute pipelining to hide data transfer latency.
- sep_layer -- Separates attention and MLP into two distinct layers for finer-grained scheduling.
- pin_weight -- Uses pinned (page-locked) memory for CPU weight storage, accelerating GPU transfers.
- cpu_cache_compute -- Computes attention on CPU rather than transferring the KV cache to GPU.
- attn_sparsity -- Fraction of attention weights retained (1.0 = dense, full attention; values below 1.0 enable top-k sparse attention over the highest-scoring fraction).
- compress_weight / compress_cache -- Enable 4-bit group quantization for weights and KV cache respectively.
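The remainder rule for the six percentage parameters can be sketched with a small helper. Note that disk_percent is a hypothetical function written for illustration, not part of the FlexLLMGen API:

```python
# Hypothetical helper (not in FlexLLMGen) illustrating the remainder
# rule: whatever share is not assigned to GPU or CPU spills to disk.
def disk_percent(gpu_percent: float, cpu_percent: float) -> float:
    return 100.0 - gpu_percent - cpu_percent

print(disk_percent(20, 80))  # -> 0.0  (GPU+CPU cover everything; nothing on disk)
print(disk_percent(0, 50))   # -> 50.0 (half of this tensor type spills to disk)
```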
Usage
Create a Policy instance before initializing OptLM. The six percentage parameters control GPU/CPU allocation (remainder goes to disk). Enable overlap=True for I/O-compute pipelining.
Code Reference
| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | flexllmgen/flex_opt.py |
| Lines | 33-80 |
Signature:
@dataclasses.dataclass(frozen=True)
class Policy:
gpu_batch_size: int
num_gpu_batches: int
w_gpu_percent: float
w_cpu_percent: float
cache_gpu_percent: float
cache_cpu_percent: float
act_gpu_percent: float
act_cpu_percent: float
overlap: bool
sep_layer: bool
pin_weight: bool
cpu_cache_compute: bool
attn_sparsity: float
compress_weight: bool
comp_weight_config: CompressionConfig
compress_cache: bool
comp_cache_config: CompressionConfig
Import:
from flexllmgen.flex_opt import Policy
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| gpu_batch_size | int | Yes | Number of sequences per GPU micro-batch |
| num_gpu_batches | int | Yes | Number of micro-batches to pipeline |
| w_gpu_percent | float | Yes | Percentage of weights on GPU |
| w_cpu_percent | float | Yes | Percentage of weights on CPU |
| cache_gpu_percent | float | Yes | Percentage of KV cache on GPU |
| cache_cpu_percent | float | Yes | Percentage of KV cache on CPU |
| act_gpu_percent | float | Yes | Percentage of activations on GPU |
| act_cpu_percent | float | Yes | Percentage of activations on CPU |
| overlap | bool | Yes | Enable I/O-compute overlap |
| sep_layer | bool | Yes | Separate attention and MLP as two layers |
| pin_weight | bool | Yes | Use pinned memory for CPU weights |
| cpu_cache_compute | bool | Yes | Compute attention on CPU |
| attn_sparsity | float | Yes | Fraction of attention retained (1.0 = dense; <1.0 = top-k sparse) |
| compress_weight | bool | Yes | Enable 4-bit weight quantization |
| comp_weight_config | CompressionConfig | Yes | Weight compression parameters |
| compress_cache | bool | Yes | Enable 4-bit cache quantization |
| comp_cache_config | CompressionConfig | Yes | Cache compression parameters |
Outputs
| Output | Type | Description |
|---|---|---|
| Policy | frozen dataclass instance | Controls all memory placement decisions |
| w_disk_percent | computed property | 100 - w_gpu_percent - w_cpu_percent |
| cache_disk_percent | computed property | 100 - cache_gpu_percent - cache_cpu_percent |
| act_disk_percent | computed property | 100 - act_gpu_percent - act_cpu_percent |
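The disk shares in the table above are derived, not stored. The following is a minimal stand-in mirroring how frozen-dataclass properties can expose them; it is simplified for illustration (only two tensor types) so the snippet runs without FlexLLMGen installed, and PolicySketch is not the real class:

```python
import dataclasses

# Simplified stand-in for Policy: the disk share of each tensor type
# is computed from the GPU and CPU shares rather than stored.
@dataclasses.dataclass(frozen=True)
class PolicySketch:
    w_gpu_percent: float
    w_cpu_percent: float
    cache_gpu_percent: float
    cache_cpu_percent: float

    @property
    def w_disk_percent(self) -> float:
        return 100 - self.w_gpu_percent - self.w_cpu_percent

    @property
    def cache_disk_percent(self) -> float:
        return 100 - self.cache_gpu_percent - self.cache_cpu_percent

p = PolicySketch(w_gpu_percent=0, w_cpu_percent=50,
                 cache_gpu_percent=0, cache_cpu_percent=50)
print(p.w_disk_percent)      # -> 50 (half the weights on disk)
print(p.cache_disk_percent)  # -> 50 (half the KV cache on disk)
```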
Usage Examples
Example 1: All-GPU inference (model fits in GPU memory)
from flexllmgen.flex_opt import Policy
from flexllmgen.compression import CompressionConfig
# All tensors on GPU -- no offloading
policy_gpu = Policy(
gpu_batch_size=4,
num_gpu_batches=1,
# 100% weights on GPU, 0% on CPU, 0% on disk
w_gpu_percent=100,
w_cpu_percent=0,
# 100% KV cache on GPU
cache_gpu_percent=100,
cache_cpu_percent=0,
# 100% activations on GPU
act_gpu_percent=100,
act_cpu_percent=0,
overlap=False,
sep_layer=False,
pin_weight=False,
cpu_cache_compute=False,
attn_sparsity=1.0,  # dense (full) attention
compress_weight=False,
comp_weight_config=CompressionConfig(num_bits=4, group_size=64, group_dim=0, symmetric=False, enabled=False),
compress_cache=False,
comp_cache_config=CompressionConfig(num_bits=4, group_size=64, group_dim=2, symmetric=False, enabled=False),
)
Example 2: Offloaded inference with compression (large model on limited GPU)
from flexllmgen.flex_opt import Policy
from flexllmgen.compression import CompressionConfig
# Offload weights and cache to CPU/disk with 4-bit compression
policy_offload = Policy(
gpu_batch_size=2,
num_gpu_batches=4,
# 0% weights on GPU, 50% on CPU, 50% on disk
w_gpu_percent=0,
w_cpu_percent=50,
# 0% KV cache on GPU, 50% on CPU, 50% on disk
cache_gpu_percent=0,
cache_cpu_percent=50,
# 0% activations on GPU, 100% on CPU, 0% on disk
act_gpu_percent=0,
act_cpu_percent=100,
overlap=True,
sep_layer=False,
pin_weight=True,
cpu_cache_compute=False,
attn_sparsity=1.0,  # dense (full) attention
compress_weight=True,
comp_weight_config=CompressionConfig(num_bits=4, group_size=64, group_dim=0, symmetric=False, enabled=True),
compress_cache=True,
comp_cache_config=CompressionConfig(num_bits=4, group_size=64, group_dim=2, symmetric=False, enabled=True),
)
Related Pages
- Principle:FMInference_FlexLLMGen_Offloading_Policy_Configuration
- Environment:FMInference_FlexLLMGen_CUDA_GPU
- Heuristic:FMInference_FlexLLMGen_OOM_Memory_Management
- Heuristic:FMInference_FlexLLMGen_Offloading_Percent_Tuning
- Heuristic:FMInference_FlexLLMGen_Pin_Memory_Tradeoffs
- Heuristic:FMInference_FlexLLMGen_Weight_Compression_4bit