
Implementation:FMInference FlexLLMGen Policy

From Leeroopedia


Field Value
Sources Repo: FlexLLMGen, Paper: FlexGen
Domains Inference_Optimization, Memory_Management
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete configuration object from the FlexLLMGen library for setting a three-tier (GPU/CPU/disk) memory offloading strategy.

Description

The Policy class is a frozen dataclass that controls how model weights, KV cache, and activations are distributed across GPU, CPU, and disk. It also controls I/O-compute overlap, attention sparsity, and optional 4-bit quantization for weights and cache.

The six percentage parameters (w_gpu_percent, w_cpu_percent, cache_gpu_percent, cache_cpu_percent, act_gpu_percent, act_cpu_percent) control GPU and CPU allocation for each tensor type. The remainder automatically goes to disk. For example, setting w_gpu_percent=20 and w_cpu_percent=80 places 20% of the weights on GPU, 80% on CPU, and 0% on disk.
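The remainder-goes-to-disk rule can be sketched in a few lines. This helper is illustrative only, not part of the library; it just computes the implied disk share for any tensor type:

```python
# Minimal sketch of the remainder-goes-to-disk rule (illustration,
# not library code): the disk share of a tensor type is whatever
# its GPU and CPU shares do not cover.
def disk_percent(gpu_percent: float, cpu_percent: float) -> float:
    """Return the share of a tensor type left for disk."""
    remainder = 100 - gpu_percent - cpu_percent
    assert remainder >= 0, "GPU + CPU shares must not exceed 100"
    return remainder

print(disk_percent(20, 80))  # weights example above: 0% on disk
print(disk_percent(0, 50))   # half of the tensor spills to disk
```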

Additional controls include:

  • overlap -- Enables I/O-compute pipelining to hide data transfer latency.
  • sep_layer -- Separates attention and MLP into two distinct layers for finer-grained scheduling.
  • pin_weight -- Uses pinned (page-locked) memory for CPU weight storage, accelerating GPU transfers.
  • cpu_cache_compute -- Computes attention on CPU rather than transferring the KV cache to GPU.
  • attn_sparsity -- Fraction of attention weights retained (1.0 = full, dense attention; smaller values compute only the top fraction of attention scores).
  • compress_weight / compress_cache -- Enable 4-bit group quantization for weights and KV cache respectively.
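The compress_weight / compress_cache options refer to the library's 4-bit group quantization, configured via CompressionConfig. As a rough illustration only — the real implementation lives in flexllmgen/compression.py, and this sketch assumes simple asymmetric min-max quantization over fixed-size groups:

```python
import numpy as np

# Illustrative sketch of asymmetric 4-bit group quantization
# (assumption: per-group min-max scaling; NOT the library's code).
def quantize_group(x: np.ndarray, num_bits: int = 4, group_size: int = 64):
    """Quantize a 1-D array in groups of `group_size` values."""
    g = x.reshape(-1, group_size)
    mn = g.min(axis=1, keepdims=True)
    mx = g.max(axis=1, keepdims=True)
    scale = (mx - mn) / (2**num_bits - 1)          # 16 levels for 4 bits
    q = np.round((g - mn) / np.maximum(scale, 1e-12)).astype(np.uint8)
    return q, mn, scale

def dequantize_group(q, mn, scale):
    """Reconstruct approximate float values from quantized groups."""
    return q * scale + mn

x = np.random.randn(256).astype(np.float32)
q, mn, scale = quantize_group(x)
x_hat = dequantize_group(q, mn, scale).reshape(-1)
print(np.abs(x - x_hat).max())  # reconstruction error, bounded by scale/2
```

Storing 4-bit codes plus per-group min and scale cuts weight and KV-cache memory roughly 4x versus fp16, which is why these flags matter most in heavily offloaded configurations.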

Usage

Create a Policy instance before initializing OptLM. The six percentage parameters control GPU/CPU allocation (remainder goes to disk). Enable overlap=True for I/O-compute pipelining.
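Because Policy is a frozen dataclass, a configuration cannot be mutated after construction; sweeping over settings means building new instances, for example with dataclasses.replace. A stand-in class is used below so the snippet runs without flexllmgen installed; the real Policy behaves the same way:

```python
import dataclasses

# Stand-in mirroring Policy's frozen-dataclass behavior (illustration
# only; the real class lives in flexllmgen/flex_opt.py).
@dataclasses.dataclass(frozen=True)
class MiniPolicy:
    w_gpu_percent: float
    w_cpu_percent: float
    overlap: bool

p = MiniPolicy(w_gpu_percent=20, w_cpu_percent=80, overlap=True)

# Frozen instances reject attribute assignment ...
try:
    p.w_gpu_percent = 50
except dataclasses.FrozenInstanceError:
    print("Policy fields are immutable after construction")

# ... so a modified configuration is built as a fresh instance.
p2 = dataclasses.replace(p, w_gpu_percent=50)
print(p2.w_gpu_percent)  # 50
```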

Code Reference

Field Value
Repository FlexLLMGen
File flexllmgen/flex_opt.py
Lines 33-80

Signature:

@dataclasses.dataclass(frozen=True)
class Policy:
    gpu_batch_size: int
    num_gpu_batches: int
    w_gpu_percent: float
    w_cpu_percent: float
    cache_gpu_percent: float
    cache_cpu_percent: float
    act_gpu_percent: float
    act_cpu_percent: float
    overlap: bool
    sep_layer: bool
    pin_weight: bool
    cpu_cache_compute: bool
    attn_sparsity: float
    compress_weight: bool
    comp_weight_config: CompressionConfig
    compress_cache: bool
    comp_cache_config: CompressionConfig

Import:

from flexllmgen.flex_opt import Policy

I/O Contract

Inputs

Parameter Type Required Description
gpu_batch_size int Yes Number of sequences per GPU micro-batch
num_gpu_batches int Yes Number of micro-batches to pipeline
w_gpu_percent float Yes Percentage of weights on GPU
w_cpu_percent float Yes Percentage of weights on CPU
cache_gpu_percent float Yes Percentage of KV cache on GPU
cache_cpu_percent float Yes Percentage of KV cache on CPU
act_gpu_percent float Yes Percentage of activations on GPU
act_cpu_percent float Yes Percentage of activations on CPU
overlap bool Yes Enable I/O-compute overlap
sep_layer bool Yes Separate attention and MLP as two layers
pin_weight bool Yes Use pinned memory for CPU weights
cpu_cache_compute bool Yes Compute attention on CPU
attn_sparsity float Yes Fraction of attention weights retained (1.0 = dense)
compress_weight bool Yes Enable 4-bit weight quantization
comp_weight_config CompressionConfig Yes Weight compression parameters
compress_cache bool Yes Enable 4-bit cache quantization
comp_cache_config CompressionConfig Yes Cache compression parameters

Outputs

Output Type Description
Policy frozen dataclass instance Controls all memory placement decisions
w_disk_percent computed property 100 - w_gpu_percent - w_cpu_percent
cache_disk_percent computed property 100 - cache_gpu_percent - cache_cpu_percent
act_disk_percent computed property 100 - act_gpu_percent - act_cpu_percent

Usage Examples

Example 1: All-GPU inference (model fits in GPU memory)

from flexllmgen.flex_opt import Policy
from flexllmgen.compression import CompressionConfig

# All tensors on GPU -- no offloading
policy_gpu = Policy(
    gpu_batch_size=4,
    num_gpu_batches=1,
    # 100% weights on GPU, 0% on CPU, 0% on disk
    w_gpu_percent=100,
    w_cpu_percent=0,
    # 100% KV cache on GPU
    cache_gpu_percent=100,
    cache_cpu_percent=0,
    # 100% activations on GPU
    act_gpu_percent=100,
    act_cpu_percent=0,
    overlap=False,
    sep_layer=False,
    pin_weight=False,
    cpu_cache_compute=False,
    attn_sparsity=1.0,
    compress_weight=False,
    comp_weight_config=CompressionConfig(num_bits=4, group_size=64, group_dim=0, symmetric=False, enabled=False),
    compress_cache=False,
    comp_cache_config=CompressionConfig(num_bits=4, group_size=64, group_dim=2, symmetric=False, enabled=False),
)

Example 2: Offloaded inference with compression (large model on limited GPU)

from flexllmgen.flex_opt import Policy
from flexllmgen.compression import CompressionConfig

# Offload weights and cache to CPU/disk with 4-bit compression
policy_offload = Policy(
    gpu_batch_size=2,
    num_gpu_batches=4,
    # 0% weights on GPU, 50% on CPU, 50% on disk
    w_gpu_percent=0,
    w_cpu_percent=50,
    # 0% KV cache on GPU, 50% on CPU, 50% on disk
    cache_gpu_percent=0,
    cache_cpu_percent=50,
    # 0% activations on GPU, 100% on CPU, 0% on disk
    act_gpu_percent=0,
    act_cpu_percent=100,
    overlap=True,
    sep_layer=False,
    pin_weight=True,
    cpu_cache_compute=False,
    attn_sparsity=1.0,
    compress_weight=True,
    comp_weight_config=CompressionConfig(num_bits=4, group_size=64, group_dim=0, symmetric=False, enabled=True),
    compress_cache=True,
    comp_cache_config=CompressionConfig(num_bits=4, group_size=64, group_dim=2, symmetric=False, enabled=True),
)
