Implementation:FMInference_FlexLLMGen_Policy
| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen, Paper: FlexGen |
| Domains | Inference_Optimization, Memory_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
The Policy dataclass in the FlexLLMGen library is the concrete configuration point for its three-tier (GPU/CPU/disk) memory-offloading strategy.
Description
The Policy class is a frozen dataclass that controls how model weights, KV cache, and activations are distributed across GPU, CPU, and disk. It also controls I/O-compute overlap, attention sparsity, and optional 4-bit quantization for weights and cache.
The six percentage parameters (w_gpu_percent, w_cpu_percent, cache_gpu_percent, cache_cpu_percent, act_gpu_percent, act_cpu_percent) control GPU and CPU allocation for each tensor type; the remainder automatically goes to disk. For example, w_gpu_percent=20 and w_cpu_percent=80 sum to 100, so no weights reside on disk, whereas w_gpu_percent=20 and w_cpu_percent=50 would leave the remaining 30% of weights on disk.
Additional controls include:
- overlap -- Enables I/O-compute pipelining to hide data transfer latency.
- sep_layer -- Separates attention and MLP into two distinct layers for finer-grained scheduling.
- pin_weight -- Uses pinned (page-locked) memory for CPU weight storage, accelerating GPU transfers.
- cpu_cache_compute -- Computes attention on CPU rather than transferring the KV cache to GPU.
- attn_sparsity -- Fraction of attention weights retained (1.0 = dense, full attention; values below 1.0 enable top-k sparse attention over the highest-scoring fraction).
- compress_weight / compress_cache -- Enable 4-bit group quantization for weights and KV cache respectively.
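The remainder rule for the six percentage parameters can be sketched with a small helper. Note that disk_percent is a hypothetical function written for illustration, not part of the FlexLLMGen API:

```python
# Hypothetical helper (not in FlexLLMGen) illustrating the remainder
# rule: whatever share is not assigned to GPU or CPU spills to disk.
def disk_percent(gpu_percent: float, cpu_percent: float) -> float:
    return 100.0 - gpu_percent - cpu_percent

print(disk_percent(20, 80))  # -> 0.0  (GPU+CPU cover everything; nothing on disk)
print(disk_percent(0, 50))   # -> 50.0 (half of this tensor type spills to disk)
```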
Usage
Create a Policy instance before initializing OptLM. The six percentage parameters control GPU/CPU allocation (remainder goes to disk). Enable overlap=True for I/O-compute pipelining.
Code Reference
| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | flexllmgen/flex_opt.py |
| Lines | 33-80 |
Signature:
@dataclasses.dataclass(frozen=True)
class Policy:
gpu_batch_size: int
num_gpu_batches: int
w_gpu_percent: float
w_cpu_percent: float
cache_gpu_percent: float
cache_cpu_percent: float
act_gpu_percent: float
act_cpu_percent: float
overlap: bool
sep_layer: bool
pin_weight: bool
cpu_cache_compute: bool
attn_sparsity: float
compress_weight: bool
comp_weight_config: CompressionConfig
compress_cache: bool
comp_cache_config: CompressionConfig
Import:
from flexllmgen.flex_opt import Policy
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| gpu_batch_size | int | Yes | Number of sequences per GPU micro-batch |
| num_gpu_batches | int | Yes | Number of micro-batches to pipeline |
| w_gpu_percent | float | Yes | Percentage of weights on GPU |
| w_cpu_percent | float | Yes | Percentage of weights on CPU |
| cache_gpu_percent | float | Yes | Percentage of KV cache on GPU |
| cache_cpu_percent | float | Yes | Percentage of KV cache on CPU |
| act_gpu_percent | float | Yes | Percentage of activations on GPU |
| act_cpu_percent | float | Yes | Percentage of activations on CPU |
| overlap | bool | Yes | Enable I/O-compute overlap |
| sep_layer | bool | Yes | Separate attention and MLP as two layers |
| pin_weight | bool | Yes | Use pinned memory for CPU weights |
| cpu_cache_compute | bool | Yes | Compute attention on CPU |
| attn_sparsity | float | Yes | Fraction of attention retained (1.0 = dense; <1.0 = top-k sparse) |
| compress_weight | bool | Yes | Enable 4-bit weight quantization |
| comp_weight_config | CompressionConfig | Yes | Weight compression parameters |
| compress_cache | bool | Yes | Enable 4-bit cache quantization |
| comp_cache_config | CompressionConfig | Yes | Cache compression parameters |
Outputs
| Output | Type | Description |
|---|---|---|
| Policy | frozen dataclass instance | Controls all memory placement decisions |
| w_disk_percent | computed property | 100 - w_gpu_percent - w_cpu_percent |
| cache_disk_percent | computed property | 100 - cache_gpu_percent - cache_cpu_percent |
| act_disk_percent | computed property | 100 - act_gpu_percent - act_cpu_percent |
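The disk shares in the table above are derived, not stored. The following is a minimal stand-in mirroring how frozen-dataclass properties can expose them; it is simplified for illustration (only two tensor types) so the snippet runs without FlexLLMGen installed, and PolicySketch is not the real class:

```python
import dataclasses

# Simplified stand-in for Policy: the disk share of each tensor type
# is computed from the GPU and CPU shares rather than stored.
@dataclasses.dataclass(frozen=True)
class PolicySketch:
    w_gpu_percent: float
    w_cpu_percent: float
    cache_gpu_percent: float
    cache_cpu_percent: float

    @property
    def w_disk_percent(self) -> float:
        return 100 - self.w_gpu_percent - self.w_cpu_percent

    @property
    def cache_disk_percent(self) -> float:
        return 100 - self.cache_gpu_percent - self.cache_cpu_percent

p = PolicySketch(w_gpu_percent=0, w_cpu_percent=50,
                 cache_gpu_percent=0, cache_cpu_percent=50)
print(p.w_disk_percent)      # -> 50 (half the weights on disk)
print(p.cache_disk_percent)  # -> 50 (half the KV cache on disk)
```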
Usage Examples
Example 1: All-GPU inference (model fits in GPU memory)
from flexllmgen.flex_opt import Policy
from flexllmgen.compression import CompressionConfig
# All tensors on GPU -- no offloading
policy_gpu = Policy(
gpu_batch_size=4,
num_gpu_batches=1,
# 100% weights on GPU, 0% on CPU, 0% on disk
w_gpu_percent=100,
w_cpu_percent=0,
# 100% KV cache on GPU
cache_gpu_percent=100,
cache_cpu_percent=0,
# 100% activations on GPU
act_gpu_percent=100,
act_cpu_percent=0,
overlap=False,
sep_layer=False,
pin_weight=False,
cpu_cache_compute=False,
attn_sparsity=1.0,  # dense (full) attention
compress_weight=False,
comp_weight_config=CompressionConfig(num_bits=4, group_size=64, group_dim=0, symmetric=False, enabled=False),
compress_cache=False,
comp_cache_config=CompressionConfig(num_bits=4, group_size=64, group_dim=2, symmetric=False, enabled=False),
)
Example 2: Offloaded inference with compression (large model on limited GPU)
from flexllmgen.flex_opt import Policy
from flexllmgen.compression import CompressionConfig
# Offload weights and cache to CPU/disk with 4-bit compression
policy_offload = Policy(
gpu_batch_size=2,
num_gpu_batches=4,
# 0% weights on GPU, 50% on CPU, 50% on disk
w_gpu_percent=0,
w_cpu_percent=50,
# 0% KV cache on GPU, 50% on CPU, 50% on disk
cache_gpu_percent=0,
cache_cpu_percent=50,
# 0% activations on GPU, 100% on CPU, 0% on disk
act_gpu_percent=0,
act_cpu_percent=100,
overlap=True,
sep_layer=False,
pin_weight=True,
cpu_cache_compute=False,
attn_sparsity=1.0,  # dense (full) attention
compress_weight=True,
comp_weight_config=CompressionConfig(num_bits=4, group_size=64, group_dim=0, symmetric=False, enabled=True),
compress_cache=True,
comp_cache_config=CompressionConfig(num_bits=4, group_size=64, group_dim=2, symmetric=False, enabled=True),
)
Related Pages
- Principle:FMInference_FlexLLMGen_Offloading_Policy_Configuration
- Environment:FMInference_FlexLLMGen_CUDA_GPU
- Heuristic:FMInference_FlexLLMGen_OOM_Memory_Management
- Heuristic:FMInference_FlexLLMGen_Offloading_Percent_Tuning
- Heuristic:FMInference_FlexLLMGen_Pin_Memory_Tradeoffs
- Heuristic:FMInference_FlexLLMGen_Weight_Compression_4bit