Heuristic:FMInference FlexLLMGen Offloading Percent Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Inference |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Guide for tuning FlexLLMGen's six-number `--percent` parameter to distribute weights, KV cache, and activations across GPU, CPU, and disk for optimal throughput.
Description
The `--percent` argument accepts six integers that control what percentage of each tensor type (weights, attention cache, activations) resides on GPU versus CPU, with the remainder going to disk. The placement code uses a cumulative-midpoint rule: each weight tensor is assigned to a device based on where its midpoint falls in the cumulative size distribution. Choosing the right percentages is critical for maximizing throughput while avoiding out-of-memory errors.
Usage
Use this heuristic when configuring FlexLLMGen for a new model/hardware combination. The default `100 0 100 0 100 0` keeps everything on GPU, which works only for models that fit entirely in VRAM. For larger models, you need to tune these percentages manually based on available GPU VRAM, CPU DRAM, and disk capacity.
The Insight (Rule of Thumb)
- Format: `--percent W_GPU W_CPU C_GPU C_CPU A_GPU A_CPU` (six numbers, each 0-100).
- Disk allocation is implicit: `disk% = 100 - GPU% - CPU%` for each tensor type.
- Default (small models): `100 0 100 0 100 0` = everything on GPU.
- CPU offload (medium models): `0 100 100 0 100 0` = weights on CPU, cache and activations on GPU. Needs ~90 GB CPU RAM for OPT-30B.
- Disk offload (large models): `0 0 100 0 100 0` = weights on disk, cache and activations on GPU. Minimal CPU/GPU memory needed.
- Priority order: keep activations on GPU (fastest), then cache, then weights. Weights are the largest but are read only once per layer per batch.
- Batch size interaction: a larger `--gpu-batch-size` increases throughput but requires more cache/activation memory. Start with `--gpu-batch-size 4` (the default) and increase until OOM.
- Trade-off: more offloading to CPU/disk reduces memory pressure but increases I/O latency. FlexLLMGen's block schedule and I/O overlap (`--overlap`) mitigate this.
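The implicit-disk rule above can be sketched as a small pre-flight check. This is an illustrative helper, not part of FlexLLMGen's API: it derives the disk share from a six-number `--percent` list and validates it before launching a run.

```python
# Hedged sketch: compute the implicit disk percentage for each tensor
# type from a six-number --percent list. The helper name and return
# shape are illustrative assumptions, not FlexLLMGen code.
def split_percents(percent):
    assert len(percent) == 6, "--percent takes six integers"
    names = ["weights", "cache", "activations"]
    policy = {}
    for name, (gpu, cpu) in zip(names, zip(percent[::2], percent[1::2])):
        disk = 100 - gpu - cpu
        assert 0 <= disk <= 100, f"{name}: GPU% + CPU% must not exceed 100"
        policy[name] = {"gpu": gpu, "cpu": cpu, "disk": disk}
    return policy

# e.g. the disk-offload preset for large models:
split_percents([0, 0, 100, 0, 100, 0])
# → weights 100% on disk; cache and activations stay on GPU
```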
Reasoning
FlexLLMGen's throughput-oriented design maximizes batch size to amortize I/O costs. The block schedule (processing all micro-batches for one layer before moving to the next) means each weight tensor is loaded once and reused across all micro-batches. This makes weight offloading relatively cheap compared to cache or activation offloading.
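The loop order behind this argument can be sketched in a few lines. This is a simplified illustration of the block-schedule idea, not FlexLLMGen's actual scheduler: because layers are the outer loop, each layer's weights cross the I/O boundary once, however many micro-batches there are.

```python
# Illustrative sketch of a block schedule: weights are loaded once per
# layer and reused across all micro-batches, so weight offloading costs
# one I/O per layer rather than one per (layer, micro-batch) pair.
def block_schedule(num_layers, num_microbatches, load_weight, compute):
    for layer in range(num_layers):
        w = load_weight(layer)            # one (possibly slow) load per layer
        for mb in range(num_microbatches):
            compute(layer, mb, w)         # weight tensor reused here
```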
The benchmark suite in `bench_suite.py` shows empirically optimal configurations:
- OPT-6.7B on T4 (16GB): Can fit entirely on GPU with batch size 2-4.
- OPT-30B on T4 (16GB): Requires CPU offloading for weights; batch size 144 achieves 7.32 tok/s.
- OPT-175B on T4 (16GB): Requires disk offloading for weights; batch size 256 achieves 0.69 tok/s (1.12 tok/s with compression).
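A back-of-envelope check makes these regimes plausible. Assuming fp16 weights (2 bytes per parameter) and ignoring KV cache and workspace, OPT-30B's weights alone are ~60 GB, consistent with the ~90 GB CPU RAM figure above once headroom is added, while OPT-175B's ~350 GB exceeds typical DRAM and forces disk offload.

```python
# Rough fp16 weight footprint; assumption: 2 bytes/param, no KV cache
# or workspace counted. Used only to sanity-check the regimes above.
def weight_gb(num_params, bytes_per_param=2):
    return num_params * bytes_per_param / 1e9

weight_gb(30e9)   # OPT-30B: ~60 GB, fits in large CPU RAM, not 16 GB VRAM
weight_gb(175e9)  # OPT-175B: ~350 GB, beyond typical DRAM, hence disk
```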
Code Evidence
Six-number percent argument from flexllmgen/flex_opt.py:1291-1299:
parser.add_argument("--percent", nargs="+", type=int,
default=[100, 0, 100, 0, 100, 0],
help="Six numbers. They are "
"the percentage of weight on GPU, "
"the percentage of weight on CPU, "
"the percentage of attention cache on GPU, "
"the percentage of attention cache on CPU, "
"the percentage of activations on GPU, "
"the percentage of activations on CPU")
Weight placement by cumulative percentage from flexllmgen/flex_opt.py:82-89:
```python
import numpy as np  # needed by this excerpt

def get_choice(cur_percent, percents, choices):
    percents = np.cumsum(percents)
    assert np.abs(percents[-1] - 100) < 1e-5
    for i in range(len(percents)):
        if cur_percent < percents[i]:
            return choices[i]
    return choices[-1]
```
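To see the rule in action: the caller passes each tensor's midpoint in the cumulative size distribution as `cur_percent`, and the first device whose cumulative share exceeds it wins. The device labels below are illustrative stand-ins for FlexLLMGen's device objects.

```python
import numpy as np

def get_choice(cur_percent, percents, choices):
    # As in flex_opt.py: pick the first device whose cumulative
    # percentage bucket contains cur_percent.
    percents = np.cumsum(percents)
    assert np.abs(percents[-1] - 100) < 1e-5
    for i in range(len(percents)):
        if cur_percent < percents[i]:
            return choices[i]
    return choices[-1]

# With the CPU-offload weight split (0% GPU, 100% CPU, 0% disk),
# any midpoint lands in the CPU bucket:
get_choice(50.0, [0, 100, 0], ["gpu", "cpu", "disk"])  # → "cpu"
# With a 60/30/10 split, a midpoint at 99 falls in the disk bucket:
get_choice(99.0, [60, 30, 10], ["gpu", "cpu", "disk"])  # → "disk"
```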
README FAQ on strategy tuning at README.md:156-160:
> We will release an automatic policy optimizer later, but now you have to manually
> try a few strategies. The idea of high-throughput generation is to offload
> parameters and attention cache as much as possible to the CPU and disk if necessary.
> You can see the reference strategies in our benchmark.
> To avoid out-of-memory, you can tune the --percent to offload more tensors to the
> CPU and disk.