Principle: FMInference/FlexLLMGen Offloaded Model Loading
Metadata
| Field | Value |
|---|---|
| Paper | FlexGen |
| Repo | FlexLLMGen |
Domains
- Inference_Optimization
- Model_Loading
Overview
A model initialization technique that distributes Transformer layer weights across GPU, CPU, and disk according to a policy, enabling loading of models far larger than available GPU memory.
Description
Traditional model loading places all weights on the GPU. Offloaded model loading instead assigns each weight tensor to a device tier (GPU, CPU, or disk) based on the Policy's percentage allocations. Weights are distributed by cumulative size: the first w_gpu_percent of weight bytes go to the GPU, the next w_cpu_percent to the CPU, and the remainder to disk. The model builds a layer pipeline of InputEmbed, TransformerLayer (or separate SelfAttention+MLP), and OutputEmbed modules. CUDA streams are allocated for asynchronous weight and cache loading.
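The cumulative-size distribution can be sketched as follows. This is a minimal illustration, not the FlexLLMGen implementation: the helper name `assign_devices` and the choice to classify each tensor by its byte midpoint are assumptions.

```python
def assign_devices(tensor_sizes, w_gpu_percent, w_cpu_percent):
    """Assign each weight tensor to "gpu", "cpu", or "disk" by its
    cumulative byte position (hypothetical sketch of the policy split)."""
    total = sum(tensor_sizes)
    gpu_cutoff = total * w_gpu_percent / 100
    cpu_cutoff = total * (w_gpu_percent + w_cpu_percent) / 100
    devices, cum = [], 0
    for size in tensor_sizes:
        mid = cum + size / 2  # classify by the tensor's midpoint (an assumption)
        if mid <= gpu_cutoff:
            devices.append("gpu")
        elif mid <= cpu_cutoff:
            devices.append("cpu")
        else:
            devices.append("disk")
        cum += size
    return devices
```

For ten equal-sized tensors with a 50/30/20 policy, the first five land on the GPU, the next three on the CPU, and the last two on disk.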
Usage
Use OptLM to load models that exceed GPU memory. The Policy's percentage parameters control how weights are distributed across the memory tiers. With sep_layer=True, attention and MLP become separate layers, enabling finer-grained pipelining.
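The effect of sep_layer on the layer pipeline can be illustrated with a toy sketch. The `Policy` dataclass and `build_pipeline` helper here are hypothetical stand-ins (the real FlexLLMGen Policy carries many more fields); the module names follow the Description above.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    """Minimal hypothetical policy; only the fields used in this sketch."""
    w_gpu_percent: float
    w_cpu_percent: float
    sep_layer: bool = True

def build_pipeline(num_layers, policy):
    """Build the layer pipeline: InputEmbed, then per-layer modules, then OutputEmbed."""
    layers = ["InputEmbed"]
    for _ in range(num_layers):
        if policy.sep_layer:
            # Separate attention and MLP for finer-grained pipelining.
            layers += ["SelfAttention", "MLP"]
        else:
            layers.append("TransformerLayer")
    layers.append("OutputEmbed")
    return layers
```

With sep_layer=True a 2-layer model yields six pipeline stages instead of four, so weight prefetch and compute can be overlapped at half-layer granularity.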
Theoretical Basis
Layer-wise weight distribution follows a cumulative percentage model. For N total weight bytes and percentages (p_gpu, p_cpu, p_disk), each weight tensor is assigned to the device tier whose cumulative byte range contains it. This yields a roughly proportional split regardless of individual tensor sizes. CUDA streams enable asynchronous I/O that overlaps with computation.
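The overlap of weight loading with computation can be sketched with a single background I/O thread standing in for a CUDA copy stream. The `run_pipeline`, `load`, and `compute` names are hypothetical; the point is only the prefetch pattern: while layer k computes, layer k+1's weights are already in flight.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(layers, load, compute):
    """Overlap loading of layer k+1 with computation of layer k.
    A worker thread stands in for an async CUDA load stream (sketch)."""
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load, layers[0])   # prefetch the first layer's weights
        outputs = []
        for k in range(len(layers)):
            weights = pending.result()         # wait for this layer's weights
            if k + 1 < len(layers):
                pending = io.submit(load, layers[k + 1])  # prefetch the next layer
            outputs.append(compute(layers[k], weights))
    return outputs
```

In the real system the "load" step pulls weights from CPU or disk into GPU memory on a dedicated stream, so compute on the default stream is not blocked by I/O.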