Principle: FMInference/FlexLLMGen Offloaded Model Loading
Metadata
| Field | Value |
|---|---|
| Paper | FlexGen |
| Repo | FlexLLMGen |
Domains
- Inference_Optimization
- Model_Loading
Overview
A model initialization technique that distributes Transformer layer weights across GPU, CPU, and disk according to a policy, enabling loading of models far larger than available GPU memory.
Description
Traditional model loading places all weights on the GPU. Offloaded model loading instead assigns each weight tensor to a device tier (GPU, CPU, or disk) based on the Policy's percentage allocations. Weights are distributed by cumulative size: the first w_gpu_percent of weight bytes go to the GPU, the next w_cpu_percent to the CPU, and the remainder to disk. The model builds a layer pipeline of InputEmbed, TransformerLayer (or separate SelfAttention+MLP), and OutputEmbed modules. CUDA streams are allocated for asynchronous weight and cache loading.
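The cumulative-size distribution can be sketched as follows. This is a minimal illustration, not the FlexLLMGen implementation: the helper name `assign_devices` and the choice to classify each tensor by its byte midpoint are assumptions.

```python
def assign_devices(tensor_sizes, w_gpu_percent, w_cpu_percent):
    """Assign each weight tensor to "gpu", "cpu", or "disk" by its
    cumulative byte position (hypothetical sketch of the policy split)."""
    total = sum(tensor_sizes)
    gpu_cutoff = total * w_gpu_percent / 100
    cpu_cutoff = total * (w_gpu_percent + w_cpu_percent) / 100
    devices, cum = [], 0
    for size in tensor_sizes:
        mid = cum + size / 2  # classify by the tensor's midpoint (an assumption)
        if mid <= gpu_cutoff:
            devices.append("gpu")
        elif mid <= cpu_cutoff:
            devices.append("cpu")
        else:
            devices.append("disk")
        cum += size
    return devices
```

For ten equal-sized tensors with a 50/30/20 policy, the first five land on the GPU, the next three on the CPU, and the last two on disk.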
Usage
Use OptLM to load models that exceed GPU memory. The Policy's percentage parameters control how weights are distributed across the memory tiers. With sep_layer=True, attention and MLP become separate layers, enabling finer-grained pipelining.
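The effect of sep_layer on the layer pipeline can be illustrated with a toy sketch. The `Policy` dataclass and `build_pipeline` helper here are hypothetical stand-ins (the real FlexLLMGen Policy carries many more fields); the module names follow the Description above.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    """Minimal hypothetical policy; only the fields used in this sketch."""
    w_gpu_percent: float
    w_cpu_percent: float
    sep_layer: bool = True

def build_pipeline(num_layers, policy):
    """Build the layer pipeline: InputEmbed, then per-layer modules, then OutputEmbed."""
    layers = ["InputEmbed"]
    for _ in range(num_layers):
        if policy.sep_layer:
            # Separate attention and MLP for finer-grained pipelining.
            layers += ["SelfAttention", "MLP"]
        else:
            layers.append("TransformerLayer")
    layers.append("OutputEmbed")
    return layers
```

With sep_layer=True a 2-layer model yields six pipeline stages instead of four, so weight prefetch and compute can be overlapped at half-layer granularity.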
Theoretical Basis
Layer-wise weight distribution follows a cumulative percentage model. For N total weight bytes and percentages (p_gpu, p_cpu, p_disk), each weight tensor is assigned to the device tier whose cumulative byte range contains it. This yields a roughly proportional split regardless of individual tensor sizes. CUDA streams enable asynchronous I/O that overlaps with computation.
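The overlap of weight loading with computation can be sketched with a single background I/O thread standing in for a CUDA copy stream. The `run_pipeline`, `load`, and `compute` names are hypothetical; the point is only the prefetch pattern: while layer k computes, layer k+1's weights are already in flight.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(layers, load, compute):
    """Overlap loading of layer k+1 with computation of layer k.
    A worker thread stands in for an async CUDA load stream (sketch)."""
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load, layers[0])   # prefetch the first layer's weights
        outputs = []
        for k in range(len(layers)):
            weights = pending.result()         # wait for this layer's weights
            if k + 1 < len(layers):
                pending = io.submit(load, layers[k + 1])  # prefetch the next layer
            outputs.append(compute(layers[k], weights))
    return outputs
```

In the real system the "load" step pulls weights from CPU or disk into GPU memory on a dedicated stream, so compute on the default stream is not blocked by I/O.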