
Principle:FMInference FlexLLMGen Offloaded Model Loading

From Leeroopedia


Metadata

  • Paper: FlexGen
  • Repo: FlexLLMGen

Domains

  • Inference_Optimization
  • Model_Loading

Overview

A model initialization technique that distributes Transformer layer weights across GPU, CPU, and disk according to a policy, enabling loading of models far larger than available GPU memory.

Description

Traditional model loading places all weights on GPU. Offloaded model loading instead assigns each weight tensor to a device tier (GPU, CPU, or disk) based on the Policy's percentage allocations. Weights are distributed by cumulative size: the first w_gpu_percent of weight bytes go to GPU, the next w_cpu_percent to CPU, and the remainder to disk. The model builds a layer pipeline of InputEmbed, TransformerLayer (or separate SelfAttention+MLP), and OutputEmbed modules. CUDA streams are allocated for async weight/cache loading.

Usage

Use OptLM to load models that exceed GPU memory. The Policy's percentage parameters control how weight memory is distributed across the device tiers. With sep_layer=True, attention and MLP become separate pipeline layers, enabling finer-grained pipelining.
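A minimal sketch of the policy side of this usage is below. Note this is an assumption-laden simplification: FlexLLMGen's real Policy carries many more fields (cache and activation percentages, batch sizes, compression options), and only the weight-placement fraction and the sep_layer flag are modeled here.

```python
from dataclasses import dataclass

@dataclass
class Policy:
    # Percentages of total weight bytes pinned to each tier; disk gets the rest.
    w_gpu_percent: float
    w_cpu_percent: float
    # When True, attention and MLP are built as separate pipeline layers.
    sep_layer: bool = True

    @property
    def w_disk_percent(self) -> float:
        return 100 - self.w_gpu_percent - self.w_cpu_percent

# Keep 20% of weights on GPU, 50% on CPU, and spill the remaining 30% to disk.
policy = Policy(w_gpu_percent=20, w_cpu_percent=50)
print(policy.w_disk_percent)  # → 30
```

Choosing the percentages is a trade-off: more weight on GPU reduces transfer time per token, while pushing weight to CPU or disk frees GPU memory for larger models or batches.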

Theoretical Basis

Layer-wise weight distribution follows a cumulative percentage model. For N total weight bytes and percentages (p_gpu, p_cpu, p_disk), each weight tensor is assigned to the device whose cumulative range it falls within. This ensures a roughly proportional split regardless of individual tensor sizes. CUDA streams enable async I/O overlap with computation.
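The assignment rule can be written explicitly. With N total weight bytes, percentages expressed as fractions (p_gpu, p_cpu, p_disk), and C_i denoting the cumulative bytes of all tensors preceding tensor i:

```latex
\text{device}(i) =
\begin{cases}
\text{GPU}, & C_i < p_{\text{gpu}}\, N \\
\text{CPU}, & p_{\text{gpu}}\, N \le C_i < (p_{\text{gpu}} + p_{\text{cpu}})\, N \\
\text{disk}, & \text{otherwise}
\end{cases}
```

Since each tensor is assigned by its start offset C_i, the realized split can deviate from the requested percentages by at most one tensor's size per boundary, which is negligible when tensors are small relative to N.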

Related Pages

  • Principle
  • Implementation
  • Heuristic
  • Environment