Principle:FMInference FlexLLMGen Offloaded Text Generation
Metadata
| Field | Value |
|---|---|
| Paper | FlexGen |
| Repo | FlexLLMGen |
Domains
- Inference_Optimization
- Text_Generation
Overview
A text generation strategy that produces tokens autoregressively while managing tensor transfers across GPU, CPU, and disk using block-scheduled I/O-compute overlap.
Description
Text generation with offloaded models requires careful scheduling of weight loading, cache read/write, and computation across the three-tier memory hierarchy. FlexLLMGen implements three generation strategies:
- Normal - sequential execution with no I/O-compute overlap; slowest, but useful for debugging
- Overlap single batch - overlaps I/O with compute within a single GPU batch
- Overlap multi-batch - pipeline-style overlap across multiple GPU batches, achieving the highest throughput
The generation loop iterates over (generation_step, layer, gpu_batch) and coordinates weight prefetching, cache management, and attention mask updates.
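The loop order above can be sketched as follows. This is an illustrative schedule generator, not the FlexLLMGen implementation: event names and the one-layer-ahead prefetch placement are assumptions made for clarity.

```python
# Sketch of the (generation_step, layer, gpu_batch) iteration space with
# one-layer-ahead weight prefetch. Event names are illustrative only.

def generate_schedule(num_steps, num_layers, num_gpu_batches):
    """Return the order in which a block-scheduled loop would issue
    prefetch and compute events for each (step, layer, batch) triple."""
    events = []
    for i in range(num_steps):                # token generation step
        for j in range(num_layers):           # transformer layer
            if j + 1 < num_layers:
                # prefetch weights for layer j+1 while layer j computes
                events.append(("prefetch_weights", i, j + 1))
            for k in range(num_gpu_batches):  # GPU micro-batch
                events.append(("compute", i, j, k))
    return events
```

In the real system the prefetch is issued asynchronously (e.g. on a separate CUDA stream or I/O thread) so it runs concurrently with the compute events that follow it.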
Usage
Call OptLM.generate() after the model has been loaded. Select the generation strategy via Policy.overlap and Policy.num_gpu_batches. For maximum throughput, set overlap=True and use multiple GPU batches (num_gpu_batches > 1).
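The mapping from the two policy knobs to the three strategies can be sketched with a mock. Note this Policy is a simplified stand-in: the real FlexLLMGen Policy carries many more fields (tensor placement percentages, compression settings, etc.), and select_strategy is a hypothetical helper, not part of the library.

```python
# Mock of how Policy.overlap and Policy.num_gpu_batches select among the
# three generation strategies. Illustrative only; not the real Policy class.
from dataclasses import dataclass

@dataclass
class Policy:
    overlap: bool = True
    num_gpu_batches: int = 1

def select_strategy(policy: Policy) -> str:
    if not policy.overlap:
        return "normal"                   # sequential, for debugging
    if policy.num_gpu_batches == 1:
        return "overlap_single_batch"
    return "overlap_multi_batch"          # highest throughput
```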
Theoretical Basis
The block schedule treats generation as a 3D iteration space: (token_step i, layer j, gpu_batch k). By prefetching weights for layer j+1 while computing layer j, and doing the same for the KV cache, the system hides I/O latency behind computation. Multi-batch mode pipelines further by keeping different GPU batches at different stages of the load/compute/store pipeline, so weight loads are amortized across batches.
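A back-of-the-envelope cost model shows why the overlap helps. Assume each layer takes `load` time units of I/O and `compute` units of GPU work (uniform costs are an assumption for illustration): sequential execution pays load + compute per layer, while overlapped execution pays only max(load, compute) per layer once the first load has completed.

```python
# Simple latency model for one token step over num_layers layers.
# Uniform per-layer load/compute costs are an illustrative assumption.

def total_time(num_layers, load, compute, overlap):
    if not overlap:
        # Normal mode: load and compute strictly alternate
        return num_layers * (load + compute)
    # Overlap mode: the first load cannot be hidden; afterwards the
    # load of layer j+1 runs concurrently with the compute of layer j,
    # so each stage costs max(load, compute)
    return load + num_layers * max(load, compute)
```

When compute >= load, the overlapped time approaches pure compute time, i.e. the I/O is fully hidden; when load dominates, throughput is bounded by bandwidth, which is why FlexLLMGen's placement policy tries to keep hot tensors in faster tiers.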