
Principle:FMInference FlexLLMGen Offloaded Text Generation

From Leeroopedia


Metadata

  • Paper: FlexGen
  • Repo: FlexLLMGen

Domains

  • Inference_Optimization
  • Text_Generation

Overview

A text generation strategy that produces tokens autoregressively while managing tensor transfers across GPU, CPU, and disk using block-scheduled I/O-compute overlap.

Description

Text generation with offloaded models requires careful scheduling of weight loading, cache read/write, and computation across the three-tier memory hierarchy. FlexLLMGen implements three generation strategies:

  • Normal - sequential execution without overlap, useful for debugging
  • Overlap single batch - overlaps I/O with compute for one batch
  • Overlap multi-batch - pipeline-style overlap across multiple GPU batches, achieving highest throughput

The generation loop iterates over (generation_step, layer, gpu_batch) and coordinates weight prefetching, cache management, and attention mask updates.
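The loop order described above can be sketched as a small scheduler. This is an illustrative model, not FlexLLMGen's actual code: it walks the (generation_step, layer, gpu_batch) iteration space and issues a weight prefetch for the next layer before computing the current one, which is the hook that lets I/O overlap with compute.

```python
# Minimal sketch of the block-scheduled generation loop (illustrative,
# not FlexLLMGen's actual implementation). Events are yielded in the
# order a scheduler would visit them.

def block_schedule(num_steps, num_layers, num_gpu_batches):
    """Yield scheduler events for the 3D loop (step i, layer j, batch k)."""
    for i in range(num_steps):                 # token generation step
        for j in range(num_layers):            # transformer layer
            if j + 1 < num_layers:
                # Issue next layer's weight load early so it can
                # overlap with this layer's compute.
                yield ("prefetch_weights", i, j + 1)
            for k in range(num_gpu_batches):   # GPU micro-batch
                yield ("load_cache", i, j, k)
                yield ("compute", i, j, k)
                yield ("store_cache", i, j, k)

events = list(block_schedule(num_steps=2, num_layers=2, num_gpu_batches=2))
# The very first event is the prefetch of layer 1's weights, issued
# before any compute on layer 0 begins.
```

Note that the prefetch for layer j+1 is emitted before the compute events for layer j, so a runtime that executes loads asynchronously can hide them behind the current layer's work.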

Usage

Call OptLM.generate() after the model is loaded. Select the generation strategy via Policy.overlap and Policy.num_gpu_batches. For the highest throughput, set overlap=True with more than one GPU batch.
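A hedged sketch of how the two policy fields select among the three strategies. The `Policy` class below is a simplified stand-in: FlexLLMGen's real policy object carries many more fields (batch sizes, weight/cache placement percentages), and `select_generation_mode` is a hypothetical helper for illustration; only `overlap` and `num_gpu_batches` are modeled, matching the modes listed above.

```python
# Simplified stand-in for FlexLLMGen's policy object (illustration only;
# the real Policy has many more fields).
from dataclasses import dataclass

@dataclass
class Policy:
    overlap: bool = True
    num_gpu_batches: int = 1

def select_generation_mode(policy: Policy) -> str:
    """Map a policy to one of the three generation strategies."""
    if not policy.overlap:
        return "normal"              # sequential, easiest to debug
    if policy.num_gpu_batches == 1:
        return "overlap_single_batch"
    return "overlap_multi_batch"     # highest throughput

# For throughput: overlap enabled with several GPU batches.
mode = select_generation_mode(Policy(overlap=True, num_gpu_batches=4))
```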

Theoretical Basis

The block schedule treats generation as a 3D iteration space: (generation_step i, layer j, gpu_batch k). By prefetching the weights for layer j+1 while computing layer j, and doing the same for the KV cache, the system hides I/O latency behind computation. Multi-batch mode pipelines further by keeping different GPU batches at different stages of the schedule simultaneously.
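A back-of-the-envelope cost model makes the latency hiding concrete (an illustration under simplifying assumptions, not FlexLLMGen code): with sequential execution, every layer pays its load time plus its compute time; with overlap, layer j+1's load runs during layer j's compute, so after the first exposed load each layer costs only the maximum of the two.

```python
# Analytic model of I/O-compute overlap across layers (illustrative).
# `load` and `compute` are per-layer times in arbitrary units.

def sequential_time(load: float, compute: float, num_layers: int) -> float:
    """No overlap: each layer pays load + compute in full."""
    return num_layers * (load + compute)

def overlapped_time(load: float, compute: float, num_layers: int) -> float:
    """Pipelined: only the first load is exposed; each later load
    hides behind the previous layer's compute (and vice versa)."""
    return load + (num_layers - 1) * max(load, compute) + compute

# Example: 4 units to load a layer's weights, 6 units to compute it.
seq = sequential_time(4.0, 6.0, 10)   # 10 * (4 + 6) = 100.0
ovl = overlapped_time(4.0, 6.0, 10)   # 4 + 9 * 6 + 6 = 64.0
```

When load time is at most compute time, the loads are hidden entirely and total time approaches pure compute time plus one exposed load; when loads dominate, the schedule is I/O-bound and throughput is set by the slower of the two.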

Related Pages

  • Principle
  • Implementation
  • Heuristic
  • Environment