Principle:FMInference FlexLLMGen Offloaded Text Generation
Metadata
| Field | Value |
|---|---|
| Paper | FlexGen |
| Repo | FlexLLMGen |
Domains
- Inference_Optimization
- Text_Generation
Overview
A text generation strategy that produces tokens autoregressively while managing tensor transfers across GPU, CPU, and disk using block-scheduled I/O-compute overlap.
Description
Text generation with offloaded models requires careful scheduling of weight loading, cache read/write, and computation across the three-tier memory hierarchy. FlexLLMGen implements three generation strategies:
- Normal - sequential execution with no I/O-compute overlap; slowest, but useful for debugging
- Overlap single batch - overlaps I/O with compute within a single GPU batch
- Overlap multi-batch - pipeline-style overlap across multiple GPU batches, achieving the highest throughput
The generation loop iterates over (generation_step, layer, gpu_batch) and coordinates weight prefetching, cache management, and attention mask updates.
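The loop order above can be sketched as follows. This is an illustrative schedule generator, not the FlexLLMGen implementation: event names and the one-layer-ahead prefetch placement are assumptions made for clarity.

```python
# Sketch of the (generation_step, layer, gpu_batch) iteration space with
# one-layer-ahead weight prefetch. Event names are illustrative only.

def generate_schedule(num_steps, num_layers, num_gpu_batches):
    """Return the order in which a block-scheduled loop would issue
    prefetch and compute events for each (step, layer, batch) triple."""
    events = []
    for i in range(num_steps):                # token generation step
        for j in range(num_layers):           # transformer layer
            if j + 1 < num_layers:
                # prefetch weights for layer j+1 while layer j computes
                events.append(("prefetch_weights", i, j + 1))
            for k in range(num_gpu_batches):  # GPU micro-batch
                events.append(("compute", i, j, k))
    return events
```

In the real system the prefetch is issued asynchronously (e.g. on a separate CUDA stream or I/O thread) so it runs concurrently with the compute events that follow it.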
Usage
Call OptLM.generate() after the model has been loaded. Select the generation strategy via Policy.overlap and Policy.num_gpu_batches. For maximum throughput, set overlap=True and use multiple GPU batches (num_gpu_batches > 1).
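The mapping from the two policy knobs to the three strategies can be sketched with a mock. Note this Policy is a simplified stand-in: the real FlexLLMGen Policy carries many more fields (tensor placement percentages, compression settings, etc.), and select_strategy is a hypothetical helper, not part of the library.

```python
# Mock of how Policy.overlap and Policy.num_gpu_batches select among the
# three generation strategies. Illustrative only; not the real Policy class.
from dataclasses import dataclass

@dataclass
class Policy:
    overlap: bool = True
    num_gpu_batches: int = 1

def select_strategy(policy: Policy) -> str:
    if not policy.overlap:
        return "normal"                   # sequential, for debugging
    if policy.num_gpu_batches == 1:
        return "overlap_single_batch"
    return "overlap_multi_batch"          # highest throughput
```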
Theoretical Basis
The block schedule treats generation as a 3D iteration space: (token_step i, layer j, gpu_batch k). By prefetching weights for layer j+1 while computing layer j, and doing the same for the KV cache, the system hides I/O latency behind computation. Multi-batch mode pipelines further by keeping different GPU batches at different stages of the load/compute/store pipeline, so weight loads are amortized across batches.
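A back-of-the-envelope cost model shows why the overlap helps. Assume each layer takes `load` time units of I/O and `compute` units of GPU work (uniform costs are an assumption for illustration): sequential execution pays load + compute per layer, while overlapped execution pays only max(load, compute) per layer once the first load has completed.

```python
# Simple latency model for one token step over num_layers layers.
# Uniform per-layer load/compute costs are an illustrative assumption.

def total_time(num_layers, load, compute, overlap):
    if not overlap:
        # Normal mode: load and compute strictly alternate
        return num_layers * (load + compute)
    # Overlap mode: the first load cannot be hidden; afterwards the
    # load of layer j+1 runs concurrently with the compute of layer j,
    # so each stage costs max(load, compute)
    return load + num_layers * max(load, compute)
```

When compute >= load, the overlapped time approaches pure compute time, i.e. the I/O is fully hidden; when load dominates, throughput is bounded by bandwidth, which is why FlexLLMGen's placement policy tries to keep hot tensors in faster tiers.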