Principle:FMInference FlexLLMGen Model Replacement Policy
| Field | Value |
|---|---|
| Sources | Paper: FlexGen, Upstream: DeepSpeed |
| Domains | Inference_Optimization, Architecture_Abstraction |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A strategy pattern that abstracts architecture-specific parameter extraction behind a uniform interface, enabling a single replacement engine to optimize many different transformer architectures.
Description
Model replacement policies address the diversity of transformer implementations. Although transformer models share the same conceptual structure (attention, MLP, layer norm), each implementation stores its parameters differently: some keep separate query, key, and value projections while others fuse them into a single matrix, and module names, bias layouts, and QKV concatenation order vary across BERT, GPT-2, OPT, BLOOM, and other architectures.
The policy pattern provides a clean separation of concerns:
- The replacement engine knows how to construct optimized inference kernels and how to copy weights with tensor slicing for model parallelism, but does not know the internal structure of any specific model architecture.
- The policy class knows exactly where each parameter lives within a specific architecture's module hierarchy, but does not know how to construct the optimized replacement.
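As a rough illustration of the engine's side of this split, the sketch below shows what slicing a weight for model parallelism can look like. It uses plain Python lists in place of tensors, and the helper name `slice_rows` and its even-split rule are assumptions made for illustration, not DeepSpeed's actual API.

```python
def slice_rows(weight, rank, mp_size):
    """Return the contiguous block of rows owned by model-parallel rank `rank`.

    `weight` is a list of rows standing in for a 2-D tensor. This hypothetical
    helper assumes the row count divides evenly by `mp_size`.
    """
    rows = len(weight)
    if rows % mp_size != 0:
        raise ValueError("row count must be divisible by mp_size")
    shard = rows // mp_size
    return weight[rank * shard:(rank + 1) * shard]


# A 4x2 "weight matrix" split across two ranks.
weight = [[1, 2], [3, 4], [5, 6], [7, 8]]
shards = [slice_rows(weight, rank, mp_size=2) for rank in range(2)]
```

Because a routine like this only needs the weight itself, the engine can apply it to anything a policy returns without knowing which module the weight came from.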
Each policy class implements a contract with four key methods:
- attention() -- Extracts and returns QKV weights/biases and the dense output projection, handling architecture-specific QKV concatenation order.
- mlp() -- Extracts intermediate (up-projection) and output (down-projection) weights/biases.
- layerNorm() -- Extracts layer normalization gamma/beta parameters for both pre-attention and post-attention positions.
- get_hidden_heads() -- Returns the hidden dimension size and number of attention heads.
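The contract above can be sketched as an abstract base class plus one concrete policy. This is a minimal sketch, not the upstream code: `LayerPolicy`, `ToyLayer`, and `ToyLayerPolicy` are hypothetical names, weights are plain lists rather than tensors, and the tuple layouts returned by each method are assumptions chosen for the example.

```python
from abc import ABC, abstractmethod


class LayerPolicy(ABC):
    """Hypothetical base class for the four-method contract."""

    def __init__(self, client_module):
        self.client_module = client_module  # the original layer to read from

    @abstractmethod
    def get_hidden_heads(self): ...

    @abstractmethod
    def attention(self): ...

    @abstractmethod
    def mlp(self): ...

    @abstractmethod
    def layerNorm(self): ...


class ToyLayer:
    """Made-up layer that keeps Q, K, V as separate matrices (lists of rows)."""

    def __init__(self):
        self.hidden_size, self.num_heads = 2, 1
        self.q, self.k, self.v = [[1, 0], [0, 1]], [[2, 0], [0, 2]], [[3, 0], [0, 3]]
        self.q_b, self.k_b, self.v_b = [0, 0], [1, 1], [2, 2]
        self.out, self.out_b = [[4, 0], [0, 4]], [0, 0]
        self.up, self.up_b = [[5, 5] for _ in range(4)], [0, 0, 0, 0]
        self.down, self.down_b = [[6, 6, 6, 6] for _ in range(2)], [0, 0]
        self.ln1 = ([1, 1], [0, 0])  # (gamma, beta) before attention
        self.ln2 = ([1, 1], [0, 0])  # (gamma, beta) after attention


class ToyLayerPolicy(LayerPolicy):
    _orig_layer_class = ToyLayer

    def get_hidden_heads(self):
        m = self.client_module
        return m.hidden_size, m.num_heads

    def attention(self):
        m = self.client_module
        # Architecture-specific step: stack Q, K, V rows in that order.
        qkv_w = m.q + m.k + m.v
        qkv_b = m.q_b + m.k_b + m.v_b
        return qkv_w, qkv_b, m.out, m.out_b

    def mlp(self):
        m = self.client_module
        return m.up, m.up_b, m.down, m.down_b

    def layerNorm(self):
        m = self.client_module
        # Assumed order: pre-attention (gamma, beta), then post-attention.
        return (*m.ln1, *m.ln2)


policy = ToyLayerPolicy(ToyLayer())
qkv_w, qkv_b, dense_w, dense_b = policy.attention()
```

Only `ToyLayerPolicy` knows that this layer stores Q, K, and V separately; a caller sees a single fused QKV weight and bias regardless of the source layout.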
The _orig_layer_class attribute enables automatic matching: the replacement engine inspects each module in the model and finds the policy whose _orig_layer_class matches the module's type.
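The matching step can be sketched as follows. All class and function names here (`ToyBertLayer`, `ToyGPTBlock`, `POLICIES`, `find_policy`, `collect_policies`) are hypothetical, the model is reduced to a flat list of modules, and the use of `isinstance` rather than an exact type comparison is an assumption of this sketch.

```python
class ToyBertLayer:
    """Stand-in for one architecture's transformer layer."""


class ToyGPTBlock:
    """Stand-in for another architecture's transformer layer."""


class ToyBertPolicy:
    _orig_layer_class = ToyBertLayer

    def __init__(self, client_module):
        self.client_module = client_module


class ToyGPTPolicy:
    _orig_layer_class = ToyGPTBlock

    def __init__(self, client_module):
        self.client_module = client_module


# Hypothetical registry; supporting a new architecture means appending here.
POLICIES = [ToyBertPolicy, ToyGPTPolicy]


def find_policy(module, policies=POLICIES):
    """Return a policy instance for `module`, or None if no policy matches."""
    for policy_cls in policies:
        if isinstance(module, policy_cls._orig_layer_class):
            return policy_cls(module)
    return None


def collect_policies(modules):
    """Pair each supported module with its policy; skip unsupported ones."""
    matched = []
    for module in modules:
        policy = find_policy(module)
        if policy is not None:
            matched.append((module, policy))
    return matched


# A "model" with two supported layers and one unrelated module.
model = [ToyBertLayer(), object(), ToyGPTBlock()]
matched = collect_policies(model)
```

Modules with no matching policy are left untouched, so the same pass can run over a model that mixes supported layers with embeddings, heads, or other components.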
Usage
This pattern is useful whenever a system needs to support multiple model architectures with a single optimization pass. It is the standard approach in DeepSpeed Inference and is used in FlexLLMGen's benchmark suite for baseline comparison.
Theoretical Basis
The policy pattern is an instance of the strategy pattern, a behavioral design pattern that lets an algorithm be selected at runtime. In this context, each "algorithm" is a set of parameter extraction rules specific to one model architecture. Supporting a new architecture therefore takes a small, fixed amount of work (implementing one policy class) and requires no changes to the replacement engine.