Implementation:FMInference FlexLLMGen DeepSpeed Replace Policy
| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen, Upstream: DeepSpeed |
| Domains | Inference_Optimization, Architecture_Abstraction |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Vendored DeepSpeed module that defines architecture-specific replacement policies for extracting transformer layer parameters from different model families.
Description
The replace_policy.py file (748 lines) is a vendored copy of DeepSpeed's policy-based module replacement system. It defines an abstract base class hierarchy and concrete policy implementations for many HuggingFace transformer architectures.
Key components include:
- DSPolicy (abstract base) -- Declares the interface all replacement policies must implement, including an attention() method and a _orig_layer_class attribute for matching.
- TransformerPolicy -- Extends DSPolicy with configuration for inference mode, linear layer usage, attention scaling, Megatron v2 support, MLP activation function type, pre-attention layer norm, checkpoint prefix usage, and QKV split format.
- UNetPolicy / VAEPolicy -- Non-transformer policies for diffusion model components (Stable Diffusion UNet and VAE), implementing match() and apply() methods.
Concrete transformer policies include:
- HFBertLayerPolicy -- BERT and RoBERTa layers
- HFGPT2LayerPolicy -- GPT-2 layers
- HFGPTJLayerPolicy -- GPT-J layers
- HFGPTNEOLayerPolicy -- GPT-Neo layers
- BLOOMLayerPolicy -- BLOOM layers
- HFOPTLayerPolicy -- OPT layers
- MegatronLayerPolicy -- Megatron-LM layers
Each concrete transformer policy implements four key methods: attention() (returns QKV and dense weights/biases), mlp() (returns intermediate and output weights/biases), layerNorm() (returns pre/post layer norm parameters), and get_hidden_heads() (returns hidden size and number of attention heads). The diffusion policies (UNetPolicy, VAEPolicy) implement match() and apply() instead.
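As a rough illustration of this four-method interface, the sketch below defines a toy policy for a made-up layer type. ToyLayer, ToyLayerPolicy, and the parameter names are hypothetical, and plain strings stand in for tensors. The order of the returned values follows the description above and is not guaranteed to match the exact tuples returned by the vendored file.

```python
class ToyLayer:
    """Hypothetical stand-in for a transformer layer (strings replace tensors)."""

    def __init__(self, hidden_size=8, num_heads=2):
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        # Placeholder "parameters"; a real layer would hold torch tensors.
        names = (
            "qkv_w", "qkv_b", "dense_w", "dense_b",
            "inter_w", "inter_b", "out_w", "out_b",
            "pre_ln_gamma", "pre_ln_beta", "post_ln_gamma", "post_ln_beta",
        )
        self.params = {name: f"<{name}>" for name in names}


class ToyLayerPolicy:
    """Illustrative policy exposing the four accessors (not DeepSpeed code)."""

    _orig_layer_class = ToyLayer  # layer class this policy is meant to match

    def __init__(self, client_module):
        self.client_module = client_module

    def get_hidden_heads(self):
        m = self.client_module
        return m.hidden_size, m.num_heads

    def attention(self):
        p = self.client_module.params
        return p["qkv_w"], p["qkv_b"], p["dense_w"], p["dense_b"]

    def mlp(self):
        p = self.client_module.params
        return p["inter_w"], p["inter_b"], p["out_w"], p["out_b"]

    def layerNorm(self):
        p = self.client_module.params
        return (p["pre_ln_gamma"], p["pre_ln_beta"],
                p["post_ln_gamma"], p["post_ln_beta"])


policy = ToyLayerPolicy(ToyLayer())
hidden_size, num_heads = policy.get_hidden_heads()
```

Keeping the accessors separate means the code that consumes a policy never needs to know how a given architecture names its attributes; only the policy does.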
Usage
Policies are registered in the replace_policies and generic_policies lists and are automatically matched against model layers during the replace_module process. They are not typically instantiated directly by users.
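The matching step can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the vendored replace_module logic: LayerA, LayerB, PolicyA, PolicyB, toy_replace_policies, and find_policy are all hypothetical names, and the only idea taken from the description above is that each policy class advertises the layer class it handles through _orig_layer_class.

```python
class LayerA:
    """Hypothetical layer type."""


class LayerB:
    """Another hypothetical layer type."""


class BasePolicy:
    _orig_layer_class = None  # subclasses set this to the layer class they handle

    def __init__(self, client_module):
        self.client_module = client_module


class PolicyA(BasePolicy):
    _orig_layer_class = LayerA


class PolicyB(BasePolicy):
    _orig_layer_class = LayerB


# Toy registry, playing the role the replace_policies list plays in the text.
toy_replace_policies = [PolicyA, PolicyB]


def find_policy(layer, policies):
    """Return an instance of the first policy whose layer class matches, else None."""
    for policy_cls in policies:
        orig = policy_cls._orig_layer_class
        if orig is not None and isinstance(layer, orig):
            return policy_cls(layer)
    return None


matched = find_policy(LayerB(), toy_replace_policies)
unmatched = find_policy(object(), toy_replace_policies)
```

Here matched is a PolicyB instance wrapping the layer, and unmatched is None because no registered policy claims a plain object.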
Code Reference
| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | benchmark/third_party/DeepSpeed/deepspeed/module_inject/replace_policy.py |
| Lines | 1-748 |
| Type | AUTO_KEEP (vendored dependency) |
Key class signatures:

    class DSPolicy(ABC):
        _orig_layer_class = None

        def attention(self):
            raise NotImplementedError

    class TransformerPolicy(DSPolicy):
        def __init__(self, inference=True, linear_layer=True,
                     scale_attention=True, megatron_v2=False,
                     mlp_act_func_type=ActivationFuncType.GELU,
                     pre_attn_norm=True, use_load_prefix=False,
                     split_qkv=True):
            ...

        def attention(self): ...
        def get_hidden_heads(self): ...
        def mlp(self): ...
        def layerNorm(self): ...
        def get_param_names(self): ...

I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| client_module | torch.nn.Module | Yes | The original transformer layer to extract parameters from |
| inference | bool | No | Whether to configure for inference mode (default: True) |
| linear_layer | bool | No | Use linear layer optimization (default: True) |
| scale_attention | bool | No | Scale attention scores by 1/sqrt(d), where d is the per-head dimension (default: True) |
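For reference, the standard scaled dot-product convention divides each query/key score by sqrt(d), with d the per-head dimension. The snippet below is a small worked example of that convention only. The attention_score helper is hypothetical, and how the scale_attention flag is consumed inside the DeepSpeed kernels is not shown on this page.

```python
import math


def attention_score(q, k, scale_attention=True):
    """Dot-product score for one query/key pair, optionally divided by sqrt(d)."""
    d = len(q)  # per-head dimension
    score = sum(qi * ki for qi, ki in zip(q, k))
    return score / math.sqrt(d) if scale_attention else score


q = [1.0, 2.0, 3.0, 4.0]
k = [1.0, 1.0, 1.0, 1.0]

scaled = attention_score(q, k)                           # 10 / sqrt(4) = 5.0
unscaled = attention_score(q, k, scale_attention=False)  # 10.0
```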
Outputs
| Output | Type | Description |
|---|---|---|
| attention params | tuple | QKV weight, QKV bias, dense weight, dense bias, and configuration flags |
| mlp params | tuple | Intermediate weight/bias, output weight/bias |
| layerNorm params | tuple | Gamma and beta parameters for layer normalization |
| hidden_heads | tuple | (hidden_size, num_attention_heads) |