Implementation:FMInference FlexLLMGen DeepSpeed Replace Policy

Field	Value
Sources	Repo: FlexLLMGen, Upstream: DeepSpeed
Domains	Inference_Optimization, Architecture_Abstraction
Last Updated	2026-02-09 00:00 GMT

Overview

Vendored DeepSpeed module that defines architecture-specific replacement policies for extracting transformer layer parameters from different model families.

Description

The replace_policy.py file (748 lines) is a vendored copy of DeepSpeed's policy-based module replacement system. It defines an abstract base class hierarchy and concrete policy implementations for many HuggingFace transformer architectures.

Key components include:

DSPolicy (abstract base) -- Declares the interface all replacement policies must implement, including an attention() method and a _orig_layer_class attribute for matching.
TransformerPolicy -- Extends DSPolicy with configuration for inference mode, linear layer usage, attention scaling, Megatron v2 support, MLP activation function type, pre-attention layer norm, checkpoint prefix usage, and QKV split format.
UNetPolicy / VAEPolicy -- Non-transformer policies for diffusion model components (Stable Diffusion UNet and VAE), implementing match() and apply() methods.

Concrete transformer policies include:

HFBertLayerPolicy -- BERT and RoBERTa layers
HFGPT2LayerPolicy -- GPT-2 layers
HFGPTJLayerPolicy -- GPT-J layers
HFGPTNEOLayerPolicy -- GPT-Neo layers
BLOOMLayerPolicy -- BLOOM layers
HFOPTLayerPolicy -- OPT layers
MegatronLayerPolicy -- Megatron-LM layers

Each concrete transformer policy implements four key methods: attention() (returns QKV and dense weights/biases), mlp() (returns intermediate and output weights/biases), layerNorm() (returns pre/post layer norm parameters), and get_hidden_heads() (returns hidden size and number of attention heads). The diffusion policies (UNetPolicy / VAEPolicy) use match() and apply() instead.
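A minimal sketch of the four-method interface can make the contract easier to picture. ToyLayer and ToyLayerPolicy below are hypothetical names, not classes from replace_policy.py; plain strings stand in for weight tensors so the example runs without PyTorch, and the ordering of the returned tuples is an assumption for illustration.

```python
class ToyLayer:
    """Hypothetical transformer layer; strings stand in for weight tensors."""

    def __init__(self, hidden_size=8, num_heads=2):
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        self.qkv_w, self.qkv_b = "qkv.weight", "qkv.bias"
        self.out_w, self.out_b = "attn_out.weight", "attn_out.bias"
        self.fc1_w, self.fc1_b = "fc1.weight", "fc1.bias"
        self.fc2_w, self.fc2_b = "fc2.weight", "fc2.bias"
        self.ln1_w, self.ln1_b = "ln1.weight", "ln1.bias"
        self.ln2_w, self.ln2_b = "ln2.weight", "ln2.bias"


class ToyLayerPolicy:
    """Illustrative policy exposing the four methods described above."""

    # Class used to match this policy against layers of a model.
    _orig_layer_class = ToyLayer

    def __init__(self, client_module, scale_attention=True):
        self.client_module = client_module
        self.scale_attention = scale_attention

    def get_hidden_heads(self):
        m = self.client_module
        return m.hidden_size, m.num_heads

    def attention(self):
        # QKV weight/bias, dense (output projection) weight/bias, plus a flag.
        m = self.client_module
        return m.qkv_w, m.qkv_b, m.out_w, m.out_b, self.scale_attention

    def mlp(self):
        # Intermediate weight/bias followed by output weight/bias.
        m = self.client_module
        return m.fc1_w, m.fc1_b, m.fc2_w, m.fc2_b

    def layerNorm(self):
        # Two layer norms, each as (weight, bias); ordering is an assumption.
        m = self.client_module
        return m.ln1_w, m.ln1_b, m.ln2_w, m.ln2_b


policy = ToyLayerPolicy(ToyLayer())
hidden_size, num_heads = policy.get_hidden_heads()
qkv_w, qkv_b, dense_w, dense_b, scale = policy.attention()
```

In a real policy the returned values would be torch.nn.Parameter objects taken from the wrapped layer.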

Usage

Policies are registered in the replace_policies and generic_policies lists and are automatically matched against model layers during the replace_module process. They are not typically instantiated directly by users.
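The matching step can be sketched roughly as follows. This is an illustration of the idea, not the vendored replace_module code: find_policy, LayerA, LayerB, PolicyA, and PolicyB are hypothetical, and only the list name replace_policies and the _orig_layer_class attribute come from the description above.

```python
class LayerA:
    """Hypothetical layer type from one model family."""


class LayerB:
    """Hypothetical layer type from another model family."""


class PolicyA:
    _orig_layer_class = LayerA

    def __init__(self, client_module):
        self.client_module = client_module


class PolicyB:
    _orig_layer_class = LayerB

    def __init__(self, client_module):
        self.client_module = client_module


# Registry of policy classes; the real list holds the vendored policies.
replace_policies = [PolicyA, PolicyB]


def find_policy(layer, policies):
    """Return a policy instance wrapping `layer`, or None if nothing matches."""
    for policy_cls in policies:
        orig = policy_cls._orig_layer_class
        # Policies without a known layer class cannot be matched by type.
        if orig is not None and isinstance(layer, orig):
            return policy_cls(layer)
    return None


matched = find_policy(LayerB(), replace_policies)
unmatched = find_policy(object(), replace_policies)
```

A type-based lookup like this is why users rarely construct policies by hand: the layer's class alone selects the policy.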

Code Reference

Field	Value
Repository	FlexLLMGen
File	benchmark/third_party/DeepSpeed/deepspeed/module_inject/replace_policy.py
Lines	1-748
Type	AUTO_KEEP (vendored dependency)

Key class signatures:

from abc import ABC

class DSPolicy(ABC):
    # Original layer class this policy is matched against
    _orig_layer_class = None
    def attention(self):
        # Concrete policies must override this
        raise NotImplementedError

class TransformerPolicy(DSPolicy):
    # ActivationFuncType is an enum defined elsewhere in DeepSpeed (import not shown)
    def __init__(self, inference=True, linear_layer=True,
                 scale_attention=True, megatron_v2=False,
                 mlp_act_func_type=ActivationFuncType.GELU,
                 pre_attn_norm=True, use_load_prefix=False,
                 split_qkv=True):
        ...
    def attention(self): ...
    def get_hidden_heads(self): ...
    def mlp(self): ...
    def layerNorm(self): ...
    def get_param_names(self): ...

I/O Contract

Inputs

Parameter	Type	Required	Description
client_module	torch.nn.Module	Yes	The original transformer layer to extract parameters from
inference	bool	No	Whether to configure for inference mode (default: True)
linear_layer	bool	No	Use linear layer optimization (default: True)
scale_attention	bool	No	Scale attention scores by 1/sqrt(d), where d is the per-head dimension (default: True)

Outputs

Output	Type	Description
attention params	tuple	QKV weight, QKV bias, dense weight, dense bias, and configuration flags
mlp params	tuple	Intermediate weight/bias, output weight/bias
layerNorm params	tuple	Gamma and beta parameters for layer normalization
hidden_heads	tuple	(hidden_size, num_attention_heads)
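As a worked example of how the hidden_heads output relates to the scale_attention flag, the sketch below derives the per-head dimension and the conventional scaled dot-product factor. attention_scale is a hypothetical helper; that the vendored kernels apply the factor in exactly this form is an assumption, not something this page confirms.

```python
import math


def attention_scale(hidden_heads, scale_attention=True):
    """Return the multiplier applied to attention scores (hypothetical helper)."""
    hidden_size, num_heads = hidden_heads
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    head_dim = hidden_size // num_heads
    # Standard scaled dot-product attention divides scores by sqrt(head_dim).
    return 1.0 / math.sqrt(head_dim) if scale_attention else 1.0


# hidden_size=768, num_heads=12 gives head_dim=64, so the factor is 1/8.
factor = attention_scale((768, 12))
no_scaling = attention_scale((768, 12), scale_attention=False)
```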
