Implementation:FMInference FlexLLMGen DeepSpeed Replace Policy

From Leeroopedia


| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen, Upstream: DeepSpeed |
| Domains | Inference_Optimization, Architecture_Abstraction |
| Last Updated | 2026-02-09 00:00 GMT |

Overview

Vendored DeepSpeed module that defines architecture-specific replacement policies for extracting transformer layer parameters from different model families.

Description

The replace_policy.py file (748 lines) is a vendored copy of DeepSpeed's policy-based module replacement system. It defines an abstract base class hierarchy and concrete policy implementations for several transformer architectures (mostly HuggingFace model families, plus Megatron-LM) and for two diffusion model components.

Key components include:

  • DSPolicy (abstract base) -- Declares the interface all replacement policies must implement, including an attention() method and a _orig_layer_class attribute for matching.
  • TransformerPolicy -- Extends DSPolicy with configuration for inference mode, linear layer usage, attention scaling, Megatron v2 support, MLP activation function type, pre-attention layer norm, checkpoint prefix usage, and QKV split format.
  • UNetPolicy / VAEPolicy -- Non-transformer policies for diffusion model components (Stable Diffusion UNet and VAE), implementing match() and apply() methods.
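
The match()/apply() pairing used by the non-transformer policies can be illustrated with a minimal sketch. Every name below (ToyUNet, ToyOptimizedUNet, ToyGenericPolicy, replace_if_matched) is hypothetical and only mirrors the two method names mentioned above; the vendored UNetPolicy and VAEPolicy may use different signatures and do considerably more work.

```python
class ToyUNet:
    """Hypothetical stand-in for a diffusion UNet module."""

    def forward(self, x):
        return x + 1


class ToyOptimizedUNet:
    """Hypothetical wrapper that a policy might substitute for the original."""

    def __init__(self, original):
        self.original = original

    def forward(self, x):
        # Delegates to the original module; a real wrapper would add optimizations.
        return self.original.forward(x)


class ToyGenericPolicy:
    """Hypothetical policy exposing a match()/apply() pair."""

    def match(self, module):
        # Decide whether this policy handles the given module.
        return isinstance(module, ToyUNet)

    def apply(self, module):
        # Build the replacement for a module that matched.
        return ToyOptimizedUNet(module)


def replace_if_matched(module, policies):
    """Return the first matching policy's replacement, or the module itself."""
    for policy in policies:
        if policy.match(module):
            return policy.apply(module)
    return module  # no policy matched; keep the original


unet = ToyUNet()
other = object()
replaced = replace_if_matched(unet, [ToyGenericPolicy()])
untouched = replace_if_matched(other, [ToyGenericPolicy()])
```

Separating the match decision from the replacement step lets the caller walk a list of policies without knowing anything about the modules each one targets.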

Concrete transformer policies include:

  • HFBertLayerPolicy -- BERT and RoBERTa layers
  • HFGPT2LayerPolicy -- GPT-2 layers
  • HFGPTJLayerPolicy -- GPT-J layers
  • HFGPTNEOLayerPolicy -- GPT-Neo layers
  • BLOOMLayerPolicy -- BLOOM layers
  • HFOPTLayerPolicy -- OPT layers
  • MegatronLayerPolicy -- Megatron-LM layers

Each concrete transformer policy implements four key methods: attention() (returns QKV and dense weights/biases), mlp() (returns intermediate and output weights/biases), layerNorm() (returns pre/post layer norm parameters), and get_hidden_heads() (returns hidden size and number of attention heads). The non-transformer policies (UNetPolicy, VAEPolicy) implement match() and apply() instead.
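
As an illustration of the four accessors, here is a minimal sketch of a policy wrapped around a toy layer. ToyLayer, ToyLayerPolicy, and all attribute names are hypothetical; short Python lists stand in for tensors, and the return order is an assumption. The sketch also omits the configuration flags that the real attention() returns alongside the parameters.

```python
class ToyLayer:
    """Hypothetical transformer layer; short lists stand in for tensors."""

    def __init__(self, hidden_size=4, num_heads=2):
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        ones = [1.0] * hidden_size
        zeros = [0.0] * hidden_size
        self.qkv_w, self.qkv_b = list(ones), list(zeros)
        self.dense_w, self.dense_b = list(ones), list(zeros)
        self.fc_in_w, self.fc_in_b = list(ones), list(zeros)
        self.fc_out_w, self.fc_out_b = list(ones), list(zeros)
        self.ln1_gamma, self.ln1_beta = list(ones), list(zeros)
        self.ln2_gamma, self.ln2_beta = list(ones), list(zeros)


class ToyLayerPolicy:
    """Hypothetical policy exposing the four accessors described above."""

    _orig_layer_class = ToyLayer  # class this policy is meant to match

    def __init__(self, client_module):
        self.client_module = client_module

    def get_hidden_heads(self):
        m = self.client_module
        return m.hidden_size, m.num_heads

    def attention(self):
        m = self.client_module
        return m.qkv_w, m.qkv_b, m.dense_w, m.dense_b

    def mlp(self):
        m = self.client_module
        return m.fc_in_w, m.fc_in_b, m.fc_out_w, m.fc_out_b

    def layerNorm(self):
        m = self.client_module
        return m.ln1_gamma, m.ln1_beta, m.ln2_gamma, m.ln2_beta


policy = ToyLayerPolicy(ToyLayer())
hidden_size, num_heads = policy.get_hidden_heads()
head_dim = hidden_size // num_heads  # per-head dimension derived from the pair
```

Because every policy exposes the same accessors, the code that builds the optimized layer never needs to know how a given model family names its parameters.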

Usage

Policies are registered in the replace_policies and generic_policies lists and are automatically matched against model layers during the replace_module process. They are not typically instantiated directly by users.
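
The matching step can be pictured as a scan over a list of policy classes, comparing each one's _orig_layer_class with the layer at hand. The sketch below is an assumption about the general shape only: LayerA, LayerB, PolicyA, PolicyB, toy_replace_policies, and find_policy are hypothetical names, and the actual replace_module logic may differ.

```python
class LayerA:
    """Hypothetical layer type from one model family."""


class LayerB:
    """Hypothetical layer type from another model family."""


class ToyPolicyBase:
    _orig_layer_class = None  # set by each concrete policy

    def __init__(self, client_module):
        self.client_module = client_module


class PolicyA(ToyPolicyBase):
    _orig_layer_class = LayerA


class PolicyB(ToyPolicyBase):
    _orig_layer_class = LayerB


# Hypothetical registry, analogous in spirit to a list of policy classes.
toy_replace_policies = [PolicyA, PolicyB]


def find_policy(layer, policies):
    """Return an instance of the first policy whose target class matches."""
    for policy_cls in policies:
        target = policy_cls._orig_layer_class
        if target is not None and isinstance(layer, target):
            return policy_cls(layer)
    return None  # no registered policy handles this layer


matched = find_policy(LayerB(), toy_replace_policies)
unmatched = find_policy(object(), toy_replace_policies)
```

Keeping the registry as a plain list means supporting a new architecture only requires defining a policy class and appending it.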

Code Reference

| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | benchmark/third_party/DeepSpeed/deepspeed/module_inject/replace_policy.py |
| Lines | 1-748 |
| Type | AUTO_KEEP (vendored dependency) |

Key class signatures:

from abc import ABC

# ActivationFuncType is an enum defined elsewhere in DeepSpeed; its import is omitted here.

class DSPolicy(ABC):
    # Original layer class that this policy is matched against
    _orig_layer_class = None
    def attention(self):
        raise NotImplementedError

class TransformerPolicy(DSPolicy):
    def __init__(self, inference=True, linear_layer=True,
                 scale_attention=True, megatron_v2=False,
                 mlp_act_func_type=ActivationFuncType.GELU,
                 pre_attn_norm=True, use_load_prefix=False,
                 split_qkv=True):
        ...
    # Accessors overridden by each concrete policy
    def attention(self): ...
    def get_hidden_heads(self): ...
    def mlp(self): ...
    def layerNorm(self): ...
    def get_param_names(self): ...

I/O Contract

Inputs

| Parameter | Type | Required | Description |
|---|---|---|---|
| client_module | torch.nn.Module | Yes | The original transformer layer to extract parameters from |
| inference | bool | No | Whether to configure for inference mode (default: True) |
| linear_layer | bool | No | Use linear layer optimization (default: True) |
| scale_attention | bool | No | Scale attention scores by 1/sqrt(d) (default: True) |
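
The conventional scaling that a flag such as scale_attention refers to divides each query/key dot product by the square root of the per-head dimension. The sketch below shows only that arithmetic on plain Python lists; attention_score is a hypothetical helper, and how the flag is actually consumed by the DeepSpeed inference kernels is not shown in this page.

```python
import math


def attention_score(query, key, scale_attention=True):
    """Dot-product score for one query/key pair of equal length."""
    score = sum(q * k for q, k in zip(query, key))
    if scale_attention:
        # Divide by sqrt(head_dim) to keep score magnitudes stable as head_dim grows.
        score /= math.sqrt(len(query))
    return score


q = [1.0, 2.0, 3.0, 4.0]
k = [1.0, 1.0, 1.0, 1.0]
scaled = attention_score(q, k)                            # 10 / sqrt(4) = 5.0
unscaled = attention_score(q, k, scale_attention=False)   # 10.0
```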

Outputs

| Output | Type | Description |
|---|---|---|
| attention params | tuple | QKV weight, QKV bias, dense weight, dense bias, and configuration flags |
| mlp params | tuple | Intermediate weight/bias, output weight/bias |
| layerNorm params | tuple | Gamma and beta parameters for layer normalization |
| hidden_heads | tuple | (hidden_size, num_attention_heads) |
