Implementation:FMInference FlexLLMGen DeepSpeed Replace Module

Field	Value
Sources	Repo: FlexLLMGen, Upstream: DeepSpeed
Domains	Inference_Optimization, Module_Replacement
Last Updated	2026-02-09 00:00 GMT

Overview

Vendored DeepSpeed module that replaces standard PyTorch transformer layers with optimized DeepSpeed inference kernels for faster execution.

Description

The replace_module.py file (1197 lines) is a vendored copy of DeepSpeed's inference module replacement system, shipped under FlexLLMGen's benchmark/third_party directory. It provides the machinery to automatically identify transformer layers in a model and swap them with DeepSpeed's fused CUDA inference kernels.

Key components include:

ReplaceWithTensorSlicing -- Handles copying weights between original and replacement modules with support for model-parallel tensor slicing. The class provides qkv_copy for attention QKV weight redistribution across GPU ranks and copy for general weight slicing along input/output dimensions.
GroupQuantizer -- Performs int8 group quantization of weight tensors for inference: it splits the weights into groups and computes one scale factor per group, which is kept alongside the int8 values so they can be dequantized at inference time.
get_transformer_name -- Traverses a module's children to locate the transformer layer list by matching against known supported_models.
_module_match -- Iterates over generic_policies (UNet, VAE, etc.) to find a matching replacement policy for non-transformer modules such as diffusion model components.
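
The tensor slicing described for ReplaceWithTensorSlicing can be illustrated with a small, dependency-free sketch. It uses plain Python lists instead of torch tensors, and the helper names slice_rows and slice_fused_qkv are hypothetical, not part of the vendored file. It assumes a fused weight laid out as [Q; K; V] along the output rows; the real layout and slicing axis depend on the model and on the out_dim/in_dim settings.

```python
from typing import List

Matrix = List[List[float]]


def slice_rows(weight: Matrix, rank: int, mp_size: int) -> Matrix:
    """Return the contiguous block of rows owned by `rank` (hypothetical helper)."""
    rows = len(weight)
    if rows % mp_size != 0:
        raise ValueError("row count must be divisible by mp_size")
    shard = rows // mp_size
    return [row[:] for row in weight[rank * shard:(rank + 1) * shard]]


def slice_fused_qkv(weight: Matrix, rank: int, mp_size: int) -> Matrix:
    """Slice a fused [Q; K; V] weight so a rank keeps matching Q, K and V rows.

    Assumption: Q, K and V are stacked along the row axis in equal thirds.
    """
    rows = len(weight)
    if rows % 3 != 0:
        raise ValueError("fused QKV weight must have 3 * hidden rows")
    hidden = rows // 3
    out: Matrix = []
    for part in range(3):  # 0 = Q, 1 = K, 2 = V
        block = weight[part * hidden:(part + 1) * hidden]
        out.extend(slice_rows(block, rank, mp_size))
    return out


# 6x2 fused weight: rows 0-1 are Q, rows 2-3 are K, rows 4-5 are V.
fused = [[float(r), float(r) + 0.5] for r in range(6)]
shard0 = slice_fused_qkv(fused, rank=0, mp_size=2)
shard1 = slice_fused_qkv(fused, rank=1, mp_size=2)
```

Slicing the fused matrix naively into halves would give rank 0 all of Q plus part of K; splitting into Q, K and V first keeps each rank's shard self-consistent, which is why a separate QKV-aware copy is useful.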

The replacement process walks the model's module hierarchy, identifies layers matching registered policies, extracts their weights (attention QKV, dense, MLP, layer norms), and constructs optimized DeepSpeed inference transformer blocks in their place. Weight transfer respects model parallelism by slicing tensors across GPU ranks.
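
The walk-and-replace step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the vendored code: Module is a tiny stand-in for torch.nn.Module, FusedInferenceLayer stands in for a DeepSpeed inference block, and the policy registry is reduced to a dict keyed by layer class. Weight extraction and tensor slicing are omitted.

```python
from typing import Callable, Dict, Type


class Module:
    """Tiny stand-in for torch.nn.Module: just a named tree of children."""

    def __init__(self, **children: "Module") -> None:
        self.children: Dict[str, "Module"] = dict(children)


class Linear(Module):
    pass


class TransformerLayer(Module):
    pass


class FusedInferenceLayer(Module):
    """Hypothetical replacement block; keeps a reference to the layer it replaced."""

    def __init__(self, source: Module) -> None:
        super().__init__()
        self.source = source


Policies = Dict[Type[Module], Callable[[Module], Module]]


def replace_modules(root: Module, policies: Policies) -> int:
    """Recursively swap children whose type has a registered policy.

    Returns the number of replaced modules. A replaced child is not
    descended into, since the new block owns that subtree.
    """
    replaced = 0
    for name, child in list(root.children.items()):
        factory = policies.get(type(child))
        if factory is not None:
            root.children[name] = factory(child)
            replaced += 1
        else:
            replaced += replace_modules(child, policies)
    return replaced


model = Module(
    embed=Linear(),
    decoder=Module(
        layer0=TransformerLayer(attn=Linear()),
        layer1=TransformerLayer(attn=Linear()),
    ),
)
count = replace_modules(model, {TransformerLayer: FusedInferenceLayer})
```

Layers without a matching policy, such as embed here, are left untouched, so the rest of the model keeps its original modules.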

Usage

This module is invoked internally by DeepSpeed's deepspeed.init_inference() API. In the FlexLLMGen benchmark suite, it is part of the vendored DeepSpeed package used for baseline comparisons against FlexGen's offloading approach.

Code Reference

Field	Value
Repository	FlexLLMGen
File	benchmark/third_party/DeepSpeed/deepspeed/module_inject/replace_module.py
Lines	1-1197
Type	AUTO_KEEP (vendored dependency)

Key class signatures:

class ReplaceWithTensorSlicing:
    def __init__(self, mp_group=None, mp_size=1, out_dim=1, in_dim=0):
        ...
    def qkv_copy(self, dst, src):
        ...
    def copy(self, dst, src):
        ...

class GroupQuantizer:
    def __init__(self, q_int8=True, group_size=1, num_bits=8):
        ...
    def quantize(self, inputs, qkv=True, count=1, parallel_dim=0):
        ...
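
To make the idea of per-group scale factors concrete, here is a minimal sketch of symmetric int8 group quantization over a flat list of floats. It is an illustration only: the function names and the num_groups parameter are hypothetical, the scale convention (max absolute value mapped to 127) is an assumption, and the vendored GroupQuantizer.quantize operates on torch tensors and may differ in rounding and scale details.

```python
from typing import List, Tuple


def quantize_groups(
    values: List[float], num_groups: int, num_bits: int = 8
) -> Tuple[List[int], List[float]]:
    """Quantize `values` in equal-sized groups, one scale per group."""
    if num_groups <= 0 or len(values) % num_groups != 0:
        raise ValueError("len(values) must be a positive multiple of num_groups")
    q_max = 2 ** (num_bits - 1) - 1   # 127 for int8
    q_min = -(2 ** (num_bits - 1))    # -128 for int8
    size = len(values) // num_groups
    quantized: List[int] = []
    scales: List[float] = []
    for g in range(num_groups):
        group = values[g * size:(g + 1) * size]
        max_abs = max(abs(v) for v in group)
        # An all-zero group gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / q_max if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(q_min, min(q_max, round(v / scale))) for v in group)
    return quantized, scales


def dequantize_groups(quantized: List[int], scales: List[float]) -> List[float]:
    """Recover approximate floats by multiplying each value by its group scale."""
    size = len(quantized) // len(scales)
    return [q * scales[i // size] for i, q in enumerate(quantized)]


weights = [0.5, -1.0, 0.25, 0.0, 10.0, -20.0, 5.0, 2.5]
q, scales = quantize_groups(weights, num_groups=2)
restored = dequantize_groups(q, scales)
```

Because each group has its own scale, the small-magnitude first group is not crushed to zero by the large values in the second group, which is the main reason to quantize per group rather than per tensor.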

I/O Contract

Inputs

Parameter	Type	Required	Description
replaced_module	torch.nn.Module	Yes	The original model whose transformer layers will be replaced
mp_group	torch.distributed.ProcessGroup	No	Model parallel process group for tensor slicing
mp_size	int	No	Model parallel world size (default: 1)
q_int8	bool	No	Enable int8 group quantization (default: True)

Outputs

Output	Type	Description
modified_module	torch.nn.Module	Model with transformer layers replaced by DeepSpeed inference kernels
quantized_weights	torch.nn.Parameter	Int8 quantized weights with per-group scale factors (when quantization enabled)
