Implementation:FMInference FlexLLMGen DeepSpeed Replace Module
| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen, Upstream: DeepSpeed |
| Domains | Inference_Optimization, Module_Replacement |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Vendored DeepSpeed module that replaces standard PyTorch transformer layers with optimized DeepSpeed inference kernels for faster execution.
Description
The replace_module.py file (1197 lines) is a vendored copy of DeepSpeed's inference module replacement system, included in FlexLLMGen's benchmark third-party dependencies. It provides the machinery to automatically identify transformer layers in a model and swap them with DeepSpeed's fused CUDA inference kernels.
Key components include:
- ReplaceWithTensorSlicing -- Handles copying weights between original and replacement modules with support for model-parallel tensor slicing. The class provides qkv_copy for attention QKV weight redistribution across GPU ranks and copy for general weight slicing along input/output dimensions.
- GroupQuantizer -- Performs int8 group quantization of weight tensors for inference, splitting weights into groups and computing per-group scale factors for efficient dequantization.
- get_transformer_name -- Traverses a module's children to locate the transformer layer list by matching against known supported_models.
- _module_match -- Iterates over generic_policies (UNet, VAE, etc.) to find a matching replacement policy for non-transformer modules such as diffusion model components.
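The tensor slicing described for ReplaceWithTensorSlicing can be illustrated with a minimal sketch. It uses plain Python lists in place of tensors, and the helper names `slice_rows` and `slice_fused_qkv` are hypothetical, not part of DeepSpeed. It also assumes the fused QKV weight stacks the Q, K, and V blocks along the row dimension, which may not match every model's layout.

```python
# Illustrative only: plain Python lists stand in for weight tensors.
# slice_rows and slice_fused_qkv are hypothetical helpers, not DeepSpeed APIs.

def slice_rows(weight, rank, mp_size):
    """Return the contiguous block of rows owned by `rank`."""
    rows = len(weight)
    assert rows % mp_size == 0, "row count must divide evenly across ranks"
    per_rank = rows // mp_size
    return weight[rank * per_rank:(rank + 1) * per_rank]


def slice_fused_qkv(qkv_weight, rank, mp_size):
    """Slice a fused [Q; K; V] weight so each rank gets matching Q, K, V rows."""
    rows = len(qkv_weight)
    assert rows % 3 == 0, "fused QKV weight must stack three equal blocks"
    block = rows // 3
    q, k, v = (qkv_weight[i * block:(i + 1) * block] for i in range(3))
    # Slice each projection separately, then re-fuse the per-rank pieces.
    return (slice_rows(q, rank, mp_size)
            + slice_rows(k, rank, mp_size)
            + slice_rows(v, rank, mp_size))


# Hidden size 4, so the fused weight has 12 rows; rows are labelled for readability.
fused = [[f"{name}{i}"] for name in "qkv" for i in range(4)]
rank0 = slice_fused_qkv(fused, rank=0, mp_size=2)
rank1 = slice_fused_qkv(fused, rank=1, mp_size=2)
naive_rank0 = slice_rows(fused, rank=0, mp_size=2)
```

With two ranks, the naive row split in `naive_rank0` hands rank 0 all of Q plus half of K. Slicing each projection separately keeps every rank's Q, K, and V rows aligned, which is the reason a dedicated QKV copy path is useful.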
The replacement process walks the model's module hierarchy, identifies layers matching registered policies, extracts their weights (attention QKV, dense, MLP, layer norms), and constructs optimized DeepSpeed inference transformer blocks in their place. Weight transfer respects model parallelism by slicing tensors across GPU ranks.
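The walk-and-replace flow above can be sketched in a few lines. Everything here is a hypothetical stand-in: `Module`, `TransformerLayer`, `FusedInferenceLayer`, `TransformerPolicy`, `match_policy`, and `replace_layers` are illustrative names, not DeepSpeed or PyTorch classes, and the "fused" layer simply keeps a reference to the original instead of extracting real weights.

```python
# Illustrative only: a toy module tree and policy, not the DeepSpeed implementation.

class Module:
    def __init__(self, **children):
        self.children = dict(children)


class Linear(Module):
    pass


class TransformerLayer(Module):
    pass


class FusedInferenceLayer(Module):
    def __init__(self, source):
        super().__init__()
        self.source = source  # stand-in for weights copied from the original layer


class TransformerPolicy:
    def match(self, module):
        return isinstance(module, TransformerLayer)

    def build(self, module):
        return FusedInferenceLayer(module)


def match_policy(module, policies):
    """Return the first policy that accepts `module`, or None."""
    for policy in policies:
        if policy.match(module):
            return policy
    return None


def replace_layers(module, policies):
    """Recursively swap matching children in place; return the replacement count."""
    replaced = 0
    for name, child in list(module.children.items()):
        policy = match_policy(child, policies)
        if policy is not None:
            module.children[name] = policy.build(child)
            replaced += 1
        else:
            replaced += replace_layers(child, policies)
    return replaced


model = Module(
    embed=Linear(),
    layers=Module(block0=TransformerLayer(attn=Linear()),
                  block1=TransformerLayer(attn=Linear())),
    head=Linear(),
)
count = replace_layers(model, [TransformerPolicy()])
```

The recursion stops descending once a child matches, so the inner `attn` modules are never visited on their own; they travel with the layer that the policy consumed.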
Usage
This module is invoked internally by deepspeed.init_inference() and is not normally called directly. In the FlexLLMGen benchmark suite, it is part of the vendored DeepSpeed package used for baseline comparisons against FlexGen's offloading approach.
Code Reference
| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | benchmark/third_party/DeepSpeed/deepspeed/module_inject/replace_module.py |
| Lines | 1-1197 |
| Type | AUTO_KEEP (vendored dependency) |
Key class signatures:
class ReplaceWithTensorSlicing:
    def __init__(self, mp_group=None, mp_size=1, out_dim=1, in_dim=0):
        ...
    def qkv_copy(self, dst, src):
        ...
    def copy(self, dst, src):
        ...
class GroupQuantizer:
    def __init__(self, q_int8=True, group_size=1, num_bits=8):
        ...
    def quantize(self, inputs, qkv=True, count=1, parallel_dim=0):
        ...
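To make the idea behind GroupQuantizer concrete, here is a minimal pure-Python sketch of symmetric int8 group quantization with one scale per group. The helpers `quantize_groups` and `dequantize_groups` are hypothetical, and the scale formula is one common choice; the vendored `quantize` method may differ in details such as tensor layout and device handling.

```python
# Illustrative only: flat Python lists stand in for weight tensors.
# quantize_groups and dequantize_groups are hypothetical helpers, not DeepSpeed APIs.

def quantize_groups(values, num_groups, num_bits=8):
    """Quantize `values` in equal-sized groups; return (ints, per-group scales)."""
    assert len(values) % num_groups == 0, "values must split evenly into groups"
    group_len = len(values) // num_groups
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    quantized, scales = [], []
    for g in range(num_groups):
        group = values[g * group_len:(g + 1) * group_len]
        max_abs = max(abs(x) for x in group)
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = (2.0 * max_abs) / (2 ** num_bits) if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(q_min, min(q_max, round(x / scale))) for x in group)
    return quantized, scales


def dequantize_groups(quantized, scales):
    """Multiply each integer by the scale of the group it belongs to."""
    group_len = len(quantized) // len(scales)
    return [q * scales[i // group_len] for i, q in enumerate(quantized)]


weights = [0.5, -1.0, 0.25, 0.0, 8.0, -4.0, 2.0, 1.0]
q, scales = quantize_groups(weights, num_groups=2)
restored = dequantize_groups(q, scales)
```

Because each group carries its own scale, the small-magnitude first group keeps fine resolution even though the second group contains values eight times larger. The only lossy value in this example is 8.0, which clamps to the top of the int8 range.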
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| replaced_module | torch.nn.Module | Yes | The original model whose transformer layers will be replaced |
| mp_group | torch.distributed.ProcessGroup | No | Model parallel process group for tensor slicing, passed to ReplaceWithTensorSlicing (default: None) |
| mp_size | int | No | Model parallel world size, passed to ReplaceWithTensorSlicing (default: 1) |
| q_int8 | bool | No | Enable int8 group quantization in GroupQuantizer (default: True) |
Outputs
| Output | Type | Description |
|---|---|---|
| modified_module | torch.nn.Module | Model with transformer layers replaced by DeepSpeed inference kernels |
| quantized_weights | torch.nn.Parameter | Int8 quantized weights with per-group scale factors (only when quantization is enabled) |