
Implementation:FMInference FlexLLMGen DeepSpeed Replace Module

From Leeroopedia
Revision as of 14:56, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/FMInference_FlexLLMGen_DeepSpeed_Replace_Module.md)


Field | Value
Sources | Repo: FlexLLMGen; Upstream: DeepSpeed
Domains | Inference_Optimization, Module_Replacement
Last Updated | 2026-02-09 00:00 GMT

Overview

Vendored DeepSpeed module that replaces standard PyTorch transformer layers with optimized DeepSpeed inference kernels for faster execution.

Description

The replace_module.py file (1197 lines) is a vendored copy of DeepSpeed's inference module replacement system, included in FlexLLMGen's benchmark third-party dependencies. It provides the machinery to automatically identify transformer layers in a model and swap them with DeepSpeed's fused CUDA inference kernels.

Key components include:

  • ReplaceWithTensorSlicing -- Handles copying weights between original and replacement modules with support for model-parallel tensor slicing. The class provides qkv_copy for attention QKV weight redistribution across GPU ranks and copy for general weight slicing along input/output dimensions.
  • GroupQuantizer -- Performs int8 group quantization of weight tensors for inference, splitting weights into groups and computing per-group scale factors for efficient dequantization.
  • get_transformer_name -- Traverses a module's children to locate the transformer layer list by matching against known supported_models.
  • _module_match -- Iterates over generic_policies (UNet, VAE, etc.) to find a matching replacement policy for non-transformer modules such as diffusion model components.
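
The tensor slicing described for ReplaceWithTensorSlicing can be pictured with a minimal, framework-free sketch. It is not the vendored implementation: the helper names slice_rows and slice_fused_qkv are hypothetical, matrices are plain nested lists, and the layout (Q, K, and V stacked along the first axis) is an assumption made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def slice_rows(weight: Matrix, rank: int, mp_size: int) -> Matrix:
    """Return the contiguous block of rows owned by `rank` (hypothetical helper)."""
    rows = len(weight)
    if rows % mp_size != 0:
        raise ValueError("row count must be divisible by mp_size")
    shard = rows // mp_size
    return weight[rank * shard:(rank + 1) * shard]


def slice_fused_qkv(qkv: Matrix, rank: int, mp_size: int) -> Matrix:
    """Shard a fused [Q; K; V] matrix so a rank keeps matching Q, K, V slices."""
    if len(qkv) % 3 != 0:
        raise ValueError("fused QKV must stack three equal blocks")
    block = len(qkv) // 3
    q, k, v = qkv[:block], qkv[block:2 * block], qkv[2 * block:]
    # Slice each projection separately, then re-stack the rank's three shards.
    return (slice_rows(q, rank, mp_size)
            + slice_rows(k, rank, mp_size)
            + slice_rows(v, rank, mp_size))


# Toy fused weight: rows 0-1 are Q, rows 2-3 are K, rows 4-5 are V.
fused = [[float(i)] for i in range(6)]
rank0 = slice_fused_qkv(fused, rank=0, mp_size=2)
rank1 = slice_fused_qkv(fused, rank=1, mp_size=2)
```

Slicing the fused matrix as one contiguous block would hand rank 0 all of Q and half of K, which is why the three projections are split first and sharded individually.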

The replacement process walks the model's module hierarchy, identifies layers matching registered policies, extracts their weights (attention QKV, dense, MLP, layer norms), and constructs optimized DeepSpeed inference transformer blocks in their place. Weight transfer respects model parallelism by slicing tensors across GPU ranks.
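
The walk-and-swap step can be sketched without PyTorch. Everything here is a stand-in under stated assumptions: Node is a hypothetical substitute for torch.nn.Module, a policy is reduced to a (match, build) pair, and weight extraction is omitted entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a module: a kind tag plus named children."""
    kind: str
    children: Dict[str, "Node"] = field(default_factory=dict)


# A policy pairs a match predicate with a builder for the replacement node.
Policy = Tuple[Callable[[Node], bool], Callable[[Node], Node]]


def replace_matching(root: Node, policies: List[Policy], prefix: str = "") -> List[str]:
    """Swap every child matched by a policy; return the dotted names replaced."""
    replaced = []
    for name, child in list(root.children.items()):
        path = f"{prefix}.{name}" if prefix else name
        for matches, build in policies:
            if matches(child):
                root.children[name] = build(child)
                replaced.append(path)
                break
        else:
            # No policy matched this child, so keep descending into its subtree.
            replaced.extend(replace_matching(child, policies, path))
    return replaced


model = Node("Model", {
    "embed": Node("Embedding"),
    "layers": Node("ModuleList", {
        "0": Node("TransformerLayer"),
        "1": Node("TransformerLayer"),
    }),
})
policies = [(lambda n: n.kind == "TransformerLayer",
             lambda n: Node("FusedInferenceLayer"))]
replaced = replace_matching(model, policies)
```

In this sketch a matched child is replaced and not descended into further, while unmatched containers such as the layer list are traversed recursively.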

Usage

This module is invoked internally by DeepSpeed's deepspeed.init_inference() API. In the FlexLLMGen benchmark suite, it is part of the vendored DeepSpeed package used for baseline comparisons against FlexGen's offloading approach.
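
A call that triggers module replacement might look like the sketch below. The wrapper build_inference_engine is hypothetical, and the keyword arguments (mp_size, dtype, replace_with_kernel_inject) follow the public init_inference signature of older DeepSpeed releases; whether the vendored copy accepts exactly these arguments is an assumption that should be checked against that copy.

```python
def build_inference_engine(model, mp_size=1):
    """Wrap a PyTorch model with DeepSpeed inference kernels (assumed call pattern)."""
    # Imported lazily so the sketch can be loaded on a machine without a GPU stack.
    import torch
    import deepspeed

    # replace_with_kernel_inject=True asks DeepSpeed to run module replacement.
    # Argument names follow older DeepSpeed releases and may differ in the
    # vendored copy.
    return deepspeed.init_inference(
        model,
        mp_size=mp_size,
        dtype=torch.half,
        replace_with_kernel_inject=True,
    )
```

The returned engine wraps the model, so inference goes through the engine object in place of the original module.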

Code Reference

Field | Value
Repository | FlexLLMGen
File | benchmark/third_party/DeepSpeed/deepspeed/module_inject/replace_module.py
Lines | 1-1197
Type | AUTO_KEEP (vendored dependency)

Key class signatures:

class ReplaceWithTensorSlicing:
    def __init__(self, mp_group=None, mp_size=1, out_dim=1, in_dim=0):
        ...
    def qkv_copy(self, dst, src):
        ...
    def copy(self, dst, src):
        ...

class GroupQuantizer:
    def __init__(self, q_int8=True, group_size=1, num_bits=8):
        ...
    def quantize(self, inputs, qkv=True, count=1, parallel_dim=0):
        ...
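
To make the idea behind GroupQuantizer.quantize concrete, here is a minimal pure-Python sketch of symmetric per-group quantization. It is an illustration under assumptions, not the vendored code: the function names are hypothetical, the input is a flat list instead of a tensor, and the scale formula (largest magnitude in the group divided by 127) is one common choice that may differ from the file's exact arithmetic.

```python
from typing import List, Tuple


def quantize_groups(values: List[float], num_groups: int, num_bits: int = 8
                    ) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-group quantization: integer groups plus one scale per group."""
    if len(values) % num_groups != 0:
        raise ValueError("values must split evenly into groups")
    size = len(values) // num_groups
    q_max = 2 ** (num_bits - 1) - 1   # 127 for int8
    q_min = -(2 ** (num_bits - 1))    # -128 for int8
    quantized, scales = [], []
    for g in range(num_groups):
        group = values[g * size:(g + 1) * size]
        max_abs = max(abs(v) for v in group)
        # Assumed scale rule: map the group's largest magnitude onto q_max.
        scale = max_abs / q_max if max_abs > 0 else 1.0
        quantized.append([min(q_max, max(q_min, round(v / scale))) for v in group])
        scales.append(scale)
    return quantized, scales


def dequantize_groups(quantized: List[List[int]], scales: List[float]) -> List[float]:
    """Recover approximate floats by multiplying each group by its scale."""
    return [q * s for group, s in zip(quantized, scales) for q in group]


weights = [0.5, -1.0, 0.25, 0.0, 8.0, -4.0, 2.0, 1.0]
q, scales = quantize_groups(weights, num_groups=2)
restored = dequantize_groups(q, scales)
```

Because each group carries its own scale, the small-magnitude first group keeps finer resolution than it would if it shared a scale with the second group, whose largest value is 8.0.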

I/O Contract

Inputs

Parameter | Type | Required | Description
replaced_module | torch.nn.Module | Yes | The original model whose transformer layers will be replaced
mp_group | torch.distributed.ProcessGroup | No | Model parallel process group for tensor slicing
mp_size | int | No | Model parallel world size (default: 1)
q_int8 | bool | No | Enable int8 group quantization (default: True)

Outputs

Output | Type | Description
modified_module | torch.nn.Module | Model with transformer layers replaced by DeepSpeed inference kernels
quantized_weights | torch.nn.Parameter | Int8 quantized weights with per-group scale factors (when quantization is enabled)
