Implementation:FMInference FlexLLMGen DeepSpeed Compression Layers
| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen |
| Domains | Model_Compression, Quantization, Pruning, Neural_Network_Layers |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Vendored DeepSpeed compression-aware neural network layer implementations that extend standard PyTorch layers (Linear, Conv2d, Embedding, BatchNorm2d) with integrated support for weight quantization, activation quantization, sparse pruning, row/head pruning, and channel pruning.
Description
basic_layer.py provides drop-in replacement layers for PyTorch's nn.Linear, nn.Conv2d, nn.Embedding, and nn.BatchNorm2d on which various compression techniques can be enabled dynamically during training. Compression is applied in the forward pass and can be introduced gradually via schedule offsets.
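To make the forward-pass mechanism concrete, here is a minimal sketch of uniform symmetric "fake quantization" of a weight tensor, as compression-aware layers typically apply it at forward time. This is an illustrative assumption, not the vendored implementation: the function name, the single per-tensor scale, and the rounding choice are all hypothetical.

```python
import numpy as np

def symmetric_quantize(w, num_bits):
    """Illustrative uniform symmetric quantization: snap weights to
    2**num_bits evenly spaced levels around zero, then de-quantize
    back to floats ("fake quant"), so downstream math stays in float."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax           # one scale for the whole tensor (assumed)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                          # de-quantized weights used in forward()

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
w_q = symmetric_quantize(w, 8)               # close to w, but on a 256-level grid
```

During training, such layers usually pair this with a straight-through estimator so gradients flow through the rounding step unchanged.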
Key classes:
- QuantAct -- Standalone activation quantization module using exponential moving average for range calibration (static mode) or per-forward min/max (dynamic mode). Supports symmetric and asymmetric quantization.
- Embedding_Compress -- Extends nn.Embedding with weight quantization support. Uses token-wise quantization groups (one group per embedding row).
- LinearLayer_Compress -- Extends nn.Linear with:
  - Sparse pruning -- L1-norm or TopK masking of individual weights.
  - Row pruning -- L1-norm or TopK masking of entire output rows.
  - Head pruning -- TopK masking of entire attention heads (applied to the attention output projection, the O matrix).
  - Weight quantization -- Symmetric/asymmetric quantization with configurable bit-width (1-bit binary, 2-bit ternary, 4/8-bit standard).
  - Activation quantization -- Static (EMA range) or dynamic (per-forward range) quantization of input activations.
- Conv2dLayer_Compress -- Extends nn.Conv2d with sparse pruning, channel pruning, weight quantization, and activation quantization.
- BNLayer_Compress -- Extends nn.BatchNorm2d with channel pruning support for dimension reduction.
- ColumnParallelLinear_Compress and RowParallelLinear_Compress -- Model-parallel variants of LinearLayer_Compress that partition weights across tensor-parallel groups while preserving all compression capabilities.
- Model-parallel autograd functions -- _CopyToModelParallelRegion, _ReduceFromModelParallelRegion, _ScatterToModelParallelRegion, _GatherFromModelParallelRegion implement the forward/backward communication patterns for tensor parallelism.
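QuantAct's static-mode range calibration can be sketched as a simple exponential moving average over per-batch min/max values. The class and method names below are illustrative; only the momentum rule and the 0.95 default come from the source.

```python
import numpy as np

class EmaActRange:
    """Illustrative static-mode activation range tracker: blend each
    batch's observed min/max into a running range with momentum,
    mirroring QuantAct's act_range_momentum (default 0.95)."""
    def __init__(self, momentum=0.95):
        self.momentum = momentum
        self.x_min = None
        self.x_max = None

    def update(self, x):
        cur_min, cur_max = float(x.min()), float(x.max())
        if self.x_min is None:                  # first batch seeds the range
            self.x_min, self.x_max = cur_min, cur_max
        else:                                   # EMA blend of old and new range
            m = self.momentum
            self.x_min = m * self.x_min + (1 - m) * cur_min
            self.x_max = m * self.x_max + (1 - m) * cur_max
        return self.x_min, self.x_max
```

In dynamic mode, by contrast, each forward pass simply uses the current batch's min/max directly, with no state carried between batches.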
This is vendored code from DeepSpeed (marked AUTO_KEEP).
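The four model-parallel autograd functions come in dual pairs: the backward pass of a copy is an all-reduce, and the backward pass of an all-reduce is a copy. The sketch below simulates that duality with plain lists standing in for ranks and collectives; everything here is illustrative, not the vendored autograd code.

```python
# Simulated forward semantics of _CopyToModelParallelRegion and
# _ReduceFromModelParallelRegion, with lists in place of real
# collective communication across tensor-parallel ranks (assumption).

def copy_to_model_parallel(x, world_size):
    """Forward: identity -- every rank sees the same input.
    Its backward is an all-reduce (sum) of the ranks' gradients."""
    return [x] * world_size

def reduce_from_model_parallel(partials):
    """Forward: all-reduce -- sum the per-rank partial outputs.
    Its backward is an identity copy of the gradient to every rank."""
    return sum(partials)
```

The scatter/gather pair (_ScatterToModelParallelRegion, _GatherFromModelParallelRegion) is dual in the same way: scatter's backward is a gather, and gather's backward is a scatter.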
Code Reference
| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | benchmark/third_party/DeepSpeed/deepspeed/compression/basic_layer.py |
| Lines | 1-923 |
Key Classes:
```python
class QuantAct(nn.Module):
    def __init__(self, act_range_momentum=0.95, quant_mode='symmetric'): ...
    def forward(self, x, num_bits, *args): ...

class Embedding_Compress(nn.Embedding):
    def enable_weight_quantization(self, start_bits, target_bits, ...): ...

class LinearLayer_Compress(nn.Linear):
    def enable_sparse_pruning(self, ratio, method): ...
    def enable_row_pruning(self, ratio, method): ...
    def enable_head_pruning(self, ratio, method, num_heads): ...
    def enable_weight_quantization(self, start_bits, target_bits, ...): ...
    def enable_activation_quantization(self, bits, quantization_type, range_calibration): ...
    def forward(self, input, skip_bias_add=False): ...

class Conv2dLayer_Compress(nn.Conv2d):
    def enable_sparse_pruning(self, ratio, method): ...
    def enable_channel_pruning(self, ratio, method): ...

class ColumnParallelLinear_Compress(LinearLayer_Compress): ...

class RowParallelLinear_Compress(LinearLayer_Compress): ...
```
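The TopK masking behind enable_sparse_pruning can be sketched as: keep the largest-magnitude (1 - ratio) fraction of weights and zero the rest. The function below is a hypothetical standalone illustration; only the (ratio, method) parameters come from the source signatures.

```python
import numpy as np

def topk_sparse_mask(w, ratio):
    """Illustrative TopK sparse pruning: build a binary mask that keeps
    the k = size * (1 - ratio) largest-magnitude weights, then apply it.
    The real layer stores such a mask as a buffer and multiplies it
    into the weight during forward() (assumed behavior)."""
    k = int(w.size * (1 - ratio))            # number of weights to keep
    thresh = np.sort(np.abs(w).ravel())[-k]  # magnitude of the k-th largest weight
    mask = (np.abs(w) >= thresh).astype(w.dtype)
    return w * mask, mask

w = np.array([[1.0, -4.0], [0.5, 3.0]])
pruned, mask = topk_sparse_mask(w, 0.5)      # keeps -4.0 and 3.0, zeroes the rest
```

L1-norm masking differs only in how the threshold is chosen; the mask-and-multiply mechanics are the same.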
I/O Contract
LinearLayer_Compress.forward
| Parameter | Type | Required | Description |
|---|---|---|---|
| input | Tensor | Yes | Input activation tensor |
| skip_bias_add | bool | No | If True, returns (output, bias) separately for fused bias-add patterns |

| Output | Type | Description |
|---|---|---|
| output | Tensor | Compressed forward pass result with all enabled compression techniques applied |
| bias | Tensor or None | Bias tensor (only when skip_bias_add=True) |
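The skip_bias_add contract can be sketched as follows; the helper below is hypothetical and uses plain matrix math in place of the real compressed forward, but the two-return-value shape matches the table above.

```python
import numpy as np

def linear_forward(x, weight, bias, skip_bias_add=False):
    """Sketch of the skip_bias_add contract: when True, the bias is
    returned un-applied so the caller can fuse it into a later kernel
    (e.g. a fused bias + dropout + residual-add). Illustrative only."""
    out = x @ weight.T
    if skip_bias_add:
        return out, bias          # caller is responsible for adding bias
    return out + bias, None       # bias already applied

x = np.ones((1, 2))
w = np.ones((3, 2))
b = np.array([1.0, 1.0, 1.0])
y, deferred = linear_forward(x, w, b, skip_bias_add=True)  # bias left for the caller
```

Deferring the bias this way lets frameworks avoid a separate elementwise kernel launch for the bias-add.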