
Implementation:FMInference FlexLLMGen DeepSpeed Compression Layers

From Leeroopedia


Field Value
Sources Repo: FlexLLMGen
Domains Model_Compression, Quantization, Pruning, Neural_Network_Layers
Last Updated 2026-02-09 00:00 GMT

Overview

Vendored DeepSpeed compression-aware neural network layer implementations that extend standard PyTorch layers (Linear, Conv2d, Embedding, BatchNorm2d) with integrated support for weight quantization, activation quantization, sparse pruning, row/head pruning, and channel pruning.

Description

basic_layer.py provides drop-in replacements for PyTorch's nn.Linear, nn.Conv2d, nn.Embedding, and nn.BatchNorm2d on which individual compression techniques can be dynamically enabled during training. Compression is applied in the forward pass and can be introduced gradually via schedule offsets.
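The schedule-offset mechanism can be sketched as follows. This is an illustrative plain-Python sketch, not the DeepSpeed API; the names compression_active, schedule_offset, and forward_weight are ours:

```python
# Illustrative sketch, not the DeepSpeed API: compression stays inactive
# until training reaches a configured step offset, so the model first
# trains at full precision and compression is introduced later.
def compression_active(global_step: int, schedule_offset: int) -> bool:
    """True once the training step has reached the schedule offset."""
    return global_step >= schedule_offset

def forward_weight(weight, global_step, schedule_offset, compress_fn):
    # Hypothetical forward-pass logic mirroring the pattern described
    # above: apply the compression transform only once the schedule allows.
    if compression_active(global_step, schedule_offset):
        return compress_fn(weight)
    return weight
```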

Key classes:

  • QuantAct -- Standalone activation quantization module using exponential moving average for range calibration (static mode) or per-forward min/max (dynamic mode). Supports symmetric and asymmetric quantization.
  • Embedding_Compress -- Extends nn.Embedding with weight quantization support. Uses token-wise quantization groups (one group per embedding row).
  • LinearLayer_Compress -- Extends nn.Linear with:
    • Sparse pruning -- L1-norm or TopK masking of individual weights.
    • Row pruning -- L1-norm or TopK masking of entire output rows.
    • Head pruning -- TopK masking of attention heads (applied to the O matrix).
    • Weight quantization -- Symmetric/asymmetric quantization with configurable bit-width (1-bit binary, 2-bit ternary, 4/8-bit standard).
    • Activation quantization -- Static (EMA range) or dynamic (per-forward range) quantization of input activations.
  • Conv2dLayer_Compress -- Extends nn.Conv2d with sparse pruning, channel pruning, weight quantization, and activation quantization.
  • BNLayer_Compress -- Extends nn.BatchNorm2d with channel pruning support for dimension reduction.
  • ColumnParallelLinear_Compress and RowParallelLinear_Compress -- Model-parallel variants of LinearLayer_Compress that partition weights across tensor-parallel groups while preserving all compression capabilities.
  • Model-parallel autograd functions -- _CopyToModelParallelRegion, _ReduceFromModelParallelRegion, _ScatterToModelParallelRegion, _GatherFromModelParallelRegion implement the forward/backward communication patterns for tensor parallelism.
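To make the symmetric/asymmetric distinction above concrete, here is a minimal fake-quantization sketch in plain Python. The helper names are ours; the real layers operate on tensors and support the full bit-width range listed above:

```python
def fake_quant_symmetric(xs, num_bits):
    # Symmetric: zero-point fixed at 0; the grid covers [-amax, +amax].
    qmax = 2 ** (num_bits - 1) - 1
    amax = max(abs(x) for x in xs)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in xs]
    return [v * scale for v in q]  # quantize-dequantize ("fake quant")

def fake_quant_asymmetric(xs, num_bits):
    # Asymmetric: a zero-point shift lets the grid cover [min, max]
    # exactly, wasting no levels when data is not centered on zero.
    qmax = 2 ** num_bits - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = [round((x - lo) / scale) for x in xs]
    return [v * scale + lo for v in q]
```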

This is AUTO_KEEP-classified vendored code from DeepSpeed.
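The static-mode (EMA) range calibration mentioned for QuantAct can be sketched as a momentum update. This is a plain-Python sketch under the assumption that act_range_momentum weights the running range, with our own helper names:

```python
def ema_update(running, observed, momentum=0.95):
    # Static-mode calibration: the activation range is tracked as an
    # exponential moving average across forward passes, rather than
    # being recomputed from scratch per batch (dynamic mode).
    if running is None:          # first forward: adopt the observed range
        return observed
    return momentum * running + (1.0 - momentum) * observed

def calibrate(batches, momentum=0.95):
    # Track the min and max of successive activation batches.
    lo = hi = None
    for xs in batches:
        lo = ema_update(lo, min(xs), momentum)
        hi = ema_update(hi, max(xs), momentum)
    return lo, hi
```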

Code Reference

Field Value
Repository FlexLLMGen
File benchmark/third_party/DeepSpeed/deepspeed/compression/basic_layer.py
Lines 1-923

Key Classes:

class QuantAct(nn.Module):
    def __init__(self, act_range_momentum=0.95, quant_mode='symmetric'): ...
    def forward(self, x, num_bits, *args): ...

class Embedding_Compress(nn.Embedding):
    def enable_weight_quantization(self, start_bits, target_bits, ...): ...

class LinearLayer_Compress(nn.Linear):
    def enable_sparse_pruning(self, ratio, method): ...
    def enable_row_pruning(self, ratio, method): ...
    def enable_head_pruning(self, ratio, method, num_heads): ...
    def enable_weight_quantization(self, start_bits, target_bits, ...): ...
    def enable_activation_quantization(self, bits, quantization_type, range_calibration): ...
    def forward(self, input, skip_bias_add=False): ...

class Conv2dLayer_Compress(nn.Conv2d):
    def enable_sparse_pruning(self, ratio, method): ...
    def enable_channel_pruning(self, ratio, method): ...

class ColumnParallelLinear_Compress(LinearLayer_Compress): ...
class RowParallelLinear_Compress(LinearLayer_Compress): ...
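The row-pruning idea behind enable_row_pruning can be illustrated with a small sketch over nested lists (the real implementation masks nn.Linear weight rows on tensors; topk_row_mask and row_l1_norms are illustrative names, not the DeepSpeed API):

```python
def row_l1_norms(weight):
    # weight is a list of output rows; score each row by its L1 norm.
    return [sum(abs(w) for w in row) for row in weight]

def topk_row_mask(weight, ratio):
    # Keep the (1 - ratio) fraction of rows with the largest L1 norms
    # and zero the rest, mirroring the TopK row pruning described above.
    scores = row_l1_norms(weight)
    n_keep = len(weight) - int(ratio * len(weight))
    keep = set(sorted(range(len(weight)), key=lambda i: scores[i],
                      reverse=True)[:n_keep])
    return [row if i in keep else [0.0] * len(row)
            for i, row in enumerate(weight)]
```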

I/O Contract

LinearLayer_Compress.forward

Parameter Type Required Description
input Tensor Yes Input activation tensor
skip_bias_add bool No If True, returns (output, bias) separately for fused bias-add patterns

Output Type Description
output Tensor Compressed forward-pass result with all enabled compression techniques applied
bias Tensor or None Bias tensor (only when skip_bias_add=True)
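The skip_bias_add contract above can be sketched in plain Python on lists; linear_forward is our illustrative name, not the layer's real implementation:

```python
def linear_forward(x, weight, bias, skip_bias_add=False):
    # out[i] = sum_j weight[i][j] * x[j]; each weight row is one output
    # feature (any enabled compression would be applied to weight first).
    out = [sum(wij * xj for wij, xj in zip(row, x)) for row in weight]
    if skip_bias_add:
        # Return the bias separately so the caller can fuse it into a
        # later kernel (e.g. bias + activation) instead of adding it here.
        return out, bias
    return [o + b for o, b in zip(out, bias)], None
```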
