Implementation:FMInference FlexLLMGen DeepSpeed Compression Layers
| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen |
| Domains | Model_Compression, Quantization, Pruning, Neural_Network_Layers |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Vendored DeepSpeed compression-aware neural network layer implementations that extend standard PyTorch layers (Linear, Conv2d, Embedding, BatchNorm2d) with integrated support for weight quantization, activation quantization, sparse pruning, row/head pruning, and channel pruning.
Description
basic_layer.py provides drop-in replacement layers for PyTorch's nn.Linear, nn.Conv2d, nn.Embedding, and nn.BatchNorm2d on which various compression techniques can be enabled dynamically during training. Compression is applied in the forward pass and can be introduced gradually via schedule offsets.
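To make the forward-pass mechanism concrete, here is a minimal sketch of uniform symmetric "fake quantization" of a weight tensor, as compression-aware layers typically apply it at forward time. This is an illustrative assumption, not the vendored implementation: the function name, the single per-tensor scale, and the rounding choice are all hypothetical.

```python
import numpy as np

def symmetric_quantize(w, num_bits):
    """Illustrative uniform symmetric quantization: snap weights to
    2**num_bits evenly spaced levels around zero, then de-quantize
    back to floats ("fake quant"), so downstream math stays in float."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax           # one scale for the whole tensor (assumed)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                          # de-quantized weights used in forward()

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
w_q = symmetric_quantize(w, 8)               # close to w, but on a 256-level grid
```

During training, such layers usually pair this with a straight-through estimator so gradients flow through the rounding step unchanged.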
Key classes:
- QuantAct -- Standalone activation quantization module using exponential moving average for range calibration (static mode) or per-forward min/max (dynamic mode). Supports symmetric and asymmetric quantization.
- Embedding_Compress -- Extends nn.Embedding with weight quantization support. Uses token-wise quantization groups (one group per embedding row).
- LinearLayer_Compress -- Extends nn.Linear with:
  - Sparse pruning -- L1-norm or TopK masking of individual weights.
  - Row pruning -- L1-norm or TopK masking of entire output rows.
  - Head pruning -- TopK masking of entire attention heads (applied to the attention output projection, the O matrix).
  - Weight quantization -- Symmetric/asymmetric quantization with configurable bit-width (1-bit binary, 2-bit ternary, 4/8-bit standard).
  - Activation quantization -- Static (EMA range) or dynamic (per-forward range) quantization of input activations.
- Conv2dLayer_Compress -- Extends nn.Conv2d with sparse pruning, channel pruning, weight quantization, and activation quantization.
- BNLayer_Compress -- Extends nn.BatchNorm2d with channel pruning support for dimension reduction.
- ColumnParallelLinear_Compress and RowParallelLinear_Compress -- Model-parallel variants of LinearLayer_Compress that partition weights across tensor-parallel groups while preserving all compression capabilities.
- Model-parallel autograd functions -- _CopyToModelParallelRegion, _ReduceFromModelParallelRegion, _ScatterToModelParallelRegion, _GatherFromModelParallelRegion implement the forward/backward communication patterns for tensor parallelism.
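QuantAct's static-mode range calibration can be sketched as a simple exponential moving average over per-batch min/max values. The class and method names below are illustrative; only the momentum rule and the 0.95 default come from the source.

```python
import numpy as np

class EmaActRange:
    """Illustrative static-mode activation range tracker: blend each
    batch's observed min/max into a running range with momentum,
    mirroring QuantAct's act_range_momentum (default 0.95)."""
    def __init__(self, momentum=0.95):
        self.momentum = momentum
        self.x_min = None
        self.x_max = None

    def update(self, x):
        cur_min, cur_max = float(x.min()), float(x.max())
        if self.x_min is None:                  # first batch seeds the range
            self.x_min, self.x_max = cur_min, cur_max
        else:                                   # EMA blend of old and new range
            m = self.momentum
            self.x_min = m * self.x_min + (1 - m) * cur_min
            self.x_max = m * self.x_max + (1 - m) * cur_max
        return self.x_min, self.x_max
```

In dynamic mode, by contrast, each forward pass simply uses the current batch's min/max directly, with no state carried between batches.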
This is vendored code from DeepSpeed (marked AUTO_KEEP).
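The four model-parallel autograd functions come in dual pairs: the backward pass of a copy is an all-reduce, and the backward pass of an all-reduce is a copy. The sketch below simulates that duality with plain lists standing in for ranks and collectives; everything here is illustrative, not the vendored autograd code.

```python
# Simulated forward semantics of _CopyToModelParallelRegion and
# _ReduceFromModelParallelRegion, with lists in place of real
# collective communication across tensor-parallel ranks (assumption).

def copy_to_model_parallel(x, world_size):
    """Forward: identity -- every rank sees the same input.
    Its backward is an all-reduce (sum) of the ranks' gradients."""
    return [x] * world_size

def reduce_from_model_parallel(partials):
    """Forward: all-reduce -- sum the per-rank partial outputs.
    Its backward is an identity copy of the gradient to every rank."""
    return sum(partials)
```

The scatter/gather pair (_ScatterToModelParallelRegion, _GatherFromModelParallelRegion) is dual in the same way: scatter's backward is a gather, and gather's backward is a scatter.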
Code Reference
| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | benchmark/third_party/DeepSpeed/deepspeed/compression/basic_layer.py |
| Lines | 1-923 |
Key Classes:
```python
class QuantAct(nn.Module):
    def __init__(self, act_range_momentum=0.95, quant_mode='symmetric'): ...
    def forward(self, x, num_bits, *args): ...

class Embedding_Compress(nn.Embedding):
    def enable_weight_quantization(self, start_bits, target_bits, ...): ...

class LinearLayer_Compress(nn.Linear):
    def enable_sparse_pruning(self, ratio, method): ...
    def enable_row_pruning(self, ratio, method): ...
    def enable_head_pruning(self, ratio, method, num_heads): ...
    def enable_weight_quantization(self, start_bits, target_bits, ...): ...
    def enable_activation_quantization(self, bits, quantization_type, range_calibration): ...
    def forward(self, input, skip_bias_add=False): ...

class Conv2dLayer_Compress(nn.Conv2d):
    def enable_sparse_pruning(self, ratio, method): ...
    def enable_channel_pruning(self, ratio, method): ...

class ColumnParallelLinear_Compress(LinearLayer_Compress): ...

class RowParallelLinear_Compress(LinearLayer_Compress): ...
```
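The TopK masking behind enable_sparse_pruning can be sketched as: keep the largest-magnitude (1 - ratio) fraction of weights and zero the rest. The function below is a hypothetical standalone illustration; only the (ratio, method) parameters come from the source signatures.

```python
import numpy as np

def topk_sparse_mask(w, ratio):
    """Illustrative TopK sparse pruning: build a binary mask that keeps
    the k = size * (1 - ratio) largest-magnitude weights, then apply it.
    The real layer stores such a mask as a buffer and multiplies it
    into the weight during forward() (assumed behavior)."""
    k = int(w.size * (1 - ratio))            # number of weights to keep
    thresh = np.sort(np.abs(w).ravel())[-k]  # magnitude of the k-th largest weight
    mask = (np.abs(w) >= thresh).astype(w.dtype)
    return w * mask, mask

w = np.array([[1.0, -4.0], [0.5, 3.0]])
pruned, mask = topk_sparse_mask(w, 0.5)      # keeps -4.0 and 3.0, zeroes the rest
```

L1-norm masking differs only in how the threshold is chosen; the mask-and-multiply mechanics are the same.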
I/O Contract
LinearLayer_Compress.forward
| Parameter | Type | Required | Description |
|---|---|---|---|
| input | Tensor | Yes | Input activation tensor |
| skip_bias_add | bool | No | If True, returns (output, bias) separately for fused bias-add patterns |

| Output | Type | Description |
|---|---|---|
| output | Tensor | Compressed forward pass result with all enabled compression techniques applied |
| bias | Tensor or None | Bias tensor (only when skip_bias_add=True) |
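The skip_bias_add contract can be sketched as follows; the helper below is hypothetical and uses plain matrix math in place of the real compressed forward, but the two-return-value shape matches the table above.

```python
import numpy as np

def linear_forward(x, weight, bias, skip_bias_add=False):
    """Sketch of the skip_bias_add contract: when True, the bias is
    returned un-applied so the caller can fuse it into a later kernel
    (e.g. a fused bias + dropout + residual-add). Illustrative only."""
    out = x @ weight.T
    if skip_bias_add:
        return out, bias          # caller is responsible for adding bias
    return out + bias, None       # bias already applied

x = np.ones((1, 2))
w = np.ones((3, 2))
b = np.array([1.0, 1.0, 1.0])
y, deferred = linear_forward(x, w, b, skip_bias_add=True)  # bias left for the caller
```

Deferring the bias this way lets frameworks avoid a separate elementwise kernel launch for the bias-add.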