Implementation:FMInference FlexLLMGen DeepSpeed Compression Config
| Field | Value |
|---|---|
| Sources | Repo: FlexLLMGen |
| Domains | Model_Compression, Configuration_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Vendored DeepSpeed compression configuration parser that reads the compression training section of a DeepSpeed JSON config and produces structured configuration dictionaries for weight quantization, activation quantization, sparse pruning, row pruning, head pruning, channel pruning, and layer reduction.
Description
config.py provides the configuration parsing layer for DeepSpeed's compression training feature. It reads a nested dictionary from the DeepSpeed JSON configuration (under the compression_training key) and normalizes it into a structured output with validated parameters and applied defaults.
The module handles seven compression techniques. Six of them (every technique except layer reduction) are organized with the same two-level structure:
- shared_parameters -- Global settings for the technique (enabled flag, method, schedule offset, quantization type).
- different_groups -- Per-module-group settings that specify which model layers receive the compression and with what parameters (bit-widths, density ratios, module scopes).
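The two-level layout can be illustrated with a small sketch. It is not part of config.py: the helper name effective_settings, the group name "wq1", and the module patterns are hypothetical, and merging shared and per-group values is just one convenient way to read the structure.

```python
from typing import Any, Dict, Iterator, Tuple

# Illustrative weight-quantization section; the group name "wq1" and the
# module patterns are made up for this example.
weight_quantization = {
    "shared_parameters": {
        "enabled": True,
        "quantization_type": "symmetric",
        "schedule_offset": 0,
    },
    "different_groups": {
        "wq1": {
            "params": {"start_bits": 8, "target_bits": 4},
            "modules": ["attention.self", "intermediate.dense"],
        },
    },
}


def effective_settings(
    technique: Dict[str, Any],
) -> Iterator[Tuple[str, str, Dict[str, Any]]]:
    """Yield (group, module_pattern, settings), merging shared and group params."""
    shared = technique.get("shared_parameters", {})
    for group_name, group in technique.get("different_groups", {}).items():
        # Per-group params are layered on top of the technique-wide settings.
        merged = {**shared, **group.get("params", {})}
        for module_pattern in group.get("modules", []):
            yield group_name, module_pattern, merged


for group_name, pattern, settings in effective_settings(weight_quantization):
    print(group_name, pattern, settings["target_bits"], settings["quantization_type"])
```

Each module pattern ends up with one flat settings dict, which makes it easy to see that shared_parameters apply to every group while params vary per group.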
Supported compression techniques:
- Weight quantization -- Configures bit-width (start and target), quantization type (symmetric/asymmetric), rounding mode (nearest/stochastic), number of groups, FP16 mixed quantization, and schedule offset.
- Activation quantization -- Configures bit-width, quantization type, range calibration mode (static/dynamic), and schedule offset.
- Sparse pruning -- Configures density ratio and method (L1/TopK).
- Row pruning -- Configures density ratio and method (L1/TopK).
- Head pruning -- Configures density ratio, method (L1/TopK), and number of attention heads.
- Channel pruning -- Configures density ratio and method (L1/TopK).
- Layer reduction -- Configures layer elimination parameters.
Each getter function validates required fields (e.g., start_bits and target_bits must be specified for weight quantization groups), asserts enumeration constraints (e.g., quantization type must be 'symmetric' or 'asymmetric'), and applies default values from the constants module.
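The validate-then-default pattern described above might look roughly like the following. This is a simplified stand-in, not the vendored code: parse_shared and parse_group are hypothetical names, and the default values shown are illustrative (the real defaults come from the constants module and may differ).

```python
from typing import Any, Dict

# Illustrative defaults; the real values live in DeepSpeed's constants module
# and may differ.
SHARED_DEFAULTS = {
    "enabled": False,
    "schedule_offset": 0,
    "quantization_type": "symmetric",
    "rounding": "nearest",
}

# Enumeration constraints named in the description above.
ALLOWED = {
    "quantization_type": ("symmetric", "asymmetric"),
    "rounding": ("nearest", "stochastic"),
}


def parse_shared(shared: Dict[str, Any]) -> Dict[str, Any]:
    """Apply defaults for missing keys, then assert enumeration constraints."""
    out = {key: shared.get(key, default) for key, default in SHARED_DEFAULTS.items()}
    for key, allowed in ALLOWED.items():
        assert out[key] in allowed, f"{key} must be one of {allowed}, got {out[key]!r}"
    return out


def parse_group(name: str, group: Dict[str, Any]) -> Dict[str, Any]:
    """Check the required bit-width fields of one weight-quantization group."""
    params = group.get("params", {})
    for required in ("start_bits", "target_bits"):
        assert required in params, f"{required} must be specified for group {name!r}"
    return {"params": dict(params), "modules": list(group.get("modules", []))}


print(parse_shared({"enabled": True, "rounding": "stochastic"}))
print(parse_group("wq1", {"params": {"start_bits": 8, "target_bits": 4}}))
```

Because validation relies on assert, a malformed config fails fast at parse time rather than later during training.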
This file is vendored code from DeepSpeed (marked AUTO_KEEP).
Code Reference
| Field | Value |
|---|---|
| Repository | FlexLLMGen |
| File | benchmark/third_party/DeepSpeed/deepspeed/compression/config.py |
| Lines | 1-490 |
Key Functions:
def get_compression_config(param_dict): ...
def get_weight_quantization(param_dict): ...
def get_activation_quantization(param_dict): ...
def get_sparse_pruning(param_dict): ...
def get_row_pruning(param_dict): ...
def get_head_pruning(param_dict): ...
def get_channel_pruning(param_dict): ...
def get_layer_reduction(param_dict): ...
def get_quantize_enabled(param_dict): ...
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| param_dict | dict | Yes | Top-level DeepSpeed configuration dictionary containing optional compression_training key |
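As a rough illustration of the input, the snippet below parses a small DeepSpeed-style JSON config and pulls out the optional compression_training section. The helper compression_section is a hypothetical stand-in for the lookup the parser performs, and the values and the group name "wq1" are made up.

```python
import json

# Illustrative DeepSpeed JSON config; values and the group name are made up.
raw = """
{
  "train_batch_size": 8,
  "compression_training": {
    "weight_quantization": {
      "shared_parameters": {"enabled": true, "quantization_type": "symmetric"},
      "different_groups": {
        "wq1": {
          "params": {"start_bits": 8, "target_bits": 4},
          "modules": ["attention.self"]
        }
      }
    }
  }
}
"""


def compression_section(param_dict):
    """Return the compression_training sub-dict, or {} when the key is absent."""
    return param_dict.get("compression_training", {})


param_dict = json.loads(raw)
section = compression_section(param_dict)
print(sorted(section))
print(compression_section({"train_batch_size": 8}))
```

Treating a missing key as an empty section matches the "optional" wording in the table: a config without compression settings is still a valid input.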
Outputs
| Output | Type | Description |
|---|---|---|
| output | dict | Structured compression config with keys: weight_quantization, activation_quantization, sparse_pruning, row_pruning, head_pruning, channel_pruning, layer_reduction. Every entry except layer_reduction contains shared_parameters and different_groups sub-dicts; layer_reduction holds its own settings directly. |
Configuration Structure
{
    "weight_quantization": {
        "shared_parameters": {
            "enabled": bool, "schedule_offset": int, "quantize_groups": int,
            "quantize_verbose": bool, "quantization_type": "symmetric"|"asymmetric",
            "quantize_weight_in_forward": bool, "rounding": "nearest"|"stochastic",
            "fp16_mixed_quantize": {"enabled": bool, "quantize_change_ratio": float}
        },
        "different_groups": {
            "group_name": {
                "params": {"start_bits": int, "target_bits": int, "quantization_period": int},
                "modules": [...], "related_modules": [...]
            }
        }
    },
    "sparse_pruning": { ... },
    "row_pruning": { ... },
    "head_pruning": { ... },
    "channel_pruning": { ... },
    "activation_quantization": { ... },
    "layer_reduction": { ... }
}
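A consumer of this structure might start by checking which techniques are switched on. The sketch below is an assumption-laden example rather than code from the repository: enabled_techniques is a hypothetical helper, example_output is hand-written, and the fallback for sections without a shared_parameters level is there so the helper also tolerates an entry that stores its enabled flag at the top level.

```python
from typing import Any, Dict, List


def enabled_techniques(config: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the names of all techniques whose enabled flag is true."""
    names = []
    for name, section in config.items():
        # Fall back to the section itself when there is no shared_parameters level.
        shared = section.get("shared_parameters", section)
        if shared.get("enabled", False):
            names.append(name)
    return names


# Hand-written stand-in for a parsed compression config.
example_output = {
    "weight_quantization": {"shared_parameters": {"enabled": True}, "different_groups": {}},
    "sparse_pruning": {"shared_parameters": {"enabled": False}, "different_groups": {}},
    "layer_reduction": {"enabled": True},
}

print(enabled_techniques(example_output))  # ['weight_quantization', 'layer_reduction']
```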