Principle:FMInference FlexLLMGen Model Compression Configuration
| Field | Value |
|---|---|
| Sources | Paper: FlexGen, DeepSpeed Compression Documentation |
| Domains | Model_Compression, Configuration_Management |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A declarative configuration scheme for specifying model compression techniques (quantization, pruning, layer reduction) in DeepSpeed, using a two-level structure that separates global shared parameters from per-module-group settings.
Description
Model compression configuration provides a structured way to declare which compression techniques to apply to which parts of a model, enabling fine-grained control over the compression-accuracy tradeoff. The configuration is embedded in the standard DeepSpeed JSON config file under a compression_training section.
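As a rough illustration of where the section sits, the sketch below builds a DeepSpeed-style config as a Python dict and serializes it to JSON. The key names inside `compression_training` (`weight_quantization`, `params`, `modules`) follow DeepSpeed's published compression examples as best recalled and should be checked against the installed version. In particular, writing the module scope under the key `modules` is an assumption. The group name `wq_all`, the module patterns, and the batch size are placeholders.

```python
import json

# Sketch of a DeepSpeed-style config. Values are placeholders, and key names
# should be verified against the DeepSpeed version in use.
ds_config = {
    "train_batch_size": 32,  # ordinary (non-compression) DeepSpeed setting
    "compression_training": {
        "weight_quantization": {
            # Level 1: settings shared by every group of this technique.
            "shared_parameters": {
                "enabled": True,
                "schedule_offset": 0,
                "quantization_type": "symmetric",
            },
            # Level 2: per-group settings plus the modules they apply to.
            "different_groups": {
                "wq_all": {  # hypothetical group name
                    "params": {"start_bits": 8, "target_bits": 8},
                    "modules": ["attention", "mlp"],  # assumed name patterns
                },
            },
        },
    },
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

The resulting JSON text is what would be saved as the DeepSpeed config file; the compression section lives alongside the usual training settings rather than in a separate file.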
Key design principles:
- Shared/group hierarchy -- Each compression technique has shared_parameters (global settings like method and schedule) and different_groups (per-layer-group settings like bit-widths and density ratios). This avoids repetition while allowing different model sections to use different compression parameters. For example, attention layers might use 8-bit quantization while MLP layers use 4-bit.
- Module scope targeting -- Each group specifies a module_scope (list of module name patterns) to select which layers receive the compression. A related_module_scope specifies layers that must be co-modified (e.g., when pruning rows in one layer, the corresponding columns in the next layer must also be pruned).
- Schedule-based introduction -- Each technique has a schedule_offset parameter that delays its activation until a specific training step. This enables progressive compression: the model first trains at full precision, then compression is introduced after the model has learned good initial representations.
- Validation at parse time -- The configuration parser validates enumeration values (symmetric vs. asymmetric, L1 vs. TopK), required fields (start_bits and target_bits for quantization), and logical constraints (head pruning requires num_heads). This catches misconfigurations early rather than during training.
- Default-based simplicity -- Most parameters have sensible defaults, so common use cases need little configuration. Only the enabled flag and technique-specific required parameters (such as the target bit-width) must be set explicitly.
- Orthogonal composition -- Multiple compression techniques can be enabled simultaneously. Weight quantization, activation quantization, sparse pruning, row pruning, head pruning, and channel pruning are configured independently and applied in a defined order during the forward pass.
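The defaulting, validation, and schedule-gating principles above can be sketched in a few lines. This is a hypothetical parser written for illustration, not DeepSpeed's actual implementation: the function names `parse_weight_quantization` and `is_active`, the default values, and the error messages are all assumptions.

```python
# Hypothetical parser sketch. DeepSpeed's real parser and its default values
# may differ.
SHARED_DEFAULTS = {
    "enabled": False,
    "schedule_offset": 0,
    "quantization_type": "symmetric",
}
VALID_QUANTIZATION_TYPES = {"symmetric", "asymmetric"}
REQUIRED_GROUP_PARAMS = ("start_bits", "target_bits")


def parse_weight_quantization(section):
    """Merge defaults into the shared block and validate each group."""
    shared = {**SHARED_DEFAULTS, **section.get("shared_parameters", {})}
    if shared["quantization_type"] not in VALID_QUANTIZATION_TYPES:
        raise ValueError(
            f"unknown quantization_type: {shared['quantization_type']!r}"
        )

    groups = section.get("different_groups", {})
    if shared["enabled"]:
        for name, group in groups.items():
            params = group.get("params", {})
            missing = [k for k in REQUIRED_GROUP_PARAMS if k not in params]
            if missing:
                raise ValueError(f"group {name!r} is missing {missing}")
    return {"shared_parameters": shared, "different_groups": groups}


def is_active(shared, step):
    """A technique is applied only once training reaches schedule_offset."""
    return shared["enabled"] and step >= shared["schedule_offset"]


parsed = parse_weight_quantization({
    "shared_parameters": {"enabled": True, "schedule_offset": 1000},
    "different_groups": {
        "attn": {"params": {"start_bits": 8, "target_bits": 8},
                 "modules": ["attention"]},
        "mlp": {"params": {"start_bits": 8, "target_bits": 4},
                "modules": ["mlp"]},
    },
})
print(parsed["shared_parameters"]["quantization_type"])  # filled from defaults
print(is_active(parsed["shared_parameters"], 999))
print(is_active(parsed["shared_parameters"], 1000))
```

Raising at parse time means a misspelled enumeration value or a missing bit-width surfaces before any training step runs, which is the behavior the validation principle describes.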
Usage
Use model compression configuration to declare the desired compression strategy in the DeepSpeed JSON config file. The configuration is parsed at initialization time and used by the compression-aware layer replacements to set up the appropriate masks, quantizers, and schedules.
Common configuration patterns:
- Quantization-only -- Enable weight quantization with 4-bit target, symmetric mode, and 1000-step offset.
- Pruning + quantization -- Enable sparse pruning (TopK, 50% density) and weight quantization (8-bit) on different groups of layers.
- Head pruning -- Enable head pruning on attention output matrices to reduce the number of active attention heads.
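The three patterns above might look roughly like the following `compression_training` bodies, written as Python dicts. Key names (`sparse_pruning`, `head_pruning`, `method`, `dense_ratio`, `num_heads`, `modules`) follow DeepSpeed's documented examples as best recalled and should be verified. The group names, module patterns, `start_bits` values, and the head count of 12 are placeholders that the patterns themselves do not specify.

```python
import json

# Key names follow DeepSpeed's documented examples as best recalled. Group
# names, module patterns, start_bits, and num_heads are illustrative only.

# Pattern 1: quantization only (4-bit target, symmetric, starts at step 1000).
quantization_only = {
    "weight_quantization": {
        "shared_parameters": {
            "enabled": True,
            "quantization_type": "symmetric",
            "schedule_offset": 1000,
        },
        "different_groups": {
            "wq1": {
                "params": {"start_bits": 8, "target_bits": 4},
                "modules": ["attention", "mlp"],
            },
        },
    },
}

# Pattern 2: sparse pruning and 8-bit quantization on different layer groups.
pruning_plus_quantization = {
    "sparse_pruning": {
        "shared_parameters": {"enabled": True, "method": "topk"},
        "different_groups": {
            "sp1": {"params": {"dense_ratio": 0.5}, "modules": ["mlp"]},
        },
    },
    "weight_quantization": {
        "shared_parameters": {"enabled": True},
        "different_groups": {
            "wq1": {
                "params": {"start_bits": 8, "target_bits": 8},
                "modules": ["attention"],
            },
        },
    },
}

# Pattern 3: head pruning on the attention output projection.
head_pruning = {
    "head_pruning": {
        "shared_parameters": {"enabled": True, "method": "topk", "num_heads": 12},
        "different_groups": {
            "hp1": {
                "params": {"dense_ratio": 0.5},
                "modules": ["attention.output.dense"],
            },
        },
    },
}

patterns = {
    "quantization_only": quantization_only,
    "pruning_plus_quantization": pruning_plus_quantization,
    "head_pruning": head_pruning,
}
for name, body in patterns.items():
    # Each body would sit under "compression_training" in the JSON config.
    wrapped = json.dumps({"compression_training": body})
    print(name, sorted(body), len(wrapped))
```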
Theoretical Basis
The two-level configuration structure reflects the observation that compression techniques often need to be applied heterogeneously across model layers. The compression literature commonly reports that:
- Not all layers are equally compressible -- Early and final layers are often more sensitive to compression than middle layers.
- Different tensor types need different settings -- Attention and MLP weights can differ in how much precision they tolerate, so one bit-width need not suit both.
- Related layers must be co-modified -- Structural pruning (rows, heads, channels) creates dependencies between adjacent layers that must be coordinated.
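The co-modification requirement can be checked with a small worked example. The sketch below uses two stacked linear maps without biases or activations (a simplifying assumption) and plain Python lists. Removing a row of the first matrix only yields a valid network if the matching column of the second matrix is removed as well. All values are made up for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


W1 = [[1.0, 2.0], [0.1, 0.0], [3.0, -1.0]]  # 3 outputs, 2 inputs
W2 = [[1.0, 5.0, 2.0], [0.0, 7.0, 1.0]]     # consumes the 3 outputs of W1
x = [1.0, 1.0]

keep = [0, 2]  # rows of W1 to keep (row 1 is pruned)

# Masked reference: zero the pruned row but keep the original shapes.
W1_masked = [row if i in keep else [0.0] * len(row) for i, row in enumerate(W1)]
reference = matvec(W2, matvec(W1_masked, x))

# Structural pruning: drop row 1 of W1 AND column 1 of W2 together.
W1_pruned = [W1[i] for i in keep]
W2_pruned = [[row[j] for j in keep] for row in W2]
compact = matvec(W2_pruned, matvec(W1_pruned, x))

print(reference, compact)
```

Both computations give the same output, but only the second actually shrinks the matrices. Dropping the row of `W1` without the column of `W2` would leave a shape mismatch, which is why the related layers have to be declared in the configuration.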
The schedule-based introduction is motivated by the lottery ticket hypothesis and progressive compression research, which suggest that models trained with gradually increasing compression tend to retain more accuracy than models compressed fully from the start.
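One way to picture a progressive schedule is a function from training step to active bit-width. The sketch below is an illustrative linear schedule, not DeepSpeed's exact rule: the function `bits_at_step`, the `period` parameter, and the one-bit-per-period reduction are assumptions chosen for clarity.

```python
def bits_at_step(step, start_bits, target_bits, schedule_offset, period):
    """Return the active bit-width at a step, or None before activation.

    Illustrative rule: after schedule_offset, drop one bit every `period`
    steps until target_bits is reached.
    """
    if step < schedule_offset:
        return None  # quantization not active yet: full precision
    reductions = (step - schedule_offset) // period
    return max(target_bits, start_bits - reductions)


for step in (0, 1000, 1100, 1400, 5000):
    print(step, bits_at_step(step, start_bits=8, target_bits=4,
                             schedule_offset=1000, period=100))
```

With these placeholder settings the model trains at full precision for the first 1000 steps, starts at 8 bits, and settles at 4 bits from step 1400 onward.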