Principle:FMInference FlexLLMGen Model Compression Configuration

From Leeroopedia


Sources: Paper: FlexGen, DeepSpeed Compression Documentation
Domains: Model_Compression, Configuration_Management
Last Updated: 2026-02-09 00:00 GMT

Overview

A declarative configuration scheme for specifying model compression techniques (quantization, pruning, layer reduction) in DeepSpeed, using a two-level structure that separates global shared parameters from per-module-group settings.

Description

Model compression configuration provides a structured way to declare which compression techniques to apply to which parts of a model, enabling fine-grained control over the compression-accuracy tradeoff. The configuration is embedded in the standard DeepSpeed JSON config file under a compression_training section.

Key design principles:

  • Shared/group hierarchy -- Each compression technique has shared_parameters (global settings like method and schedule) and different_groups (per-layer-group settings like bit-widths and density ratios). This avoids repetition while allowing different model sections to use different compression parameters. For example, attention layers might use 8-bit quantization while MLP layers use 4-bit.
  • Module scope targeting -- Each group specifies a module_scope (list of module name patterns) to select which layers receive the compression. A related_module_scope specifies layers that must be co-modified (e.g., when pruning rows in one layer, the corresponding columns in the next layer must also be pruned).
  • Schedule-based introduction -- Each technique has a schedule_offset parameter that delays its activation until a specific training step. This enables progressive compression: the model first trains at full precision, then compression is introduced after the model has learned good initial representations.
  • Validation at parse time -- The configuration parser validates enumeration values (symmetric vs. asymmetric, L1 vs. TopK), required fields (start_bits and target_bits for quantization), and logical constraints (head pruning requires num_heads). This catches misconfigurations early rather than during training.
  • Default-based simplicity -- Most parameters have sensible default values, so common use cases need only minimal configuration. Only the enabled flag and the technique-specific required parameters (such as the target bit-width) must be set explicitly.
  • Orthogonal composition -- Multiple compression techniques can be enabled simultaneously. Weight quantization, activation quantization, sparse pruning, row pruning, head pruning, and channel pruning are configured independently and applied in a defined order during the forward pass.
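The shared/group hierarchy, module scope targeting, and parse-time validation described above can be sketched in a few lines of Python. This is a minimal illustration, not DeepSpeed's parser: the key names params and module_scope, the group names, the substring matching rule, and both helper functions are assumptions made for this example, and the key names in a real DeepSpeed config file may differ.

```python
# Illustrative config: key names are assumptions modeled on the description above.
config = {
    "compression_training": {
        "weight_quantization": {
            # Global settings shared by every group of this technique.
            "shared_parameters": {
                "enabled": True,
                "quantization_type": "symmetric",
                "schedule_offset": 1000,
            },
            # Per-group settings: each group targets its own set of modules.
            "different_groups": {
                "attention": {
                    "params": {"start_bits": 16, "target_bits": 8},
                    "module_scope": ["attention"],
                },
                "mlp": {
                    "params": {"start_bits": 16, "target_bits": 4},
                    "module_scope": ["mlp"],
                },
            },
        }
    }
}

VALID_QUANT_TYPES = {"symmetric", "asymmetric"}


def validate_weight_quantization(section):
    """Return a list of error strings; an empty list means the section passed."""
    errors = []
    shared = section.get("shared_parameters", {})
    if shared.get("quantization_type", "symmetric") not in VALID_QUANT_TYPES:
        errors.append("quantization_type must be symmetric or asymmetric")
    for name, group in section.get("different_groups", {}).items():
        params = group.get("params", {})
        for key in ("start_bits", "target_bits"):
            if key not in params:
                errors.append(f"group {name!r} is missing {key}")
    return errors


def groups_for_module(section, module_name):
    """Names of groups with a module_scope pattern that occurs in module_name."""
    return [
        name
        for name, group in section["different_groups"].items()
        if any(pattern in module_name for pattern in group.get("module_scope", []))
    ]


wq = config["compression_training"]["weight_quantization"]
errors = validate_weight_quantization(wq)
matched = groups_for_module(wq, "decoder.layers.0.mlp.fc1")
```

Returning a list of errors rather than raising on the first one lets a caller report every misconfiguration in a single pass, which fits the goal of catching problems before training starts.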

Usage

Use model compression configuration to declare the desired compression strategy in the DeepSpeed JSON config file. The configuration is parsed at initialization time and used by the compression-aware layer replacements to set up the appropriate masks, quantizers, and schedules.
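Parsing at initialization time typically means overlaying the user's settings on a table of defaults. The sketch below shows that step for the shared parameters of one technique; the default values and the resolve_shared helper are hypothetical and stand in for whatever the installed DeepSpeed version actually uses.

```python
import copy

# Hypothetical defaults; real values depend on the DeepSpeed version in use.
SHARED_DEFAULTS = {
    "enabled": False,
    "schedule_offset": 0,
    "quantization_type": "symmetric",
}


def resolve_shared(user_section):
    """Overlay user-supplied shared_parameters on top of the defaults."""
    resolved = copy.deepcopy(SHARED_DEFAULTS)
    resolved.update(user_section.get("shared_parameters", {}))
    return resolved


# A minimal user config only flips the enabled flag.
minimal = {"shared_parameters": {"enabled": True}}
resolved = resolve_shared(minimal)
```

Copying the defaults before updating keeps the shared table unchanged, so resolving one technique cannot leak settings into another.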

Common configuration patterns:

  • Quantization-only -- Enable weight quantization with a 4-bit target, symmetric mode, and a 1000-step schedule offset.
  • Pruning + quantization -- Enable sparse pruning (TopK, 50% density) and weight quantization (8-bit) on different groups of layers.
  • Head pruning -- Enable head pruning on attention output matrices to reduce the number of active attention heads.
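The three patterns above might be written as follows, here built as Python dictionaries and serialized to JSON. The technique and group helpers, the group names, the module patterns, the start_bits values, the dense_ratio key, and num_heads = 12 are all assumptions chosen for illustration; check the exact key names against the DeepSpeed compression documentation for your version.

```python
import json


def technique(shared, groups):
    """Wrap one technique in the two-level shared/group layout."""
    return {"shared_parameters": {"enabled": True, **shared}, "different_groups": groups}


def group(params, module_scope):
    """One entry of different_groups: its parameters and the modules it targets."""
    return {"params": params, "module_scope": module_scope}


# Pattern 1: weight quantization only, 4-bit target, symmetric, 1000-step offset.
quantization_only = {
    "weight_quantization": technique(
        {"quantization_type": "symmetric", "schedule_offset": 1000},
        {"wq_all": group({"start_bits": 8, "target_bits": 4}, ["attention", "mlp"])},
    )
}

# Pattern 2: TopK sparse pruning at 50% density on one group of layers,
# 8-bit weight quantization on another.
pruning_plus_quantization = {
    "sparse_pruning": technique(
        {"method": "topk"},
        {"sp_mlp": group({"dense_ratio": 0.5}, ["mlp"])},
    ),
    "weight_quantization": technique(
        {"quantization_type": "symmetric"},
        {"wq_attn": group({"start_bits": 8, "target_bits": 8}, ["attention"])},
    ),
}

# Pattern 3: head pruning on attention output matrices; num_heads is required.
head_pruning = {
    "head_pruning": technique(
        {"method": "topk", "num_heads": 12},
        {"hp_out": group({"dense_ratio": 0.5}, ["attention.output"])},
    )
}

# Embed one pattern in a larger config under the compression_training section.
ds_config = {"train_batch_size": 8, "compression_training": pruning_plus_quantization}
config_text = json.dumps(ds_config, indent=2)
```

Because each technique sits under its own key, the second pattern combines pruning and quantization without either entry needing to know about the other.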

Theoretical Basis

The two-level configuration structure reflects the observation that compression techniques often need to be applied heterogeneously across model layers. Commonly cited observations include:

  • Not all layers are equally compressible -- Early and final layers are often more sensitive to compression than middle layers.
  • Different tensor types need different settings -- Attention and MLP weights can tolerate different amounts of precision loss, so a single bit-width is rarely the best choice for both.
  • Related layers must be co-modified -- Structural pruning (rows, heads, channels) creates dependencies between adjacent layers that must be coordinated.
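The co-modification requirement can be seen in a small worked example. The sketch below prunes rows of a first weight matrix by L1 norm and removes the matching columns of the next matrix, using plain Python lists. The function names and the toy weights are made up for illustration, and the sketch removes rows outright for clarity, whereas a training-time implementation might use masks instead.

```python
def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def relu(v):
    return [max(0.0, a) for a in v]


def keep_rows_by_l1(w, dense_ratio):
    """Indices of the rows with the largest L1 norm, in original order."""
    n_keep = max(1, round(len(w) * dense_ratio))
    ranked = sorted(range(len(w)), key=lambda i: sum(abs(a) for a in w[i]), reverse=True)
    return sorted(ranked[:n_keep])


def prune_pair(w1, w2, dense_ratio):
    """Prune rows of w1 and the corresponding columns of the following layer w2."""
    keep = keep_rows_by_l1(w1, dense_ratio)
    w1_pruned = [w1[i] for i in keep]
    w2_pruned = [[row[i] for i in keep] for row in w2]
    return w1_pruned, w2_pruned, keep


# Toy two-layer block: 2 inputs -> 4 hidden units -> 2 outputs.
w1 = [[1.0, -2.0], [0.1, 0.0], [3.0, 1.0], [0.0, 0.2]]
w2 = [[1.0, 5.0, -1.0, 7.0], [0.5, 2.0, 2.0, -3.0]]

w1_pruned, w2_pruned, keep = prune_pair(w1, w2, dense_ratio=0.5)
out = matvec(w2_pruned, relu(matvec(w1_pruned, [1.0, 1.0])))
```

Pruning w1 alone would leave w2 expecting four inputs while receiving two, which is the kind of shape mismatch that a related module scope is meant to prevent.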

The schedule-based introduction is motivated by progressive compression research and, more loosely, by the lottery ticket hypothesis. The underlying observation is that models compressed gradually during training often reach better accuracy than models trained under full compression from the start.
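A delayed, gradual schedule of this kind can be sketched as a function from training step to effective bit-width. The one-bit-per-period reduction rule and the quantization_period parameter are assumptions chosen to keep the example simple; they are not taken from the source and may not match the rule DeepSpeed applies.

```python
def bits_at_step(step, start_bits, target_bits, schedule_offset, quantization_period):
    """Effective bit-width at a training step; None means full precision."""
    if step < schedule_offset:
        # Compression has not been introduced yet.
        return None
    # Assumed rule: drop one bit every quantization_period steps after the offset.
    reductions = (step - schedule_offset) // quantization_period
    return max(target_bits, start_bits - reductions)


# Full precision for 1000 steps, then 8 bits reduced to 4 over 400 steps.
schedule = [
    bits_at_step(s, start_bits=8, target_bits=4, schedule_offset=1000, quantization_period=100)
    for s in (0, 999, 1000, 1100, 1400, 5000)
]
```

Clamping at target_bits means the schedule settles at the final precision and stays there for the remainder of training.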
