Principle: VainF Torch-Pruning Group Norm Regularization
Overview
An importance-adaptive group-level regularization method that drives low-importance channels toward zero during training.
Description
Group Norm Regularization extends traditional sparsity regularization to operate at the dependency group level. Instead of applying uniform regularization, it uses an importance-adaptive scaling factor:
\[
\gamma = \alpha^{\frac{\text{imp\_max} - \text{imp}}{\text{imp\_max} - \text{imp\_min}}}
\]
that applies stronger regularization to less important channels. This creates a natural sparsity pattern aligned with the dependency graph structure, making subsequent structural pruning more effective.
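To make the scaling concrete, here is a small sketch that evaluates this factor on an invented importance vector (the scores and the choice of alpha = 4 are illustrative assumptions, not values taken from any particular model):

```python
# Toy illustration of the importance-adaptive scaling factor
# gamma = alpha ** ((imp_max - imp) / (imp_max - imp_min)).
# The importance scores below are made up for demonstration.
imps = [0.1, 0.4, 0.7, 1.0]   # per-channel importance scores
alpha = 4.0                   # regularization range parameter
imp_max, imp_min = max(imps), min(imps)

gammas = [alpha ** ((imp_max - imp) / (imp_max - imp_min)) for imp in imps]
# The least important channel (0.1) gets gamma = alpha = 4.0,
# while the most important channel (1.0) gets gamma = alpha**0 = 1.0.
print(gammas)
```

The factor decreases monotonically with importance, so regularization pressure concentrates on the channels that matter least.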
The method modifies weight gradients in place during training, rather than adding an explicit penalty term to the loss.
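A minimal sketch of this in-place update, using plain Python lists in place of real PyTorch tensors (the function name and parameter names `reg` and `alpha` are illustrative assumptions, not the library's actual code):

```python
def regularize_group(weights, grads, imps, reg=1e-4, alpha=4.0):
    """Add importance-adaptive L2 pressure directly to existing gradients.

    weights, grads: per-channel weight values and their current gradients
    imps:           per-channel importance scores for the group
    Mirrors the in-place update grad += reg * gamma * weight that the
    regularizer applies to weight.grad.data (illustrative sketch only).
    """
    imp_max, imp_min = max(imps), min(imps)
    for i, imp in enumerate(imps):
        # Less important channels receive a larger gamma, hence stronger decay.
        gamma = alpha ** ((imp_max - imp) / (imp_max - imp_min))
        grads[i] += reg * gamma * weights[i]
    return grads

g = regularize_group([1.0, -2.0], [0.0, 0.0], [0.2, 0.8])
```

Because only the gradients are touched, the training loss itself is unchanged; the sparsity pressure rides along with whatever optimizer step follows.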
Key characteristics of Group Norm Regularization:
- Group-level operation: Regularization is applied across entire dependency groups, not individual layers, ensuring consistency across structurally coupled parameters.
- Importance-adaptive scaling: The exponential scaling factor `gamma` ensures that channels with lower importance scores receive stronger regularization pressure.
- In-place gradient modification: The regularizer operates by directly modifying `weight.grad.data`, avoiding any changes to the loss function itself.
- Covers multiple layer types: Handles output channels (Conv, Linear), input channels, and BatchNorm layers within each group.
Usage
Use during the sparse training phase before pruning. Best suited for CNN architectures where group-level sparsity patterns need to be learned jointly. This is the primary regularization method of DepGraph (CVPR 2023).
The typical workflow is:
- Initialize `GroupNormPruner` with a model and importance estimator.
- During training, call `update_regularizer()` at the start of each epoch to refresh the group structure.
- After each `loss.backward()`, call `regularize(model)` to inject adaptive regularization into the gradients.
- After sparse training completes, call `step()` to perform the actual structural pruning.
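The workflow above can be simulated end to end with a self-contained toy class (pure Python, no torch_pruning dependency; the class name, importance scores, and thresholds are invented for illustration and the methods only mimic the shape of the real API):

```python
class ToyGroupNormPruner:
    """Illustrative stand-in for the sparse-training workflow; not the real library."""

    def __init__(self, weights, imps, reg=0.05, alpha=4.0):
        self.weights, self.imps = weights, imps
        self.reg, self.alpha = reg, alpha
        self.gammas = None

    def update_regularizer(self):
        # Refresh per-channel scaling factors (stands in for rebuilding groups).
        imp_max, imp_min = max(self.imps), min(self.imps)
        self.gammas = [self.alpha ** ((imp_max - i) / (imp_max - imp_min))
                       for i in self.imps]

    def regularize(self):
        # Called after loss.backward(); here the optimizer step is folded in,
        # so each weight shrinks toward zero in proportion to reg * gamma.
        for i, gamma in enumerate(self.gammas):
            self.weights[i] -= self.reg * gamma * self.weights[i]

    def step(self, threshold=0.5):
        # Structural pruning: keep only channels not driven near zero.
        return [w for w in self.weights if abs(w) > threshold]

pruner = ToyGroupNormPruner(weights=[1.0, 1.0, 1.0], imps=[0.1, 0.5, 0.9])
for epoch in range(10):
    pruner.update_regularizer()
    # (forward pass and loss.backward() would go here)
    pruner.regularize()
remaining = pruner.step()
```

After ten simulated epochs the low-importance channels have decayed well below the threshold while the high-importance channel survives, which is exactly the sparsity pattern the method is designed to produce.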
Theoretical Basis
For each group g with importance scores imp, compute the adaptive factor:
\[
\gamma = \alpha^{\frac{\text{imp\_max} - \text{imp}}{\text{imp\_max} - \text{imp\_min}}}
\]
The gradient is then modified in place as:
\[
w.\text{grad} \leftarrow w.\text{grad} + \lambda \cdot \gamma \cdot w
\]
where \(\lambda\) is the base regularization coefficient.
This formulation has the following properties:
- When \(\text{imp} = \text{imp\_max}\), \(\gamma = \alpha^{0} = 1\) (minimal regularization).
- When \(\text{imp} = \text{imp\_min}\), \(\gamma = \alpha^{1} = \alpha\) (maximum regularization).
- The parameter \(\alpha\) controls the regularization range, defaulting to 4 (i.e., \(\gamma\) scales from 1 to 4).
This drives low-importance channels to zero faster than high-importance ones, producing a clean sparsity structure aligned with the dependency graph.