Principle: VainF Torch-Pruning Group Norm Regularization
Overview
An importance-adaptive group-level regularization method that drives low-importance channels toward zero during training.
Description
Group Norm Regularization extends traditional sparsity regularization to operate at the dependency group level. Instead of applying uniform regularization, it uses an importance-adaptive scaling factor:
\[
\gamma = \alpha^{\frac{\text{imp\_max} - \text{imp}}{\text{imp\_max} - \text{imp\_min}}}
\]
that applies stronger regularization to less important channels. This creates a natural sparsity pattern aligned with the dependency graph structure, making subsequent structural pruning more effective.
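To make the scaling concrete, here is a small sketch that evaluates this factor on an invented importance vector (the scores and the choice of alpha = 4 are illustrative assumptions, not values taken from any particular model):

```python
# Toy illustration of the importance-adaptive scaling factor
# gamma = alpha ** ((imp_max - imp) / (imp_max - imp_min)).
# The importance scores below are made up for demonstration.
imps = [0.1, 0.4, 0.7, 1.0]   # per-channel importance scores
alpha = 4.0                   # regularization range parameter
imp_max, imp_min = max(imps), min(imps)

gammas = [alpha ** ((imp_max - imp) / (imp_max - imp_min)) for imp in imps]
# The least important channel (0.1) gets gamma = alpha = 4.0,
# while the most important channel (1.0) gets gamma = alpha**0 = 1.0.
print(gammas)
```

The factor decreases monotonically with importance, so regularization pressure concentrates on the channels that matter least.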
The method modifies weight gradients in place during training, rather than adding an explicit penalty term to the loss.
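A minimal sketch of this in-place update, using plain Python lists in place of real PyTorch tensors (the function name and parameter names `reg` and `alpha` are illustrative assumptions, not the library's actual code):

```python
def regularize_group(weights, grads, imps, reg=1e-4, alpha=4.0):
    """Add importance-adaptive L2 pressure directly to existing gradients.

    weights, grads: per-channel weight values and their current gradients
    imps:           per-channel importance scores for the group
    Mirrors the in-place update grad += reg * gamma * weight that the
    regularizer applies to weight.grad.data (illustrative sketch only).
    """
    imp_max, imp_min = max(imps), min(imps)
    for i, imp in enumerate(imps):
        # Less important channels receive a larger gamma, hence stronger decay.
        gamma = alpha ** ((imp_max - imp) / (imp_max - imp_min))
        grads[i] += reg * gamma * weights[i]
    return grads

g = regularize_group([1.0, -2.0], [0.0, 0.0], [0.2, 0.8])
```

Because only the gradients are touched, the training loss itself is unchanged; the sparsity pressure rides along with whatever optimizer step follows.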
Key characteristics of Group Norm Regularization:
- Group-level operation: Regularization is applied across entire dependency groups, not individual layers, ensuring consistency across structurally coupled parameters.
- Importance-adaptive scaling: The exponential scaling factor `gamma` ensures that channels with lower importance scores receive stronger regularization pressure.
- In-place gradient modification: The regularizer operates by directly modifying `weight.grad.data`, avoiding any changes to the loss function itself.
- Covers multiple layer types: Handles output channels (Conv, Linear), input channels, and BatchNorm layers within each group.
Usage
Use during the sparse training phase before pruning. Best suited for CNN architectures where group-level sparsity patterns need to be learned jointly. This is the primary regularization method of DepGraph (CVPR 2023).
The typical workflow is:
- Initialize `GroupNormPruner` with a model and importance estimator.
- During training, call `update_regularizer()` at the start of each epoch to refresh the group structure.
- After each `loss.backward()`, call `regularize(model)` to inject adaptive regularization into the gradients.
- After sparse training completes, call `step()` to perform the actual structural pruning.
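The workflow above can be simulated end to end with a self-contained toy class (pure Python, no torch_pruning dependency; the class name, importance scores, and thresholds are invented for illustration and the methods only mimic the shape of the real API):

```python
class ToyGroupNormPruner:
    """Illustrative stand-in for the sparse-training workflow; not the real library."""

    def __init__(self, weights, imps, reg=0.05, alpha=4.0):
        self.weights, self.imps = weights, imps
        self.reg, self.alpha = reg, alpha
        self.gammas = None

    def update_regularizer(self):
        # Refresh per-channel scaling factors (stands in for rebuilding groups).
        imp_max, imp_min = max(self.imps), min(self.imps)
        self.gammas = [self.alpha ** ((imp_max - i) / (imp_max - imp_min))
                       for i in self.imps]

    def regularize(self):
        # Called after loss.backward(); here the optimizer step is folded in,
        # so each weight shrinks toward zero in proportion to reg * gamma.
        for i, gamma in enumerate(self.gammas):
            self.weights[i] -= self.reg * gamma * self.weights[i]

    def step(self, threshold=0.5):
        # Structural pruning: keep only channels not driven near zero.
        return [w for w in self.weights if abs(w) > threshold]

pruner = ToyGroupNormPruner(weights=[1.0, 1.0, 1.0], imps=[0.1, 0.5, 0.9])
for epoch in range(10):
    pruner.update_regularizer()
    # (forward pass and loss.backward() would go here)
    pruner.regularize()
remaining = pruner.step()
```

After ten simulated epochs the low-importance channels have decayed well below the threshold while the high-importance channel survives, which is exactly the sparsity pattern the method is designed to produce.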
Theoretical Basis
For each group g with importance scores imp, compute the adaptive factor:
\[
\gamma = \alpha^{\frac{\text{imp\_max} - \text{imp}}{\text{imp\_max} - \text{imp\_min}}}
\]
The gradient is then modified in place as:
\[
w.\text{grad} \leftarrow w.\text{grad} + \lambda \cdot \gamma \cdot w
\]
where \(\lambda\) is the base regularization coefficient.
This formulation has the following properties:
- When \(\text{imp} = \text{imp\_max}\), \(\gamma = \alpha^{0} = 1\) (minimal regularization).
- When \(\text{imp} = \text{imp\_min}\), \(\gamma = \alpha^{1} = \alpha\) (maximum regularization).
- The parameter \(\alpha\) controls the regularization range, defaulting to 4 (i.e., \(\gamma\) scales from 1 to 4).
This drives low-importance channels to zero faster than high-importance ones, producing a clean sparsity structure aligned with the dependency graph.