
Principle:VainF Torch Pruning BN Scale Regularization

From Leeroopedia


Overview

Network Slimming applies L1 regularization to the scaling factors of batch normalization layers to induce channel-level sparsity.

Description

Batch normalization layers have a learnable scaling factor γ for each channel. By adding L1 regularization on these γ values during training, channels with small γ can be identified as unimportant and removed. This approach is elegant because BN layers already exist in most modern CNN architectures, requiring no additional parameters or architectural changes.
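In PyTorch, the per-channel scaling factor γ is exposed as the `weight` attribute of a BatchNorm module (and the shift β as `bias`), so the method needs no extra parameters. A minimal illustration:

```python
import torch.nn as nn

# An affine BatchNorm layer over 16 channels.
bn = nn.BatchNorm2d(16)

# `bn.weight` is the learnable per-channel scale gamma, one value per channel;
# Network Slimming applies the L1 penalty to exactly this tensor.
assert tuple(bn.weight.shape) == (16,)  # gamma has shape (16,)
```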

Key aspects of BN Scale Regularization:

  • Leverages existing BN parameters: No new learnable parameters are needed; the method repurposes the BatchNorm scaling factor γ as an importance indicator.
  • L1 sparsity penalty: The sign-based gradient update grad += reg * sign(weight) drives small scaling factors toward exactly zero.
  • Extended to group level: In Torch-Pruning, BN scales are aggregated across dependency groups to ensure structural consistency when pruning coupled layers.
  • Group lasso variant: An optional group lasso mode replaces the L1 penalty with an L2-based group penalty, regularizing BN weights proportionally to the inverse of the group L2 norm: grad += reg * (1 / ||group||_2) * weight.
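The sign-based L1 update above can be sketched in plain PyTorch as a helper called between `loss.backward()` and `optimizer.step()`. This is a minimal sketch, not Torch-Pruning's API; the function name and the default `reg` value are illustrative:

```python
import torch
import torch.nn as nn

def regularize_bn_scales(model: nn.Module, reg: float = 1e-4) -> None:
    """Add the L1 subgradient reg * sign(gamma) to every BN scale gradient.

    Call after loss.backward() and before optimizer.step(); `reg` plays the
    role of the sparsity strength lambda.
    """
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            if m.weight.grad is not None:
                # grad += reg * sign(weight): drives small gammas toward zero.
                m.weight.grad.add_(reg * torch.sign(m.weight.detach()))
```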

Usage

Use for CNNs with BatchNorm layers when you want a simple, well-studied regularization approach. This is the classic Network Slimming method.

Recommended scenarios:

  • Architectures that already contain BatchNorm1d, BatchNorm2d, or BatchNorm3d layers.
  • When a lightweight, easy-to-implement sparsity-inducing technique is desired.
  • When group-level consistency across coupled layers is needed; the group_lasso variant regularizes BN scales jointly across dependency groups.

Theoretical Basis

The total loss with BN scale regularization is:

L_total = L_task + λ · Σ_i |γ_i|

The gradient modification for standard L1 mode:

grad_γ += λ · sign(γ)

After training, channels whose |γ| values have been driven toward zero are identified as unimportant and pruned.
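The selection step can be sketched as ranking channels by |γ| and taking the smallest fraction. A minimal sketch; the function name and the 0.5 default ratio are illustrative, not part of Torch-Pruning:

```python
import torch
import torch.nn as nn

def small_gamma_channels(bn: nn.modules.batchnorm._BatchNorm,
                         ratio: float = 0.5) -> list:
    """Return indices of the `ratio` fraction of channels with smallest |gamma|."""
    gamma = bn.weight.detach().abs()
    n_prune = int(gamma.numel() * ratio)
    # Channels sorted by |gamma|, ascending; the head of the list is pruned.
    return torch.argsort(gamma)[:n_prune].tolist()
```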

For the group lasso variant, the gradient modification becomes:

grad_γ += λ · γ / ‖γ_group‖₂

This encourages entire groups to shrink together, producing a cleaner structural sparsity pattern that respects layer dependencies.
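The group lasso gradient for one dependency group can be sketched as follows; this is an illustrative standalone helper, not Torch-Pruning's internal implementation, and the epsilon clamp is an assumption added for numerical safety:

```python
import torch

def group_lasso_grad(gamma: torch.Tensor, reg: float) -> torch.Tensor:
    """Gradient contribution reg * gamma / ||gamma||_2 for one dependency group.

    Scaling each gamma by the inverse group L2 norm shrinks the whole group
    together, so coupled channels reach zero jointly.
    """
    norm = gamma.norm(p=2).clamp_min(1e-12)  # avoid division by zero
    return reg * gamma / norm
```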

Related Pages

  • Principle
  • Implementation
  • Heuristic
  • Environment