
Principle:VainF Torch Pruning Group Norm Regularization

From Leeroopedia



Overview

An importance-adaptive group-level regularization method that drives low-importance channels toward zero during training.

Description

Group Norm Regularization extends traditional sparsity regularization to operate at the dependency group level. Instead of applying uniform regularization, it uses an importance-adaptive scaling factor:

\[\gamma = \alpha^{\frac{\text{imp}_{\max} - \text{imp}}{\text{imp}_{\max} - \text{imp}_{\min}}}\]

that applies stronger regularization to less important channels. This creates a natural sparsity pattern aligned with the dependency graph structure, making subsequent structural pruning more effective.
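As a minimal sketch of the scaling formula above (plain Python, not the library's own code; the names `adaptive_gamma`, `scores`, and `alpha` are illustrative):

```python
# Importance-adaptive scaling factor (illustrative sketch, not the
# Torch-Pruning implementation). alpha widens the gap between the
# regularization applied to important vs. unimportant channels.
def adaptive_gamma(scores, alpha=4.0):
    s_max, s_min = max(scores), min(scores)
    if s_max == s_min:  # all channels equally important
        return [1.0 for _ in scores]
    # gamma_i = alpha ** ((s_max - s_i) / (s_max - s_min)), in [1, alpha]
    return [alpha ** ((s_max - s) / (s_max - s_min)) for s in scores]

scores = [0.9, 0.5, 0.1]          # hypothetical channel importance scores
gammas = adaptive_gamma(scores)    # most important -> 1.0, least -> 4.0
```

The most important channel gets γ = 1 (baseline regularization) while the least important gets γ = α, so the penalty interpolates exponentially between the two.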

The method modifies weight gradients in-place during training:

grad += reg × γ × w

Key characteristics of Group Norm Regularization:

  • Group-level operation: Regularization is applied across entire dependency groups, not individual layers, ensuring consistency across structurally coupled parameters.
  • Importance-adaptive scaling: The exponential scaling factor gamma ensures that channels with lower importance scores receive stronger regularization pressure.
  • In-place gradient modification: The regularizer operates by directly modifying weight.grad.data, avoiding any changes to the loss function itself.
  • Covers multiple layer types: Handles output channels (Conv, Linear), input channels, and BatchNorm layers within each group.
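The in-place gradient modification can be sketched with NumPy arrays standing in for a layer's weight and gradient tensors (an illustrative sketch, assuming per-output-channel importance scores; Torch-Pruning performs the equivalent update on `weight.grad.data` for every layer in a dependency group):

```python
import numpy as np

# Sketch of the in-place gradient modification: grad += reg * gamma * w.
# `regularize_grad` and its arguments are illustrative names, not the
# library's API.
def regularize_grad(weight, grad, importance, reg=1e-4, alpha=4.0):
    s_max, s_min = importance.max(), importance.min()
    # Per-channel adaptive factor, in [1, alpha]
    gamma = alpha ** ((s_max - importance) / (s_max - s_min))
    # Broadcast gamma over each output channel's weights and add the
    # penalty term on top of the existing task-loss gradient.
    grad += reg * gamma[:, None] * weight
    return grad

w = np.ones((3, 2))              # 3 output channels, 2 inputs each
g = np.zeros_like(w)             # pretend the task-loss gradient is zero
imp = np.array([0.9, 0.5, 0.1])  # hypothetical channel importances
regularize_grad(w, g, imp, reg=0.01)
```

Because only the gradient buffer is touched, the loss function and the optimizer remain unchanged; the penalty simply rides along with the next optimizer step.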

Usage

Use during the sparse training phase before pruning. Best suited for CNN architectures where group-level sparsity patterns need to be learned jointly. This is the primary regularization method of DepGraph (CVPR 2023).

The typical workflow is:

  1. Initialize GroupNormPruner with a model and importance estimator.
  2. During training, call update_regularizer() at the start of each epoch to refresh group structure.
  3. After each loss.backward(), call regularize(model) to inject adaptive regularization into gradients.
  4. After sparse training completes, call step() to perform the actual structural pruning.
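The call ordering above can be sketched with a stand-in pruner object (a structural sketch only; the real `GroupNormPruner` comes from the Torch-Pruning library and is constructed from a model, example inputs, and an importance estimator):

```python
# Structural sketch of the sparse-training loop. SketchPruner is a
# stand-in that only records call order; it is NOT the real
# torch_pruning GroupNormPruner.
class SketchPruner:
    def __init__(self):
        self.calls = []
    def update_regularizer(self):   # refresh group structure each epoch
        self.calls.append("update_regularizer")
    def regularize(self, model):    # inject adaptive penalty into gradients
        self.calls.append("regularize")
    def step(self):                 # perform the structural pruning itself
        self.calls.append("step")

pruner, model = SketchPruner(), object()
for epoch in range(2):
    pruner.update_regularizer()      # step 2: start of each epoch
    for batch in range(3):
        # loss = criterion(model(x), y); loss.backward()  # task gradients
        pruner.regularize(model)     # step 3: after backward, before optimizer.step()
        # optimizer.step(); optimizer.zero_grad()
pruner.step()                        # step 4: prune after sparse training
```

The essential constraint is ordering: `regularize()` must run after `backward()` (so gradients exist) and before the optimizer step (so the penalty is actually applied).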

Theoretical Basis

For each group \(g\) with importance scores \(\{s_1, \ldots, s_n\}\), compute the adaptive factor:

\[\gamma_i = \alpha^{\frac{s_{\max} - s_i}{s_{\max} - s_{\min}}}\]

The gradient is then modified as:

\[\nabla_{w_i} \mathrel{+}= \lambda \, \gamma_i \, w_i\]

where λ is the base regularization coefficient.

This formulation has the following properties:

  • When \(s_i = s_{\max}\), \(\gamma_i = \alpha^0 = 1\) (minimal regularization).
  • When \(s_i = s_{\min}\), \(\gamma_i = \alpha^1 = \alpha\) (maximum regularization).
  • The parameter \(\alpha\) controls the regularization range, defaulting to 4 (i.e., scaling from \(2^0\) to \(2^\alpha\)).

This drives low-importance channels to zero faster than high-importance ones, producing a clean sparsity structure aligned with the dependency graph.
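This effect can be checked numerically. Applying the update repeatedly, with the task-loss gradient held at zero as a simplifying assumption, shrinks a high-γ (low-importance) channel far faster than a low-γ one (`decay` and its parameters are illustrative names):

```python
# Numerical check: under repeated updates w -= lr * (grad + reg * gamma * w),
# high-gamma (unimportant) channels decay geometrically faster.
# Simplifying assumption: the task-loss gradient is zero throughout.
def decay(w0, gamma, reg=0.1, lr=1.0, steps=50):
    w = w0
    for _ in range(steps):
        w -= lr * reg * gamma * w   # gradient-descent step on the penalty term
    return w

w_hi = decay(1.0, gamma=1.0)  # important channel:   shrinks by 0.9 per step
w_lo = decay(1.0, gamma=4.0)  # unimportant channel: shrinks by 0.6 per step
```

After 50 steps the important channel retains roughly \(0.9^{50} \approx 5 \times 10^{-3}\) of its magnitude, while the unimportant channel is already near machine-zero, which is exactly the clean sparsity structure the pruner then exploits.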
