Principle: VainF Torch-Pruning Taylor Importance
Overview
First-Order Taylor Expansion Importance is an importance estimation criterion for structured neural network pruning. It uses a first-order Taylor expansion of the loss function to approximate the effect of removing each channel (or filter) from a network. By combining weight magnitudes with gradient information obtained from a backward pass on calibration data, Taylor importance provides a principled, data-driven measure of each channel's contribution to the network's output.
Description
The core idea behind Taylor importance is to estimate the change in the loss function that would result from setting a particular set of parameters (e.g., all weights in a convolutional channel) to zero. Rather than computing this change exactly -- which would require a forward pass for every candidate channel -- the method uses a first-order Taylor expansion to obtain a local linear approximation.
Given a loss function L and a set of network parameters w, the change in loss when a parameter w_i is removed (set to zero) can be approximated as:

$$\Delta L_i \approx -\frac{\partial L}{\partial w_i} w_i, \qquad I_i = \left| \frac{\partial L}{\partial w_i} w_i \right|$$

where $\frac{\partial L}{\partial w_i}$ is the gradient of the loss with respect to parameter w_i.
For a convolutional channel c containing multiple parameters, the importance is aggregated across all parameters belonging to that channel. Two variants exist:
- Standard (abs-then-sum): Compute the absolute value of each element-wise product first, then sum across the channel. This treats each parameter independently:

$$I_c^{\text{standard}} = \sum_{i \in c} \left| \frac{\partial L}{\partial w_i} w_i \right|$$
- Multivariable (sum-then-abs): Sum the element-wise products first, then take the absolute value. This captures correlations between parameters within a channel:

$$I_c^{\text{multi}} = \left| \sum_{i \in c} \frac{\partial L}{\partial w_i} w_i \right|$$
The standard variant is more conservative: it never allows cancellation between positive and negative contributions, resulting in uniformly higher importance scores. The multivariable variant can yield lower scores when positive and negative Taylor terms within a channel cancel each other out, potentially identifying channels whose parameters have offsetting effects on the loss.
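The cancellation behavior is easy to see on a toy example. The sketch below uses made-up per-parameter Taylor terms $g_i w_i$ (NumPy stands in for the actual framework) to compare the two aggregations:

```python
import numpy as np

# Hypothetical first-order Taylor terms g_i * w_i for one channel,
# chosen so that positive and negative contributions nearly cancel.
taylor_terms = np.array([0.5, -0.4, 0.3, -0.35])

standard = np.abs(taylor_terms).sum()    # abs-then-sum: 1.55
multivariable = abs(taylor_terms.sum())  # sum-then-abs: 0.05

# By the triangle inequality, the standard score always
# upper-bounds the multivariable score.
assert multivariable <= standard
```

Here the same channel scores 1.55 under the standard variant but only 0.05 under the multivariable variant, because its positive and negative Taylor terms offset each other almost exactly.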
Usage
Taylor importance is appropriate when:
- Gradient information is available: The method requires a backward pass on calibration data to compute gradients. This means at least a small batch of representative training or validation data must be fed through the network, with `loss.backward()` called before scoring.
- Data-driven pruning is desired: Unlike magnitude-only methods (e.g., L1-norm pruning), Taylor importance takes into account how each channel interacts with the actual data distribution, often yielding more accurate importance estimates.
- Structured pruning is the goal: Taylor importance naturally extends to groups of parameters (e.g., entire output channels of a convolution, along with their dependent batch normalization parameters and downstream input channels), making it well-suited for structural pruning where entire channels or filters are removed.
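For structured pruning, the per-parameter terms are grouped by output channel before aggregation. A minimal sketch, using randomly generated NumPy stand-ins for a conv layer's weight and its gradient (shapes and values are illustrative only, not Torch-Pruning's internals):

```python
import numpy as np

rng = np.random.default_rng(0)
out_ch, in_ch, kh, kw = 4, 3, 3, 3

# Stand-ins for a conv weight W and the gradient dL/dW that a
# backward pass on calibration data would produce.
weight = rng.normal(size=(out_ch, in_ch, kh, kw))
grad = rng.normal(size=(out_ch, in_ch, kh, kw))

# Element-wise first-order Taylor terms, flattened per output channel.
taylor = (weight * grad).reshape(out_ch, -1)

imp_standard = np.abs(taylor).sum(axis=1)  # abs-then-sum per channel
imp_multi = np.abs(taylor.sum(axis=1))     # sum-then-abs per channel

# Channels with the lowest scores are the candidates for removal.
prune_order = np.argsort(imp_standard)
```

In a real pipeline the same group score would also fold in the channel's dependent parameters (batch norm scale/shift, downstream input channels) rather than the conv weight alone.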
Taylor importance is more computationally expensive than pure magnitude-based methods because it requires a backward pass, but it is less expensive than second-order methods (e.g., Hessian-based) which require computing or approximating curvature information.
Theoretical Basis
The Taylor expansion importance criterion is derived from a first-order approximation of the loss function around the current parameter values.
Consider the loss L(w) as a function of the full parameter vector w. When a subset of parameters (a channel c) is pruned (set to zero), the new loss is L(w'), where w' differs from w only in the pruned positions. The change in loss is:

$$\Delta L = L(w') - L(w)$$
Using a first-order Taylor expansion around w:

$$L(w') \approx L(w) + \sum_i \frac{\partial L}{\partial w_i}(w'_i - w_i) = L(w) - \sum_{i \in c} \frac{\partial L}{\partial w_i} w_i$$

since w'_i = 0 for pruned parameters and w'_i = w_i for retained parameters.
Taking the absolute value gives the importance:

$$I_c = \left| \sum_{i \in c} \frac{\partial L}{\partial w_i} w_i \right|$$
This is the multivariable formulation. The standard formulation instead sums the absolute values:

$$I_c = \sum_{i \in c} \left| \frac{\partial L}{\partial w_i} w_i \right|$$
The standard formulation can be interpreted as an upper bound on the multivariable formulation (via the triangle inequality), providing a more conservative estimate. The multivariable formulation is a more faithful approximation of the actual loss change but may underestimate importance when parameter contributions cancel.
Key assumptions:
- The loss surface is locally smooth (well-approximated by its first-order Taylor expansion).
- The pruning perturbation is small enough that higher-order terms are negligible.
- Gradients are computed on data that is representative of the deployment distribution.
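The first two assumptions can be checked numerically. The sketch below uses a small least-squares loss (an illustrative choice, not from the source) and compares the actual loss change from zeroing one weight against the first-order estimate $-\frac{\partial L}{\partial w_i} w_i$; because the pruned weight is small, the two agree up to a second-order remainder:

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0.5, 1.0, 1.0])
w = np.array([0.01, 1.0])  # w[0] is small, so pruning it is a small perturbation

def loss(w):
    r = X @ w - y
    return r @ r  # squared-error loss

grad = 2 * X.T @ (X @ w - y)       # analytic gradient of the loss at w

estimate = -grad[0] * w[0]         # first-order Taylor estimate of delta-L
w_pruned = w.copy()
w_pruned[0] = 0.0
actual = loss(w_pruned) - loss(w)  # exact loss change from pruning w[0]

# estimate = 0.0096, actual = 0.0098; the 2e-4 gap is exactly the
# neglected second-order term, which shrinks quadratically with w[0].
```

When the pruned weights are large, the remainder grows and the first-order score becomes unreliable, which is why gradients should be taken at the trained parameters on representative data.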