Principle:VainF Torch Pruning Hessian Importance
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Paper (Optimal Brain Damage (LeCun et al., NeurIPS 1989)), Paper (DepGraph) |
| Domains | Deep_Learning, Model_Compression, Pruning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Second-order importance estimation that uses a diagonal Hessian approximation to measure the curvature-weighted impact of removing channels.
Description
Optimal Brain Damage uses second-order information (the diagonal of the Hessian matrix) to estimate the importance of individual parameters and, by extension, entire channels. Rather than relying solely on weight magnitude, this approach accounts for the local curvature of the loss surface.
The importance of each parameter $w_i$ within a channel is proportional to:

$$I_i = \frac{1}{2} h_{ii} w_i^2$$

where $h_{ii}$ approximates the second-order partial derivative $\partial^2 \mathcal{L} / \partial w_i^2$.
To avoid the prohibitive cost of computing the full Hessian, the diagonal Hessian is estimated via per-sample gradient accumulation:

$$h_{ii} \approx \frac{1}{N} \sum_{n=1}^{N} \left( g_i^{(n)} \right)^2$$

where $g_i^{(n)} = \partial \mathcal{L}_n / \partial w_i$ is the gradient of the loss on the $n$-th sample. This Gauss-Newton-style approximation accumulates squared gradients across a batch of $N$ training samples to build an efficient diagonal estimate.
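The per-sample accumulation above can be sketched for a one-parameter linear model; the function names and toy data below are illustrative, not part of any library API.

```python
# Sketch: estimate the diagonal Hessian of a squared loss for the
# 1-parameter model y_hat = w * x by averaging squared per-sample gradients
# (the Gauss-Newton-style estimate described above). Names are hypothetical.

def diag_hessian_estimate(w, samples):
    """Accumulate squared per-sample gradients of L_n = 0.5 * (w*x_n - y_n)^2."""
    acc = 0.0
    for x, y in samples:
        g = (w * x - y) * x      # per-sample gradient dL_n/dw
        acc += g * g             # squared gradient
    return acc / len(samples)    # average over the N samples

def obd_importance(w, h_diag):
    """Optimal Brain Damage score: 0.5 * h_ii * w_i^2."""
    return 0.5 * h_diag * w * w

samples = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)]
w = 1.5
h = diag_hessian_estimate(w, samples)
score = obd_importance(w, h)
```

In a real network the same accumulation runs over per-sample backward passes, one squared-gradient update per parameter.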
Usage
Hessian-based importance estimation is recommended when:
- Highest-accuracy importance estimates are needed and the computational cost of per-sample gradient accumulation is acceptable.
- Well-conditioned models are being pruned, where second-order information provides a more faithful estimate of parameter sensitivity than first-order (Taylor) methods.
- Pruning is structured, so that removing an entire channel or filter group requires aggregating importance scores across all parameters in that group.
Hessian importance is more accurate than Taylor-based importance for well-conditioned models, at the cost of requiring multiple forward-backward passes to accumulate the diagonal Hessian estimate.
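The structured-pruning case above requires summing per-parameter scores over each channel group. A minimal sketch, with illustrative toy numbers rather than real layer weights:

```python
# Sketch: aggregate OBD scores 0.5 * h_ii * w_i^2 over all parameters in a
# channel, then compare channels. Weights and Hessian values are made up.

def channel_importance(weights, h_diag):
    """Group importance: sum of per-parameter OBD scores in one channel."""
    return sum(0.5 * h * w * w for w, h in zip(weights, h_diag))

# Two channels of a toy layer: (parameter values, diagonal-Hessian estimates).
ch0 = channel_importance([0.5, -1.0], [0.2, 0.1])
ch1 = channel_importance([0.05, 0.02], [0.3, 0.4])

# The channel with the lower aggregate score (ch1 here) is pruned first.
```

Summing (rather than, say, taking the max) treats the channel's loss impact as additive across its parameters, consistent with the diagonal approximation.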
Theoretical Basis
The core idea originates from a second-order Taylor expansion of the loss function around the current parameter values. When a set of parameters is removed (set to zero, i.e., $\delta w_i = -w_i$), the resulting change in loss is approximated as:

$$\Delta \mathcal{L} \approx \sum_i g_i \, \delta w_i + \frac{1}{2} \sum_{i,j} h_{ij} \, \delta w_i \, \delta w_j$$

where $g_i = \partial \mathcal{L} / \partial w_i$ and $h_{ij} = \partial^2 \mathcal{L} / \partial w_i \, \partial w_j$. For a model trained to a (local) minimum, the first-order term is assumed negligible.
Under the diagonal approximation (assuming off-diagonal Hessian entries $h_{ij}$, $i \neq j$, are negligible), this simplifies to the per-parameter importance score $I_i = \frac{1}{2} h_{ii} w_i^2$. For a channel $c$, the group importance is the sum over all parameters belonging to that channel:

$$I_c = \sum_{i \in c} \frac{1}{2} h_{ii} w_i^2$$
The diagonal Hessian entries are estimated using the Gauss-Newton approximation:

$$h_{ii} \approx \frac{1}{N} \sum_{n=1}^{N} \left( \frac{\partial \mathcal{L}_n}{\partial w_i} \right)^2$$
This approximation is exact for linear models with squared loss and serves as a positive semi-definite approximation of the true Hessian for general models, ensuring that importance scores are non-negative.
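The exactness claim for linear models with squared loss can be checked numerically. The sketch below assumes a single-parameter model and compares the Gauss-Newton diagonal term against a finite-difference estimate of the true second derivative.

```python
# Check: for a linear model y_hat = w*x with squared loss L = 0.5*(w*x - y)^2,
# the Gauss-Newton diagonal term x^2 equals the true second derivative
# d^2L/dw^2 (the loss is exactly quadratic in w). Names are illustrative.

def loss(w, x, y):
    r = w * x - y
    return 0.5 * r * r

def second_derivative(f, w, eps=1e-4):
    """Central finite-difference estimate of f''(w)."""
    return (f(w + eps) - 2.0 * f(w) + f(w - eps)) / (eps * eps)

x, y, w = 3.0, 5.0, 1.5
true_h = second_derivative(lambda v: loss(v, x, y), w)
gauss_newton_h = x * x   # the J^T J diagonal term for a linear model
```

For nonlinear models the two quantities differ by a residual-dependent term, which is why Gauss-Newton is only an approximation there.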