Principle:VainF Torch Pruning Hessian Importance
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Paper (Optimal Brain Damage (LeCun et al., NeurIPS 1989)), Paper (DepGraph) |
| Domains | Deep_Learning, Model_Compression, Pruning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Second-order importance estimation that uses a diagonal Hessian approximation to measure the curvature-weighted impact of removing channels.
Description
Optimal Brain Damage uses second-order information (the diagonal of the Hessian matrix) to estimate the importance of individual parameters and, by extension, entire channels. Rather than relying solely on weight magnitude, this approach accounts for the local curvature of the loss surface.
The importance of each parameter $w_i$ within a channel is proportional to:

$$I_i = \frac{1}{2} h_{ii} w_i^2$$

where $h_{ii}$ approximates the second-order partial derivative $\partial^2 \mathcal{L} / \partial w_i^2$.
To avoid the prohibitive cost of computing the full Hessian, the diagonal Hessian is estimated via per-sample gradient accumulation:

$$h_{ii} \approx \frac{1}{N} \sum_{n=1}^{N} \left( g_i^{(n)} \right)^2$$

where $g_i^{(n)} = \partial \mathcal{L}_n / \partial w_i$ is the gradient of the loss on the $n$-th sample. This Gauss-Newton-style approximation accumulates squared gradients across a batch of $N$ training samples to build an efficient diagonal estimate.
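The per-sample accumulation above can be sketched for a one-parameter linear model; the function names and toy data below are illustrative, not part of any library API.

```python
# Sketch: estimate the diagonal Hessian of a squared loss for the
# 1-parameter model y_hat = w * x by averaging squared per-sample gradients
# (the Gauss-Newton-style estimate described above). Names are hypothetical.

def diag_hessian_estimate(w, samples):
    """Accumulate squared per-sample gradients of L_n = 0.5 * (w*x_n - y_n)^2."""
    acc = 0.0
    for x, y in samples:
        g = (w * x - y) * x      # per-sample gradient dL_n/dw
        acc += g * g             # squared gradient
    return acc / len(samples)    # average over the N samples

def obd_importance(w, h_diag):
    """Optimal Brain Damage score: 0.5 * h_ii * w_i^2."""
    return 0.5 * h_diag * w * w

samples = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)]
w = 1.5
h = diag_hessian_estimate(w, samples)
score = obd_importance(w, h)
```

In a real network the same accumulation runs over per-sample backward passes, one squared-gradient update per parameter.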
Usage
Hessian-based importance estimation is recommended when:
- Highest-accuracy importance estimates are needed and the computational cost of per-sample gradient accumulation is acceptable.
- Well-conditioned models are being pruned, where second-order information provides a more faithful estimate of parameter sensitivity than first-order (Taylor) methods.
- Pruning is structured, so that removing an entire channel or filter group requires aggregating importance scores across all parameters in that group.
Hessian importance is more accurate than Taylor-based importance for well-conditioned models, at the cost of requiring multiple forward-backward passes to accumulate the diagonal Hessian estimate.
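The structured-pruning case above requires summing per-parameter scores over each channel group. A minimal sketch, with illustrative toy numbers rather than real layer weights:

```python
# Sketch: aggregate OBD scores 0.5 * h_ii * w_i^2 over all parameters in a
# channel, then compare channels. Weights and Hessian values are made up.

def channel_importance(weights, h_diag):
    """Group importance: sum of per-parameter OBD scores in one channel."""
    return sum(0.5 * h * w * w for w, h in zip(weights, h_diag))

# Two channels of a toy layer: (parameter values, diagonal-Hessian estimates).
ch0 = channel_importance([0.5, -1.0], [0.2, 0.1])
ch1 = channel_importance([0.05, 0.02], [0.3, 0.4])

# The channel with the lower aggregate score (ch1 here) is pruned first.
```

Summing (rather than, say, taking the max) treats the channel's loss impact as additive across its parameters, consistent with the diagonal approximation.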
Theoretical Basis
The core idea originates from a second-order Taylor expansion of the loss function around the current parameter values. When a set of parameters is removed (set to zero, i.e., $\delta w_i = -w_i$), the resulting change in loss is approximated as:

$$\Delta \mathcal{L} \approx \sum_i g_i \, \delta w_i + \frac{1}{2} \sum_{i,j} h_{ij} \, \delta w_i \, \delta w_j$$

where $g_i = \partial \mathcal{L} / \partial w_i$ and $h_{ij} = \partial^2 \mathcal{L} / \partial w_i \, \partial w_j$. For a model trained to a (local) minimum, the first-order term is assumed negligible.
Under the diagonal approximation (assuming off-diagonal Hessian entries $h_{ij}$, $i \neq j$, are negligible), this simplifies to the per-parameter importance score $I_i = \frac{1}{2} h_{ii} w_i^2$. For a channel $c$, the group importance is the sum over all parameters belonging to that channel:

$$I_c = \sum_{i \in c} \frac{1}{2} h_{ii} w_i^2$$
The diagonal Hessian entries are estimated using the Gauss-Newton approximation:

$$h_{ii} \approx \frac{1}{N} \sum_{n=1}^{N} \left( \frac{\partial \mathcal{L}_n}{\partial w_i} \right)^2$$
This approximation is exact for linear models with squared loss and serves as a positive semi-definite approximation of the true Hessian for general models, ensuring that importance scores are non-negative.
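The exactness claim for linear models with squared loss can be checked numerically. The sketch below assumes a single-parameter model and compares the Gauss-Newton diagonal term against a finite-difference estimate of the true second derivative.

```python
# Check: for a linear model y_hat = w*x with squared loss L = 0.5*(w*x - y)^2,
# the Gauss-Newton diagonal term x^2 equals the true second derivative
# d^2L/dw^2 (the loss is exactly quadratic in w). Names are illustrative.

def loss(w, x, y):
    r = w * x - y
    return 0.5 * r * r

def second_derivative(f, w, eps=1e-4):
    """Central finite-difference estimate of f''(w)."""
    return (f(w + eps) - 2.0 * f(w) + f(w - eps)) / (eps * eps)

x, y, w = 3.0, 5.0, 1.5
true_h = second_derivative(lambda v: loss(v, x, y), w)
gauss_newton_h = x * x   # the J^T J diagonal term for a linear model
```

For nonlinear models the two quantities differ by a residual-dependent term, which is why Gauss-Newton is only an approximation there.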