
Principle:VainF Torch Pruning Hessian Importance

From Leeroopedia


Metadata

Page Type: Principle
Knowledge Sources: Paper (Optimal Brain Damage, LeCun et al., NeurIPS 1989); Paper (DepGraph)
Domains: Deep_Learning, Model_Compression, Pruning
Last Updated: 2026-02-08 00:00 GMT

Overview

Second-order importance estimation using diagonal Hessian approximation to measure the curvature-weighted impact of removing channels.

Description

Optimal Brain Damage uses second-order information (the diagonal of the Hessian matrix) to estimate the importance of individual parameters and, by extension, entire channels. Rather than relying solely on weight magnitude, this approach accounts for the local curvature of the loss surface.

The importance of a channel is proportional to:

$w^2 \cdot H_{\text{diag}}$

where $H_{\text{diag}}$ approximates the second-order partial derivative $\partial^2 L / \partial w^2$.

To avoid the prohibitive cost of computing the full Hessian, the diagonal Hessian is estimated via per-sample gradient accumulation:

$H_{\text{diag}} \approx \mathbb{E}\left[g^2\right]$

where $g = \partial L / \partial w$ is the per-sample gradient. This Gauss-Newton approximation accumulates the squared gradients across a batch of training samples to build an efficient diagonal estimate.
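The accumulation above can be sketched in plain PyTorch. This is a minimal illustration of the squared-gradient estimate, not the Torch-Pruning API; the tiny linear model and random data are placeholders:

```python
import torch

# Estimate the diagonal Hessian via the Gauss-Newton approximation
# H_diag ~= E[g^2], accumulating squared per-sample gradients.
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)          # placeholder model
loss_fn = torch.nn.MSELoss()

x = torch.randn(8, 4)                  # a small batch of training samples
y = torch.randn(8, 1)

h_diag = torch.zeros_like(model.weight)
for xi, yi in zip(x, y):               # one backward pass per sample
    model.zero_grad()
    loss = loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0))
    loss.backward()
    h_diag += model.weight.grad.pow(2)  # accumulate g^2
h_diag /= len(x)                        # Monte-Carlo estimate of E[g^2]

# Curvature-weighted importance: w^2 * H_diag, one score per weight.
importance = model.weight.detach().pow(2) * h_diag
```

Because the estimate is an average of squared gradients, it is non-negative by construction, which keeps the resulting importance scores non-negative as well.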

Usage

Hessian-based importance estimation is recommended when:

  • Highest-accuracy importance estimates are needed and the computational cost of per-sample gradient accumulation is acceptable.
  • Well-conditioned models are being pruned, where second-order information gives a more faithful estimate of parameter sensitivity than first-order (Taylor) methods.
  • Structured pruning is performed, where removing an entire channel or filter group requires aggregating importance scores across all parameters in that group.

Hessian importance is more accurate than Taylor-based importance for well-conditioned models, at the cost of requiring multiple forward-backward passes to accumulate the diagonal Hessian estimate.

Theoretical Basis

The core idea originates from a second-order Taylor expansion of the loss function around the current parameter values. Assuming the model has been trained to a local minimum, the first-order gradient term vanishes, so for a perturbation $\delta\mathbf{w}$ (here, setting the pruned parameters to zero) the resulting change in loss is approximated as:

$\Delta L \approx \frac{1}{2}\,\delta\mathbf{w}^{T}\mathbf{H}\,\delta\mathbf{w}$

Under the diagonal approximation (assuming off-diagonal Hessian entries are negligible), this simplifies to a per-parameter importance score. For a channel c, the group importance is the sum over all parameters belonging to that channel:

$\text{importance}(c) = \sum_{i \in c} w_i^2 H_{ii}$
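This group score follows from substituting the pruning perturbation into the diagonal expansion: removing channel $c$ sets each of its weights to zero, i.e. $\delta w_i = -w_i$, and the constant factor $\frac{1}{2}$ is dropped because it does not affect the ranking:

```latex
\Delta L \approx \frac{1}{2}\sum_{i \in c} H_{ii}\,\delta w_i^{2}
\qquad \delta w_i = -w_i
\;\Longrightarrow\;
\Delta L \approx \frac{1}{2}\sum_{i \in c} w_i^{2} H_{ii}
```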

The diagonal Hessian entries Hii are estimated using the Gauss-Newton approximation:

$\mathbb{E}\!\left[\frac{\partial^2 L}{\partial w^2}\right] \approx \mathbb{E}\!\left[\left(\frac{\partial L}{\partial w}\right)^2\right]$

This approximation is exact for linear models with squared loss and serves as a positive semi-definite approximation of the true Hessian for general models, ensuring that importance scores are non-negative.
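Putting the pieces together, the channel-level aggregation can be sketched for a convolution layer. This is illustrative only: `h_diag` stands in for a squared-gradient estimate accumulated as described above, and the layer shape is arbitrary:

```python
import torch

torch.manual_seed(0)
conv = torch.nn.Conv2d(3, 8, kernel_size=3)
w = conv.weight.detach()       # shape (out_channels, in_channels, kH, kW)
h_diag = torch.rand_like(w)    # placeholder for an accumulated E[g^2] estimate

# Per-parameter scores w_i^2 * H_ii, then sum over each output channel's
# parameters (in_channels x kH x kW) to get one score per channel.
per_param = w.pow(2) * h_diag
channel_importance = per_param.sum(dim=(1, 2, 3))

# Channels with the smallest scores are the pruning candidates.
prune_order = channel_importance.argsort()
```

Summing over `dim=(1, 2, 3)` implements the group sum $\sum_{i \in c}$ for output-channel pruning; pruning input channels would instead sum over dimensions $(0, 2, 3)$.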
