
Principle:VainF Torch Pruning Structural Pruning

From Leeroopedia


Metadata

Field Value
Papers DepGraph: Towards Any Structural Pruning (Fang et al., CVPR 2023), Isomorphic Pruning (Fang et al., ECCV 2024)
Domains Deep_Learning, Model_Compression, Pruning
Last Updated 2026-02-08 00:00 GMT

Overview

A dependency-graph-driven approach to structural pruning that automatically identifies and removes coupled channels across interconnected layers.

Description

Structural pruning removes entire channels or filters from a neural network rather than zeroing out individual weights. Unlike unstructured (weight-level) pruning, structural pruning directly reduces the dimensions of weight tensors, yielding real speedups on standard hardware without the need for sparse-computation libraries.

The central challenge of structural pruning is inter-layer coupling: removing an output channel from one convolutional layer invalidates the corresponding input channel of every downstream layer that consumes that feature map. In networks with skip connections, concatenation, or split operations, the set of layers affected by a single channel removal can be large and non-obvious.

DepGraph solves this problem by:

  • Tracing the model with example inputs to build a computational graph.
  • Constructing a dependency graph where nodes represent layers (or parameters) and edges represent structural dependencies between them.
  • Grouping coupled parameters: given a pruning target layer and a set of channel indices, the dependency graph propagates the pruning decision through all edges to collect a Group -- the minimal set of (layer, indices) pairs that must be pruned together to keep the network structurally valid.
  • Pruning the group atomically: all layers in a group are pruned in a single coordinated operation.
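The grouping step above can be sketched in plain Python. This is a toy illustration of the idea, not the library's actual data structures: nodes are layer names, and each directed edge carries an index-mapping function that translates pruned channel indices from the source layer's coordinate system into the target's.

```python
from collections import deque

def collect_group(edges, start_layer, idxs):
    """BFS over dependency edges, returning the Group: the minimal set of
    (layer, indices) pairs that must be pruned together."""
    group = {start_layer: set(idxs)}
    queue = deque([(start_layer, tuple(idxs))])
    while queue:
        layer, indices = queue.popleft()
        for src, dst, index_map in edges:
            if src != layer:
                continue
            mapped = tuple(index_map(i) for i in indices)
            known = group.setdefault(dst, set())
            new = [i for i in mapped if i not in known]
            if new:
                known.update(new)
                queue.append((dst, tuple(new)))
    return {layer: sorted(ix) for layer, ix in group.items()}

# Hypothetical topology: conv1 feeds conv2 directly and also feeds a concat,
# whose output feeds conv3. All mappings here are identity for simplicity.
edges = [
    ("conv1", "conv2_in", lambda i: i),   # sequential dependency
    ("conv1", "concat", lambda i: i),     # first branch of the concat
    ("concat", "conv3_in", lambda i: i),  # concat output feeds conv3
]
group = collect_group(edges, "conv1", [0, 5])
# Pruning output channels {0, 5} of conv1 forces the same indices to be
# removed from conv2's and conv3's input channels.
```

In the real library, edge mappings also handle offsets (e.g. the second input of a concatenation maps index i to i plus the width of the first input), which is why propagation cannot simply reuse the original indices everywhere.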

The framework supports several ranking scopes that determine how channel importance is compared:

  • Local pruning -- each layer is ranked independently; channels are removed per-layer according to a target ratio.
  • Global pruning -- all channels across the entire model are ranked jointly; the least important channels network-wide are removed first.
  • Isomorphic pruning (ECCV 2024) -- groups with identical graph topology (same pattern of layer types and dependency edges) share a single ranking scope, balancing local and global strategies.
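The difference between local and global scope can be seen on toy importance scores (illustrative numbers only, not the library's API). Note how global ranking can concentrate all removals in one layer when its channels are uniformly weak:

```python
def local_prune(scores_per_layer, ratio):
    """Rank each layer independently; drop the lowest-scoring fraction."""
    pruned = {}
    for layer, scores in scores_per_layer.items():
        k = int(len(scores) * ratio)
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        pruned[layer] = sorted(order[:k])
    return pruned

def global_prune(scores_per_layer, ratio):
    """Rank all channels jointly; drop the globally least important."""
    flat = [(s, layer, i)
            for layer, scores in scores_per_layer.items()
            for i, s in enumerate(scores)]
    k = int(len(flat) * ratio)
    pruned = {layer: [] for layer in scores_per_layer}
    for _, layer, i in sorted(flat)[:k]:
        pruned[layer].append(i)
    return {layer: sorted(ix) for layer, ix in pruned.items()}

scores = {"conv1": [0.9, 0.8, 0.7, 0.6], "conv2": [0.4, 0.3, 0.2, 0.1]}
local = local_prune(scores, 0.5)   # two channels removed from each layer
glob = global_prune(scores, 0.5)   # all four conv2 channels removed
```

The global result, which empties conv2 entirely, illustrates why a balance such as isomorphic pruning (ranking only within topologically identical groups) can be preferable.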

Additional capabilities include:

  • Iterative multi-step pruning -- the target ratio is reached gradually over T steps, with a configurable scheduler controlling how much is removed at each step.
  • Multi-head attention support -- channels can be pruned at the head-dimension level, or entire attention heads can be removed, with correct handling of grouped query attention (GQA).
  • Channel-group awareness -- grouped convolutions and group normalization layers are handled by averaging importance across groups and replicating pruning indices.
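The channel-group handling in the last bullet can be sketched as follows. This is a simplified sketch of the idea, not the library's internals: importance is averaged across the G groups, the weakest positions are chosen once, and the chosen offsets are replicated into every group so all groups shrink equally.

```python
def grouped_prune_indices(importance, num_groups, num_remove_per_group):
    group_size = len(importance) // num_groups
    # Average the importance of position p across all groups.
    avg = [
        sum(importance[g * group_size + p] for g in range(num_groups)) / num_groups
        for p in range(group_size)
    ]
    # Pick the positions with the lowest averaged importance...
    order = sorted(range(group_size), key=lambda p: avg[p])
    chosen = order[:num_remove_per_group]
    # ...and replicate them into every group.
    return sorted(g * group_size + p for g in range(num_groups) for p in chosen)

# 8 channels in 2 groups of 4; position 1 is weak in both groups.
imp = [0.9, 0.1, 0.8, 0.7, 0.6, 0.2, 0.9, 0.8]
idxs = grouped_prune_indices(imp, num_groups=2, num_remove_per_group=1)
# Removes channel 1 from group 0 and channel 5 (= 4 + 1) from group 1.
```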

Usage

Use structural pruning when you need to reduce model size and inference latency through channel removal. This is the primary approach for all pruning workflows in the Torch-Pruning library. Typical scenarios include:

  • Compressing a pretrained CNN (ResNet, VGG, EfficientNet, etc.) before deployment.
  • Reducing the hidden dimensions of a Vision Transformer or BERT model.
  • Iteratively pruning and fine-tuning to recover accuracy at a target compression ratio.

Theoretical Basis

Given a neural network, construct a dependency graph G = (V, E) where:

  • V = set of layers (modules and parameters) in the network.
  • E = set of directed dependency edges. An edge (u, v) means that a pruning operation on layer u triggers a required pruning operation on layer v.

For a pruning target layer l with pruning indices I, propagate the pruning through all reachable dependencies to obtain a Group:

G_l = { (l_i, I_i) | l_i is reachable from l in G, where I_i is I translated into l_i's channel indexing by the mappings along the traversed edges }

The Group is the atomic pruning unit -- all entries must be pruned together to maintain structural consistency.

Pruning ratio scheduling: for iterative pruning with T steps targeting a final ratio r, the default linear scheduler computes the per-step cumulative ratio as:

r_t = (t / T) * r,    for t = 1, 2, ..., T

At step t, the number of channels to remove is determined by the difference between the current channel count and initial_channels * (1 - r_t).
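The scheduler arithmetic above works out as follows (plain-Python sketch of the formulas, not the library's scheduler class):

```python
def linear_schedule(initial_channels, final_ratio, T):
    """Per-step (cumulative_ratio, channels_removed) for T pruning steps."""
    steps = []
    current = initial_channels
    for t in range(1, T + 1):
        r_t = (t / T) * final_ratio                     # cumulative ratio
        target = round(initial_channels * (1 - r_t))    # channels to keep
        steps.append((r_t, current - target))           # removed this step
        current = target
    return steps

# 64 channels, 50% final ratio, 4 steps: 8 channels removed per step.
sched = linear_schedule(64, 0.5, 4)
# [(0.125, 8), (0.25, 8), (0.375, 8), (0.5, 8)]
```

Because the ratio is applied to the initial channel count rather than the current one, a linear schedule removes a constant number of channels per step.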

Importance estimation is pluggable. Common choices include:

  • Magnitude-based -- L1 or L2 norm of filter weights.
  • Taylor-based -- first-order Taylor expansion of the loss with respect to channel removal.
  • Hessian-based -- second-order (OBD/OBS-style) importance using Fisher information.
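The simplest of these, L2 magnitude importance, can be sketched with plain Python lists (the library computes the same quantity over real weight tensors):

```python
import math

def l2_filter_importance(weight):
    """weight: [out_channels][flattened filter weights] -> per-channel L2 norm."""
    return [math.sqrt(sum(w * w for w in filt)) for filt in weight]

weight = [
    [3.0, 4.0],   # channel 0: norm 5.0
    [0.1, 0.1],   # channel 1: near-zero norm, prime candidate for removal
    [1.0, 0.0],   # channel 2: norm 1.0
]
scores = l2_filter_importance(weight)
least = min(range(len(scores)), key=lambda i: scores[i])  # channel 1
```

In grouped pruning, such per-channel scores are aggregated across every layer in a Group before ranking, so that a channel is judged by its importance in all coupled layers at once.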
