Heuristic: VainF Torch-Pruning Channel Rounding Alignment
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
Round pruned channel counts to multiples of 4 or 8 to maximize GPU hardware utilization and achieve real inference speedups.
Description
Modern NVIDIA GPUs execute tensor operations using Tensor Cores that process data in fixed-size tiles. If the number of channels in a convolution or linear layer is not aligned to these tile boundaries, the hardware must pad internally, wasting compute capacity. After structural pruning removes channels, the resulting odd channel counts can actually be slower than the original aligned counts despite having fewer parameters.
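The padding cost is easy to see with a little arithmetic: the hardware effectively processes whole tiles, so a layer's effective width is its channel count rounded up to the tile size. A standalone sketch (the function name `padded_width` is illustrative, not from the library):

```python
import math

def padded_width(channels, tile):
    """Effective width when hardware processes whole tiles of `tile` elements,
    zero-padding the last partial tile."""
    return math.ceil(channels / tile) * tile

# A layer pruned from 128 to 121 channels still occupies the same number
# of fp16 tiles as before -- fewer parameters, but no tile savings:
print(padded_width(121, 8), padded_width(128, 8))  # 128 128
```

This is why an unaligned pruned layer can run no faster, or even slower, than the original aligned one.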
The round_to parameter in BasePruner forces the pruner to round the post-pruning channel count down to the nearest multiple of the specified value. This ensures hardware-friendly alignment is maintained throughout the pruning process.
Usage
Use this heuristic whenever you want to achieve real latency reduction from pruning, not just FLOPs reduction. It is especially important when:
- Deploying pruned models for inference on NVIDIA GPUs
- Measuring latency improvements with `measure_latency()`
- Pruning Vision Transformers where head dimensions must remain divisible by `num_heads`
The Insight (Rule of Thumb)
- Action: Set the `round_to` parameter when creating the pruner.
- Value:
  - `round_to=8` for fp16 / float16 inference (Tensor Core alignment)
  - `round_to=4` for tf32 / float32 inference
  - `round_to=num_heads` for Vision Transformers (ensures head_dim remains an integer)
  - `round_to=4` for LLMs (used in prune_llm.py)
- Trade-off: Slightly coarser pruning granularity; the actual pruning ratio may differ slightly from the target because channel counts are rounded down to a multiple of `round_to`.
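The trade-off can be quantified: compute the kept channel count, round it down, and compare the achieved ratio to the target. A small illustration with hypothetical numbers (the helper `achieved_ratio` is not part of the library):

```python
def achieved_ratio(channels, target_ratio, round_to):
    """Pruning ratio actually achieved after rounding the kept width down."""
    kept = channels - int(channels * target_ratio)
    kept_aligned = kept - kept % round_to  # round down to a multiple of round_to
    return 1 - kept_aligned / channels

# A 30% target on a 192-channel layer with round_to=8 keeps 128 channels,
# so the achieved ratio is 1 - 128/192, i.e. one third rather than 0.30:
print(achieved_ratio(192, 0.3, 8))
```

In practice the gap is small for wide layers and grows for narrow ones, where one tile is a large fraction of the width.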
Reasoning
NVIDIA Tensor Cores on Ampere (A100) and Hopper (H100) architectures operate on tiles of 8 (fp16) or 4 (tf32) elements. When channel counts are not multiples of these values, the GPU must pad with zeros, wasting bandwidth and compute. The NVIDIA Deep Learning Performance Guide explicitly recommends aligning channels/filters to multiples of 8 for optimal throughput.
For Vision Transformers, the total number of channels in QKV projections must be divisible by num_heads to compute per-head attention. Setting round_to=num_heads ensures the pruned model remains a valid Transformer.
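Concretely, with 12 heads a naively pruned embedding width usually fails the divisibility check, while rounding to a multiple of `num_heads` keeps head_dim an integer. A standalone sketch with ViT-Base-like numbers (chosen for illustration):

```python
num_heads = 12
embed_dim = 768

# Naive pruning removes 26 channels -> 742, which is not divisible by 12,
# so the QKV output cannot be reshaped into equal per-head slices.
naive = embed_dim - 26
print(naive % num_heads)  # nonzero remainder: invalid head_dim

# With round_to=num_heads the kept width is rounded down to a multiple of 12.
aligned = naive - naive % num_heads
print(aligned, aligned // num_heads)  # 732 channels -> head_dim 61
```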
Code Evidence
Rounding implementation from torch_pruning/pruner/algorithms/base_pruner.py:428-432:
```python
def _round_to(self, n_pruned, current_channels, round_to):
    rounded_channels = current_channels - n_pruned
    rounded_channels = rounded_channels - rounded_channels % round_to
    n_pruned = current_channels - rounded_channels
    return max(n_pruned, 0)
```
ViT-specific rounding from reproduce/main_imagenet.py:153-154:
```python
if 'vit' in args.model:
    round_to = model.encoder.layers[0].num_heads
```
fp16 vs tf32 rounding from examples/latency/measure_latency.py:93:
```python
round_to=8  # for fp16
round_to=4  # for tf32
```
LLM rounding from examples/LLMs/prune_llm.py:329:
```python
round_to=4
```
timm model rounding from examples/timm_models/prune_timm_models.py:101:
```python
round_to=8
```