Heuristic: VainF Torch-Pruning Channel Rounding Alignment
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
Round pruned channel counts to multiples of 4 or 8 to maximize GPU hardware utilization and achieve real inference speedups.
Description
Modern NVIDIA GPUs execute tensor operations using Tensor Cores that process data in fixed-size tiles. If the number of channels in a convolution or linear layer is not aligned to these tile boundaries, the hardware must pad internally, wasting compute capacity. After structural pruning removes channels, the resulting odd channel counts can actually be slower than the original aligned counts despite having fewer parameters.
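The padding cost is easy to see with a little arithmetic: the hardware effectively processes whole tiles, so a layer's effective width is its channel count rounded up to the tile size. A standalone sketch (the function name `padded_width` is illustrative, not from the library):

```python
import math

def padded_width(channels, tile):
    """Effective width when hardware processes whole tiles of `tile` elements,
    zero-padding the last partial tile."""
    return math.ceil(channels / tile) * tile

# A layer pruned from 128 to 121 channels still occupies the same number
# of fp16 tiles as before -- fewer parameters, but no tile savings:
print(padded_width(121, 8), padded_width(128, 8))  # 128 128
```

This is why an unaligned pruned layer can run no faster, or even slower, than the original aligned one.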
The round_to parameter in BasePruner forces the pruner to round the post-pruning channel count down to the nearest multiple of the specified value. This ensures hardware-friendly alignment is maintained throughout the pruning process.
Usage
Use this heuristic whenever you want to achieve real latency reduction from pruning, not just FLOPs reduction. It is especially important when:
- Deploying pruned models for inference on NVIDIA GPUs
- Measuring latency improvements with `measure_latency()`
- Pruning Vision Transformers where head dimensions must remain divisible by `num_heads`
The Insight (Rule of Thumb)
- Action: Set the `round_to` parameter when creating the pruner.
- Value:
  - `round_to=8` for fp16 / float16 inference (Tensor Core alignment)
  - `round_to=4` for tf32 / float32 inference
  - `round_to=num_heads` for Vision Transformers (ensures head_dim remains an integer)
  - `round_to=4` for LLMs (used in prune_llm.py)
- Trade-off: Slightly coarser pruning granularity; the actual pruning ratio may differ slightly from the target because channel counts are rounded down to a multiple of `round_to`.
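The trade-off can be quantified: compute the kept channel count, round it down, and compare the achieved ratio to the target. A small illustration with hypothetical numbers (the helper `achieved_ratio` is not part of the library):

```python
def achieved_ratio(channels, target_ratio, round_to):
    """Pruning ratio actually achieved after rounding the kept width down."""
    kept = channels - int(channels * target_ratio)
    kept_aligned = kept - kept % round_to  # round down to a multiple of round_to
    return 1 - kept_aligned / channels

# A 30% target on a 192-channel layer with round_to=8 keeps 128 channels,
# so the achieved ratio is 1 - 128/192, i.e. one third rather than 0.30:
print(achieved_ratio(192, 0.3, 8))
```

In practice the gap is small for wide layers and grows for narrow ones, where one tile is a large fraction of the width.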
Reasoning
NVIDIA Tensor Cores on Ampere (A100) and Hopper (H100) architectures operate on tiles of 8 (fp16) or 4 (tf32) elements. When channel counts are not multiples of these values, the GPU must pad with zeros, wasting bandwidth and compute. The NVIDIA Deep Learning Performance Guide explicitly recommends aligning channels/filters to multiples of 8 for optimal throughput.
For Vision Transformers, the total number of channels in QKV projections must be divisible by num_heads to compute per-head attention. Setting round_to=num_heads ensures the pruned model remains a valid Transformer.
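Concretely, with 12 heads a naively pruned embedding width usually fails the divisibility check, while rounding to a multiple of `num_heads` keeps head_dim an integer. A standalone sketch with ViT-Base-like numbers (chosen for illustration):

```python
num_heads = 12
embed_dim = 768

# Naive pruning removes 26 channels -> 742, which is not divisible by 12,
# so the QKV output cannot be reshaped into equal per-head slices.
naive = embed_dim - 26
print(naive % num_heads)  # nonzero remainder: invalid head_dim

# With round_to=num_heads the kept width is rounded down to a multiple of 12.
aligned = naive - naive % num_heads
print(aligned, aligned // num_heads)  # 732 channels -> head_dim 61
```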
Code Evidence
Rounding implementation from torch_pruning/pruner/algorithms/base_pruner.py:428-432:
```python
def _round_to(self, n_pruned, current_channels, round_to):
    rounded_channels = current_channels - n_pruned
    rounded_channels = rounded_channels - rounded_channels % round_to
    n_pruned = current_channels - rounded_channels
    return max(n_pruned, 0)
```
ViT-specific rounding from reproduce/main_imagenet.py:153-154:
```python
if 'vit' in args.model:
    round_to = model.encoder.layers[0].num_heads
```
fp16 vs tf32 rounding from examples/latency/measure_latency.py:93:
```python
round_to=8  # for fp16
round_to=4  # for tf32
```
LLM rounding from examples/LLMs/prune_llm.py:329:
```python
round_to=4
```
timm model rounding from examples/timm_models/prune_timm_models.py:101:
```python
round_to=8
```