Workflow:VainF Torch Pruning Object Detection Pruning
| Knowledge Sources | |
|---|---|
| Domains | Model_Compression, Structural_Pruning, Object_Detection |
| Last Updated | 2026-02-07 23:30 GMT |
Overview
End-to-end iterative pruning and fine-tuning pipeline for YOLO object detection models (YOLOv5, YOLOv7, YOLOv8), progressively compressing the model while preserving detection accuracy.
Description
This workflow implements an iterative prune-then-fine-tune loop designed for object detection models. Unlike the single-shot pruning often used for classifiers, it prunes detection models in small steps and fine-tunes after each step to maintain detection quality (mAP). It handles several YOLO-specific concerns: replacing C2f modules with a pruning-compatible variant, excluding detection heads from pruning, deriving the per-iteration pruning ratio from a target total pruning rate, and fine-tuning through the Ultralytics training pipeline. Each iteration prunes a fraction of channels and evaluates mAP before and after fine-tuning. An early-stopping check ends the loop once mAP has fallen by more than the maximum allowed drop.
Usage
Execute this workflow when you need to compress a YOLO detection model for deployment on edge devices, mobile platforms, or real-time inference scenarios. This is appropriate when you have a trained YOLOv5/v7/v8 model and need to reduce its computational cost while maintaining acceptable detection accuracy.
Execution Steps
Step 1: Load trained YOLO model and prepare architecture
Load the pretrained YOLO model from a checkpoint. For YOLOv8, replace C2f modules with a pruning-compatible C2f_v2 variant that splits the initial convolution into two separate convolutions, making the architecture amenable to structural pruning. Re-initialize batch normalization parameters and enable gradients for all parameters.
Key considerations:
- The C2f module uses chunk operations that are difficult to prune; C2f_v2 replaces these with explicit separate convolutions
- Weight transfer from C2f to C2f_v2 must correctly split the first convolution's weights by channel
- Initialize BN epsilon, momentum, and ReLU inplace settings after module replacement
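The weight split can be illustrated without any deep learning framework. The sketch below models a bias-free 1x1 convolution at a single pixel as a matrix-vector product and checks that convolving once and chunking the output gives the same result as two separate convolutions built from the split weights. The helper names (`conv1x1`, `split_conv_weight`) and the toy numbers are assumptions made for illustration; they are not part of Torch-Pruning or Ultralytics.

```python
def conv1x1(weight, x):
    """Apply a bias-free 1x1 convolution to a single pixel.

    weight: list of output-channel rows, each holding len(x) input weights.
    x: list of input-channel values.
    """
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def split_conv_weight(weight):
    """Split a conv weight along the output-channel axis into two halves."""
    if len(weight) % 2 != 0:
        raise ValueError("output channel count must be even")
    half = len(weight) // 2
    return weight[:half], weight[half:]


# Toy weight of the original first convolution: 4 output, 3 input channels.
original_weight = [
    [1.0, 0.0, 2.0],
    [0.5, -1.0, 0.0],
    [0.0, 3.0, 1.0],
    [2.0, 2.0, -2.0],
]
x = [1.0, 2.0, 3.0]

# C2f style: one convolution, then chunk the output into two halves.
y = conv1x1(original_weight, x)
chunk_a, chunk_b = y[:2], y[2:]

# C2f_v2 style: two separate convolutions built from the split weights.
weight_a, weight_b = split_conv_weight(original_weight)
out_a = conv1x1(weight_a, x)
out_b = conv1x1(weight_b, x)

# Both formulations produce identical activations.
assert out_a == chunk_a and out_b == chunk_b
```

In a real model, any per-channel tensors attached to that convolution (for example batch normalization scale, shift, and running statistics) would need the same split along the channel axis.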
Step 2: Establish baseline metrics
Run validation on the full (unpruned) model to establish baseline mAP, MACs, and parameter count. These serve as reference points for measuring compression progress and quality degradation across pruning iterations.
Key considerations:
- Use the same validation dataset and settings that will be used for post-pruning evaluation
- Record baseline_macs and baseline_nparams for computing compression ratios
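One way to turn the recorded baseline into compression figures is sketched below. `baseline_macs` and `baseline_nparams` come from the text above; the function name, the returned keys, and the example numbers are illustrative assumptions, not values measured on any model.

```python
def compression_summary(baseline_macs, baseline_nparams,
                        pruned_macs, pruned_nparams):
    """Return speed-up and reduction figures relative to the baseline."""
    counts = (baseline_macs, baseline_nparams, pruned_macs, pruned_nparams)
    if min(counts) <= 0:
        raise ValueError("all counts must be positive")
    return {
        # Theoretical speed-up: how many times fewer MACs the pruned model needs.
        "speed_up": baseline_macs / pruned_macs,
        # Fraction of MACs and parameters removed by pruning.
        "macs_reduction": 1.0 - pruned_macs / baseline_macs,
        "params_reduction": 1.0 - pruned_nparams / baseline_nparams,
    }


# Made-up counts, purely for illustration.
summary = compression_summary(
    baseline_macs=8.0e9, baseline_nparams=3.0e6,
    pruned_macs=4.0e9, pruned_nparams=1.5e6,
)
print(summary)
```

The speed-up here is a MACs ratio, so it is a theoretical figure; measured latency on a given device may differ.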
Step 3: Compute per-iteration pruning ratio
Calculate the pruning ratio for each iteration so that the compounded pruning over all iterations matches the target rate. Each step removes the same fraction of whatever remains, so (1 - per_step_ratio)^iterative_steps = 1 - target_pruning_rate, which gives per_step_ratio = 1 - (1 - target_pruning_rate)^(1/iterative_steps).
Pseudocode:
per_step_ratio = 1 - (1 - target_pruning_rate) ^ (1 / iterative_steps)
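The pseudocode translates directly into Python, where exponentiation is written `**` (the `^` operator is bitwise XOR). The sketch below also includes a small check that compounding the per-step ratio recovers the target; the function names and the example values are assumptions chosen for illustration.

```python
def per_step_pruning_ratio(target_pruning_rate, iterative_steps):
    """Ratio to prune at each step so the compounded total hits the target."""
    if not 0.0 <= target_pruning_rate < 1.0:
        raise ValueError("target_pruning_rate must be in [0, 1)")
    if iterative_steps < 1:
        raise ValueError("iterative_steps must be at least 1")
    return 1.0 - (1.0 - target_pruning_rate) ** (1.0 / iterative_steps)


def cumulative_pruned_fraction(per_step_ratio, steps):
    """Total fraction removed after pruning `per_step_ratio` of what remains, `steps` times."""
    return 1.0 - (1.0 - per_step_ratio) ** steps


# Example values (assumed): prune 50% in total over 16 iterations.
ratio = per_step_pruning_ratio(target_pruning_rate=0.5, iterative_steps=16)
total = cumulative_pruned_fraction(ratio, steps=16)

# Compounding the per-step ratio recovers the target up to rounding.
assert abs(total - 0.5) < 1e-9
print(f"per-step ratio: {ratio:.4f}, cumulative: {total:.4f}")
```

Note that the per-step ratio is smaller than the naive `target_pruning_rate / iterative_steps`, because each step applies to an already reduced model.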
Step 4: Execute iterative prune-finetune loop
For each iteration: create a GroupNormPruner with the per-step pruning ratio, ignore detection head layers (Detect modules), execute pruning, validate mAP on the pruned (not yet fine-tuned) model, fine-tune for a configured number of epochs using the Ultralytics training pipeline, then validate again to measure recovered mAP. Track all metrics across iterations.
Key considerations:
- Ignore Detect modules to preserve the detection output structure
- Fine-tuning epochs per iteration are typically shorter than original training (e.g., 10 epochs)
- Delete the pruner after each iteration to free memory
- Stop early once the mAP drop relative to the baseline exceeds the maximum allowed drop
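The control flow of this loop can be sketched independently of any pruning library. In the sketch below, `prune_step`, `validate`, and `finetune` are hypothetical callables standing in for the pruner and the training pipeline, and the stopping rule compares the baseline against the recovered (post-fine-tuning) mAP, which is an assumption about where the check is applied. The stub functions and numbers only simulate a model so the example runs on its own.

```python
def run_iterative_pruning(prune_step, validate, finetune,
                          baseline_map, iterative_steps, max_map_drop):
    """Drive the prune -> validate -> fine-tune -> validate loop.

    prune_step, validate, and finetune are hypothetical callables supplied
    by the caller; they are placeholders, not a real library API.
    """
    history = []
    for step in range(iterative_steps):
        prune_step(step)
        pruned_map = validate()      # mAP right after pruning
        finetune(step)
        recovered_map = validate()   # mAP after fine-tuning
        history.append({
            "step": step,
            "pruned_map": pruned_map,
            "recovered_map": recovered_map,
        })
        # Assumed rule: stop when the recovered mAP has dropped too far.
        if baseline_map - recovered_map > max_map_drop:
            break
    return history


# Stub "model" whose mAP falls when pruned and partly recovers when tuned.
state = {"map": 0.50}

def fake_prune(step):
    state["map"] -= 0.10

def fake_finetune(step):
    state["map"] += 0.07

def fake_validate():
    return state["map"]

history = run_iterative_pruning(
    fake_prune, fake_validate, fake_finetune,
    baseline_map=0.50, iterative_steps=5, max_map_drop=0.05,
)
for record in history:
    print(record)
```

With these simulated numbers the loop ends before all five iterations run, because the net loss per iteration accumulates past the allowed drop. Passing the steps in as callables keeps the stopping logic testable without a GPU or dataset.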
Step 5: Export and visualize results
After completing all pruning iterations (or early stopping), export the final pruned model to ONNX format for deployment. Generate a performance visualization graph showing mAP recovery, pruned mAP, and MACs reduction across all pruning steps.
Key considerations:
- The performance graph plots recovered mAP, pruned mAP (before fine-tuning), and MACs on a dual-axis chart
- ONNX export enables deployment on various inference runtimes
- Compare final model size and speed against the baseline