Heuristic: VainF Torch-Pruning GQA Head Pruning Constraints
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Compression, Optimization |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
When pruning LLMs with Grouped Query Attention, the pruning ratio must be a multiple of (num_key_value_heads / num_attention_heads) for HuggingFace compatibility.
Description
Modern LLMs like Qwen2.5, DeepSeek-R1-Distill, and Llama-3 use Grouped Query Attention (GQA), where the number of key/value heads is smaller than the number of query heads. For example, Qwen2.5-7B has 28 query heads and 4 KV heads. When pruning heads, the remaining head count must still satisfy the GQA divisibility constraint (num_attention_heads % num_key_value_heads == 0) for the model to be valid.
Furthermore, HuggingFace Transformers requires that q_proj.in_features == o_proj.out_features after pruning, which imposes additional constraints on which head counts are valid.
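Both constraints can be checked up front before running a prune job. The sketch below is illustrative (the function name is hypothetical; it assumes the KV head count is kept fixed while query heads are pruned):

```python
from fractions import Fraction

def is_valid_head_pruning_ratio(ratio, num_attention_heads, num_key_value_heads):
    """Check that `ratio` is a multiple of K/N and that the remaining
    query heads divide evenly among the (unchanged) KV heads."""
    step = Fraction(num_key_value_heads, num_attention_heads)  # e.g. 4/28 = 1/7
    r = Fraction(ratio).limit_denominator(num_attention_heads)
    if r <= 0 or r >= 1 or r % step != 0:
        return False
    remaining_q = num_attention_heads - int(r * num_attention_heads)
    return remaining_q % num_key_value_heads == 0
```

For Qwen2.5-7B (28Q / 4KV), `3/7` passes while `0.5` is rejected.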
Usage
Use this heuristic when pruning any LLM with GQA (most modern LLMs). Failure to respect these constraints results in models that cannot be saved or loaded with HuggingFace Transformers.
The Insight (Rule of Thumb)
- Action: Choose a head_pruning_ratio that results in a valid number of remaining heads.
- Value: For a model with N query heads and K KV heads, valid pruning ratios are multiples of K/N. For Qwen2.5-7B (28 heads, 4 KV heads): valid ratios are 1/7, 2/7, 3/7, 4/7, 5/7, 6/7.
- Trade-off: Coarser granularity of pruning ratios compared to non-GQA models; you cannot prune to an arbitrary number of heads.
Example for Qwen2.5-7B (28Q / 4KV):
| Head Pruning Ratio | Remaining Q Heads | Remaining KV Heads | Valid? |
|---|---|---|---|
| 1/7 (~0.143) | 24 | 4 | Yes |
| 2/7 (~0.286) | 20 | 4 | Yes |
| 3/7 (~0.429) | 16 | 4 | Yes (recommended in docs) |
| 4/7 (~0.571) | 12 | 4 | Yes |
Since KV head pruning is disabled under GQA, the KV head count stays at 4; validity requires the remaining query-head count to be divisible by 4, which every multiple of 1/7 guarantees.
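The valid ratios and resulting head counts can be enumerated programmatically. A minimal sketch (the pruning_table helper is hypothetical; it assumes query heads are pruned while the KV head count stays fixed):

```python
from fractions import Fraction

def pruning_table(num_q, num_kv):
    """Valid head-pruning ratios for a GQA model and the head counts
    they leave behind (KV head count assumed fixed)."""
    step = Fraction(num_kv, num_q)        # smallest valid ratio, e.g. 4/28 = 1/7
    rows = []
    for m in range(1, num_q // num_kv):   # remove m * num_kv query heads
        remaining_q = num_q - m * num_kv
        rows.append((str(m * step), remaining_q, num_kv))
    return rows

for row in pruning_table(28, 4):
    print(row)   # ('1/7', 24, 4), ('2/7', 20, 4), ..., ('6/7', 4, 4)
```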
Reasoning
In GQA, each KV head is shared by num_attention_heads / num_key_value_heads query heads. Because the KV head count is left unchanged, the remaining query heads must still divide evenly among the KV heads. Query heads are therefore removed in multiples of num_key_value_heads (the same number from each KV group), which is why valid pruning ratios are multiples of num_key_value_heads / num_attention_heads. If the remaining query-head count is not divisible by the KV head count, the attention computation becomes invalid.
The Torch-Pruning library handles this internally by disabling independent head pruning for the KV layers when GQA is detected (_is_gqa = True); pruning of the KV projections is instead propagated from the associated Q heads through the dependency graph.
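A toy illustration of the grouping arithmetic (the chunked Q-to-KV layout and the kv_head_for helper are assumptions for illustration, not the library's API):

```python
num_q, num_kv = 28, 4           # Qwen2.5-7B
group_size = num_q // num_kv    # 7 query heads share each KV head

def kv_head_for(q_head):
    """Map a query head index to the KV head it shares (chunked layout assumed)."""
    return q_head // group_size

# Removing the same number of query heads from every group keeps the
# grouping uniform: e.g. 3 heads per group (ratio 12/28 = 3/7) leaves
# 16 query heads, and 16 is still divisible by the 4 KV heads.
removed_per_group = 3
remaining_q = num_q - removed_per_group * num_kv
assert remaining_q % num_kv == 0   # 16 % 4 == 0
```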
Code Evidence
GQA constraint documentation from examples/LLMs/readme.md (lines 47-51):
The Qwen2.5-7B & DeepSeek-R1-Distill-Qwen-7B models have 28 heads with num_key_value_heads=4.
This limits the pruning ratio to be multiple of 4/28=1/7 such as [1/7, 2/7, 3/7, 4/7, 5/7, 6/7].
This is a hard constraint if you want to save and load the pruned model using Huggingface Transformers
since HF only supports in_features==out_features in the q_proj and o_proj.
GQA detection and KV head pruning disabling from torch_pruning/pruner/algorithms/base_pruner.py:747:
# disable head pruning for the kv layers if GQA is enabled,
# since they will be shared by multiple Q heads
GQA handling with chunked indices from torch_pruning/pruner/algorithms/base_pruner.py:502:
# GQA: the number of heads for KV might be different from Q (Num_KV<=Num_Q)
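The q_proj/o_proj requirement quoted above can be illustrated with a shape-only sketch (plain-Python stand-ins for torch.nn.Linear; the Qwen2.5-7B-like dimensions hidden=3584, head_dim=128 are assumptions):

```python
from types import SimpleNamespace

def linear(in_features, out_features):
    # shape-only stand-in for torch.nn.Linear
    return SimpleNamespace(in_features=in_features, out_features=out_features)

hidden, head_dim, n_q = 3584, 128, 28   # Qwen2.5-7B-like shapes (assumed)

# Prune 12 of 28 query heads (ratio 3/7): q_proj's output dim and o_proj's
# input dim shrink together, while the hidden dimension is untouched.
kept = (n_q - 12) * head_dim            # 16 heads * 128 = 2048
q_proj = linear(hidden, kept)
o_proj = linear(kept, hidden)

# The HF requirement: q_proj.in_features == o_proj.out_features
assert q_proj.in_features == o_proj.out_features == hidden
```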