
Heuristic: mit-han-lab/llm-awq Skip QK Projection Clipping

From Leeroopedia
Knowledge Sources
Domains Quantization, LLMs
Last Updated 2026-02-15 01:00 GMT

Overview

Skip weight clipping for Query and Key projections during AWQ's auto_clip phase: the query-key batch matrix multiplication (QK^T) in attention makes precise per-layer clipping of these weights unreliable.

Description

During the weight clipping optimization phase of AWQ, the `auto_clip_block` function intentionally skips all Query (Q) and Key (K) projection layers. This is because Q and K weights interact through a batch matrix multiplication (BMM) in the attention mechanism (`QK^T`), making it difficult to independently clip their weight ranges without unpredictable effects on the dot-product attention scores. Value (V) projections and other linear layers (MLP, output projections) are clipped normally since their outputs flow through element-wise operations where independent clipping is well-defined.
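To make the interaction concrete, here is a toy NumPy sketch (not AWQ code; all names and sizes are illustrative) showing that clipping only the Q weights shifts the post-softmax attention distribution, because Q and K multiply each other in the logits:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q = rng.normal(size=(d, d))   # toy Q projection weights
W_k = rng.normal(size=(d, d))   # toy K projection weights
x = rng.normal(size=(4, d))     # 4 token embeddings

def attn_weights(Wq, Wk):
    # softmax(Q K^T / sqrt(d)) for a single head, no masking
    q, k = x @ Wq, x @ Wk
    logits = q @ k.T / np.sqrt(d)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

base = attn_weights(W_q, W_k)

# Clip Q weights to 60% of their max magnitude; K is left untouched.
bound = 0.6 * np.abs(W_q).max()
clipped = attn_weights(np.clip(W_q, -bound, bound), W_k)

# Per-layer error on Q's own output may look benign, but the
# multiplicative QK^T interaction lets the attention weights drift.
drift = np.abs(base - clipped).max()
```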

Usage

This heuristic is applied automatically in `auto_clip_block()`. Be aware of it when extending AWQ to new model architectures: any linear layer whose name contains `q_`, `k_`, `query`, `key`, or `Wqkv` will be excluded from clipping. If your model uses different naming conventions for Q/K projections, you may need to update the skip list.
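As a sketch of how that skip check could be made extensible for new architectures (the `SKIP_PATTERNS` values mirror the upstream list; `should_skip_clipping` and `extra_patterns` are hypothetical helpers, not AWQ APIs):

```python
# Substring patterns excluded from clipping by auto_clip_block (upstream list).
SKIP_PATTERNS = ["q_", "k_", "query", "key", "Wqkv"]

def should_skip_clipping(layer_name, extra_patterns=()):
    """Return True if a linear layer should be excluded from weight clipping."""
    patterns = list(SKIP_PATTERNS) + list(extra_patterns)
    return any(p in layer_name for p in patterns)
```

A model whose fused projection is named, say, `qkv_proj` would need `extra_patterns=("qkv_proj",)` to be skipped, since none of the default substrings match that name.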

The Insight (Rule of Thumb)

  • Action: Skip clipping for any layer whose name contains one of `["q_", "k_", "query", "key", "Wqkv"]`
  • Value: These projections receive no clipping optimization (original weight ranges preserved)
  • Trade-off: Slightly less aggressive quantization for Q/K projections in exchange for more stable attention computation. The quantization error in V projections and MLP layers contributes more directly to output noise, making clipping there more beneficial.

Reasoning

In a Transformer attention block, the Q and K projections produce vectors that are multiplied together via `softmax(QK^T / sqrt(d))`. Clipping Q weights changes the magnitude of query vectors, and clipping K weights changes key vector magnitudes. Because these interact multiplicatively in the attention score, clipping one without accounting for the other can cause unpredictable shifts in attention patterns. The MSE-based clipping search in `auto_clip_layer` measures error at the individual layer output level, which does not capture these cross-layer BMM interactions. Skipping Q/K clipping avoids this issue entirely while still applying clipping to the majority of linear layers (V projection, output projection, MLP up/down/gate projections).
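A simplified sketch of such a per-layer MSE clip search (assumed behavior only; the real `auto_clip_layer` searches per output channel with grouped quantization) makes the limitation visible: the error it minimizes compares this layer's output before and after clip-plus-quantize, so any downstream QK^T effect is invisible to it:

```python
import numpy as np

def fake_quant(w, n_bits=4):
    # symmetric per-tensor fake quantization (round-trip through the grid)
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale

def search_clip_ratio(w, x, ratios=(1.0, 0.9, 0.8, 0.7, 0.6)):
    # pick the clip ratio minimizing MSE at THIS layer's output only
    ref = x @ w
    best_ratio, best_err = 1.0, np.inf
    for r in ratios:
        bound = r * np.abs(w).max()
        w_q = fake_quant(np.clip(w, -bound, bound))
        err = np.mean((x @ w_q - ref) ** 2)
        if err < best_err:
            best_ratio, best_err = r, err
    return best_ratio

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16))    # toy weight matrix
x = rng.normal(size=(32, 16))    # toy calibration activations
best = search_clip_ratio(w, x)
```

For V projections and MLP layers this layer-local objective is exactly the right one, which is why clipping is kept there.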

This decision is referenced in GitHub PR #67 on the mit-han-lab/llm-awq repository, where scaling of the attention output projection was likewise found to sometimes hurt accuracy.

# From awq/quantize/auto_clip.py:73-76
for name in named_linears:
    # due to qk bmm, it is hard to clip precisely
    if any([_ in name for _ in ["q_", "k_", "query", "key", "Wqkv"]]):
        continue
# From awq/quantize/auto_scale.py:231-239
# Similarly, attention output scaling is commented out for some architectures
# "Please refer to https://github.com/mit-han-lab/llm-awq/pull/67"
