Principle:Mit han lab Llm awq Per Channel Scaling Search
Overview
Per-channel scaling search is an optimization technique that finds per-channel scaling factors to minimize quantization error for groups of linear layers sharing the same input activation.
Description
Within each transformer block of a large language model, multiple linear layers often share a common input. For example, the query (Q), key (K), and value (V) projection layers all receive the same output from a preceding LayerNorm. Similarly, the gate and up projections in a feed-forward block share their input.
Per-channel scaling search exploits this structure by jointly optimizing the scaling for all linear layers in a group. The algorithm identifies these groups by tracing which linear layers connect to a common preceding operation (LayerNorm, another linear layer, or an activation function).
For each group, the search proceeds as follows (a minimal code sketch appears after this list):
- Compute activation statistics: From the cached calibration activations, compute the per-channel maximum absolute value, x_max, across all samples for the shared input.
- Grid search over scaling ratios: Iterate over 20 evenly spaced candidate ratios alpha in [0, 1), i.e. alpha = i/20 for i = 0, 1, ..., 19. For each alpha:
- Compute the scaling vector: s = x_max^alpha
- Apply the scaling to the preceding operation's output channels: prev_op.weight /= s
- Absorb the inverse scaling into the linear layers' input channels: linear.weight *= s
- Quantize the modified weights and measure the block-level output MSE against the unquantized block output
- Restore the original weights
- Select optimal alpha: Choose the alpha that produced the minimum MSE across all candidates.
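The following is a minimal PyTorch-style sketch of this search, written under simplifying assumptions: pseudo_quantize stands in for whatever weight quantizer is used, and block(x) abbreviates the block forward pass (a real decoder block may need attention masks or position ids and may return a tuple). It is an illustration of the procedure above, not the llm-awq implementation.

```python
import torch

def pseudo_quantize(w, n_bits=4, group_size=128):
    """Illustrative symmetric round-to-nearest per-group weight quantizer
    (placeholder for the actual quantizer used in the pipeline)."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)           # assumes in_features % group_size == 0
    q_max = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / q_max
    w = (w / scale).round().clamp(-q_max - 1, q_max) * scale
    return w.reshape(orig_shape)

@torch.no_grad()
def search_module_scale(block, prev_op, linears, x, block_output, n_grid=20):
    """Grid search for one group's per-channel scale.

    block        -- transformer block containing the group
    prev_op      -- op (e.g. LayerNorm) whose output feeds every linear in the group
    linears      -- nn.Linear modules sharing the same input
    x            -- cached calibration input to the block
    block_output -- unquantized block output on x (the reference)
    """
    # Per-channel maximum |activation| of the shared input, over all samples
    x_max = x.abs().view(-1, x.shape[-1]).amax(dim=0)

    original = {m: m.weight.data.clone() for m in list(linears) + [prev_op]}
    best_error, best_scales = float("inf"), None

    for i in range(n_grid):
        alpha = i / n_grid
        # s = x_max^alpha, normalized so the scales stay close to 1
        s = x_max.pow(alpha).clamp(min=1e-4)
        s = s / (s.max() * s.min()).sqrt()

        # Equivalent transformation: shrink prev_op's output, grow linear inputs
        prev_op.weight.data.div_(s)   # (a bias on prev_op would be divided by s as well)
        for m in linears:
            m.weight.data.mul_(s.view(1, -1))
            m.weight.data = pseudo_quantize(m.weight.data)  # quantize the scaled weights

        # Block-level MSE against the unquantized reference output
        error = (block(x) - block_output).float().pow(2).mean().item()
        if error < best_error:
            best_error, best_scales = error, s.clone()

        # Restore original weights before trying the next alpha
        for m, w in original.items():
            m.weight.data = w.clone()

    return best_scales
```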
The scaling is equivalent in full precision because dividing the preceding operation's output by s and multiplying the linear layer's weights by s cancel out. However, after quantization, different scalings redistribute quantization error differently across channels, and the optimal scaling concentrates precision on the channels that matter most.
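This cancellation is easy to verify directly. The snippet below, with made-up dimensions, checks that dividing a LayerNorm's parameters by s and multiplying the following linear layer's input channels by s leaves the composed output unchanged up to floating-point error:

```python
import torch

torch.manual_seed(0)
hidden = 8
x = torch.randn(4, hidden)
ln = torch.nn.LayerNorm(hidden)
fc = torch.nn.Linear(hidden, hidden, bias=False)

with torch.no_grad():
    y_ref = fc(ln(x))                    # original output

    s = torch.rand(hidden) + 0.5         # arbitrary positive per-channel scale
    ln.weight.div_(s)                    # divide preceding op's output channels by s
    ln.bias.div_(s)                      # the bias also scales the output, so divide it too
    fc.weight.mul_(s.view(1, -1))        # multiply the linear layer's input channels by s

    y_scaled = fc(ln(x))
    print(torch.allclose(y_ref, y_scaled, atol=1e-5))   # True: identical in full precision
```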
This approach is efficient because:
- The grid search has only 20 candidates per group, making it fast
- The search operates at the block level, capturing interactions between layers within the block
- The scaling modifies existing operations (e.g., LayerNorm weights or a preceding linear layer's weights and bias) rather than introducing new parameters
Usage
Per-channel scaling search is used as a sub-step of the AWQ quantization pipeline. It is called once per transformer block, for each group of linked linear layers within that block.
Typical groups in a LLaMA-style transformer block include:
- Self-attention projections: LayerNorm -> Q, K, V projections
- Attention output: V output -> output projection
- Feed-forward gate/up: LayerNorm -> gate projection, up projection
- Feed-forward down: Up activation -> down projection
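In code, these groups can be written as (preceding op, linear layers to scale) pairs. The sketch below assumes Hugging Face-style LlamaDecoderLayer attribute names (input_layernorm, self_attn.q_proj, mlp.gate_proj, etc.); other model definitions will use different names.

```python
def llama_block_scaling_groups(block):
    """Groups of linear layers sharing an input within one LLaMA-style decoder block,
    expressed as (preceding op, linear layers to scale) pairs."""
    return [
        # input LayerNorm -> Q, K, V projections
        (block.input_layernorm,
         [block.self_attn.q_proj, block.self_attn.k_proj, block.self_attn.v_proj]),
        # V projection output -> attention output projection
        (block.self_attn.v_proj, [block.self_attn.o_proj]),
        # post-attention LayerNorm -> gate and up projections
        (block.post_attention_layernorm,
         [block.mlp.gate_proj, block.mlp.up_proj]),
        # up projection (through the activation) -> down projection
        (block.mlp.up_proj, [block.mlp.down_proj]),
    ]
```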
The search is invoked automatically by the AWQ orchestrator (run_awq) when the auto_scale flag is set to True.
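From the caller's side this looks roughly like the sketch below; the keyword names are illustrative and the exact signature of run_awq should be checked against the llm-awq repository.

```python
# Argument names are illustrative; consult the llm-awq repository for the exact signature.
awq_results = run_awq(
    model,                   # the causal LM to calibrate
    tokenizer,               # used to build the calibration set
    w_bit=4,                 # target weight bit-width
    q_config={"zero_point": True, "q_group_size": 128},
    auto_scale=True,         # enables the per-channel scaling search
)
```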
Theoretical Basis
The mathematical formulation is:
s = x_max^alpha / norm_factor
where x_max is the per-channel maximum activation magnitude and norm_factor normalizes the scales so they stay close to 1, keeping the transformation numerically stable.
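As a concrete example, one possible choice of norm_factor (the geometric mean of the largest and smallest raw scales, which the reference implementation appears to use) can be written as:

```python
import torch

def compute_scales(x_max: torch.Tensor, alpha: float) -> torch.Tensor:
    # s = x_max^alpha, clamped to avoid zeros, then normalized so the scales
    # are centered around 1 (norm_factor = sqrt(s.max() * s.min()))
    s = x_max.pow(alpha).clamp(min=1e-4)
    return s / (s.max() * s.min()).sqrt()
```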
The scaling is applied as an equivalent transformation:
prev_op.weight /= s (divide preceding operation's output by s)
linear.weight *= s (multiply target linear layer's input by s)
The grid search evaluates:
alpha in {0, 1/20, 2/20, 3/20, ..., 19/20}
For each candidate, the block-level quantization error is measured:
MSE = ||block(x) - block_quant(x)||^2
where block(x) is the original block output and block_quant(x) is the output after applying the candidate scaling and quantizing the weights.
The optimal scaling ratio balances two competing effects:
- Higher alpha (closer to 1): Stronger scaling on high-activation channels, which reduces quantization error on salient weights but increases error on non-salient weights.
- Lower alpha (closer to 0): Weaker scaling, approaching the unscaled (round-to-nearest) baseline.
The grid search finds the point where the net effect on block-level MSE is minimized.
Related Pages
- Implementation:Mit_han_lab_Llm_awq_Auto_scale_block
- Heuristic:Mit_han_lab_Llm_awq_AWQ_Grid_Search_Tuning
Knowledge Sources
- Paper|AWQ|https://arxiv.org/abs/2306.00978
Domains
- Quantization
- Optimization